[jira] [Updated] (SPARK-24008) SQL/Hive Context fails with NullPointerException

2018-04-17 Thread Prabhu Joseph (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prabhu Joseph updated SPARK-24008:
--
Attachment: Repro

> SQL/Hive Context fails with NullPointerException 
> -
>
> Key: SPARK-24008
> URL: https://issues.apache.org/jira/browse/SPARK-24008
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.3
>Reporter: Prabhu Joseph
>Priority: Major
> Attachments: Repro
>
>
> SQL / Hive Context fails with a NullPointerException while getting a 
> configuration value from SQLConf. This happens when the MemoryStore fills up with 
> many broadcasts and starts dropping blocks, and a SQL / Hive Context is then 
> created and broadcast. Using this context to access a table then fails with the 
> NullPointerException below.
> A repro is attached - a Spark example which fills the MemoryStore with 
> broadcasts and then creates and accesses a SQL Context.
> {code}
> java.lang.NullPointerException
> at org.apache.spark.sql.SQLConf.getConf(SQLConf.scala:638)
> at 
> org.apache.spark.sql.SQLConf.defaultDataSourceName(SQLConf.scala:558)
> at 
> org.apache.spark.sql.DataFrameReader.(DataFrameReader.scala:362)
> at org.apache.spark.sql.SQLContext.read(SQLContext.scala:623)
> at SparkHiveExample$.main(SparkHiveExample.scala:76)
> at SparkHiveExample.main(SparkHiveExample.scala)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> 18/04/06 14:17:42 ERROR ApplicationMaster: User class threw exception: 
> java.lang.NullPointerException 
> java.lang.NullPointerException 
> at org.apache.spark.sql.SQLConf.getConf(SQLConf.scala:638) 
> at org.apache.spark.sql.SQLContext.getConf(SQLContext.scala:153) 
> at 
> org.apache.spark.sql.hive.HiveContext.hiveMetastoreVersion(HiveContext.scala:166)
>  
> at 
> org.apache.spark.sql.hive.HiveContext.metadataHive$lzycompute(HiveContext.scala:258)
>  
> at org.apache.spark.sql.hive.HiveContext.metadataHive(HiveContext.scala:255) 
> at 
> org.apache.spark.sql.hive.HiveContext$$anon$2.(HiveContext.scala:475) 
> at 
> org.apache.spark.sql.hive.HiveContext.catalog$lzycompute(HiveContext.scala:475)
>  
> at org.apache.spark.sql.hive.HiveContext.catalog(HiveContext.scala:474) 
> at org.apache.spark.sql.hive.HiveContext.catalog(HiveContext.scala:90) 
> at org.apache.spark.sql.SQLContext.table(SQLContext.scala:831) 
> at org.apache.spark.sql.SQLContext.table(SQLContext.scala:827) 
> {code}
> The MemoryStore filled up and started dropping blocks.
> {code}
> 18/04/17 08:03:43 INFO MemoryStore: 2 blocks selected for dropping
> 18/04/17 08:03:43 INFO MemoryStore: Block broadcast_14 stored as values in 
> memory (estimated size 78.1 MB, free 64.4 MB)
> 18/04/17 08:03:43 INFO MemoryStore: Block broadcast_14_piece0 stored as bytes 
> in memory (estimated size 1522.0 B, free 64.4 MB)
> 18/04/17 08:03:43 INFO MemoryStore: Block broadcast_15 stored as values in 
> memory (estimated size 350.9 KB, free 64.1 MB)
> 18/04/17 08:03:43 INFO MemoryStore: Block broadcast_15_piece0 stored as bytes 
> in memory (estimated size 29.9 KB, free 64.0 MB)
> 18/04/17 08:03:43 INFO MemoryStore: 10 blocks selected for dropping
> 18/04/17 08:03:43 INFO MemoryStore: Block broadcast_16 stored as values in 
> memory (estimated size 78.1 MB, free 64.7 MB)
> 18/04/17 08:03:43 INFO MemoryStore: Block broadcast_16_piece0 stored as bytes 
> in memory (estimated size 1522.0 B, free 64.7 MB)
> 18/04/17 08:03:43 INFO MemoryStore: Block broadcast_1 stored as values in 
> memory (estimated size 136.0 B, free 64.7 MB)
> 18/04/17 08:03:44 INFO MemoryStore: MemoryStore cleared
> 18/04/17 08:03:20 INFO MemoryStore: MemoryStore started with capacity 511.1 MB
> 18/04/17 08:03:44 INFO MemoryStore: MemoryStore cleared
> 18/04/17 08:03:57 INFO MemoryStore: MemoryStore started with capacity 511.1 MB
> 18/04/17 08:04:23 INFO MemoryStore: MemoryStore cleared
> {code}






[jira] [Created] (SPARK-24008) SQL/Hive Context fails with NullPointerException

2018-04-17 Thread Prabhu Joseph (JIRA)
Prabhu Joseph created SPARK-24008:
-

 Summary: SQL/Hive Context fails with NullPointerException 
 Key: SPARK-24008
 URL: https://issues.apache.org/jira/browse/SPARK-24008
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.6.3
Reporter: Prabhu Joseph


SQL / Hive Context fails with a NullPointerException while getting a configuration 
value from SQLConf. This happens when the MemoryStore fills up with many broadcasts 
and starts dropping blocks, and a SQL / Hive Context is then created and broadcast. 
Using this context to access a table then fails with the NullPointerException below.

A repro is attached - a Spark example which fills the MemoryStore with broadcasts 
and then creates and accesses a SQL Context.
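
For illustration only (the attached Repro is the authoritative example), a minimal 
sketch of the pattern described above; the table name and broadcast sizes are 
invented for the sketch:

{code}
// Minimal sketch, not the attached Repro: fill the MemoryStore with large
// broadcasts until blocks start dropping, then create a SQLContext and access
// a table. Table name and sizes are illustrative only.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object SparkHiveExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("memorystore-npe-repro"))

    // Roughly 80 MB per broadcast; enough of them to force the MemoryStore to drop blocks.
    (1 to 20).foreach(_ => sc.broadcast(new Array[Byte](80 * 1024 * 1024)))

    val sqlContext = new SQLContext(sc)   // created after blocks have started dropping
    sqlContext.table("some_table").show() // fails with the NullPointerException below per this report
  }
}
{code}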

{code}
java.lang.NullPointerException
at org.apache.spark.sql.SQLConf.getConf(SQLConf.scala:638)
at org.apache.spark.sql.SQLConf.defaultDataSourceName(SQLConf.scala:558)
at 
org.apache.spark.sql.DataFrameReader.(DataFrameReader.scala:362)
at org.apache.spark.sql.SQLContext.read(SQLContext.scala:623)
at SparkHiveExample$.main(SparkHiveExample.scala:76)
at SparkHiveExample.main(SparkHiveExample.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)


18/04/06 14:17:42 ERROR ApplicationMaster: User class threw exception: 
java.lang.NullPointerException 
java.lang.NullPointerException 
at org.apache.spark.sql.SQLConf.getConf(SQLConf.scala:638) 
at org.apache.spark.sql.SQLContext.getConf(SQLContext.scala:153) 
at 
org.apache.spark.sql.hive.HiveContext.hiveMetastoreVersion(HiveContext.scala:166)
 
at 
org.apache.spark.sql.hive.HiveContext.metadataHive$lzycompute(HiveContext.scala:258)
 
at org.apache.spark.sql.hive.HiveContext.metadataHive(HiveContext.scala:255) 
at org.apache.spark.sql.hive.HiveContext$$anon$2.(HiveContext.scala:475) 
at 
org.apache.spark.sql.hive.HiveContext.catalog$lzycompute(HiveContext.scala:475) 
at org.apache.spark.sql.hive.HiveContext.catalog(HiveContext.scala:474) 
at org.apache.spark.sql.hive.HiveContext.catalog(HiveContext.scala:90) 
at org.apache.spark.sql.SQLContext.table(SQLContext.scala:831) 
at org.apache.spark.sql.SQLContext.table(SQLContext.scala:827) 

{code}

The MemoryStore filled up and started dropping blocks.
{code}
18/04/17 08:03:43 INFO MemoryStore: 2 blocks selected for dropping
18/04/17 08:03:43 INFO MemoryStore: Block broadcast_14 stored as values in 
memory (estimated size 78.1 MB, free 64.4 MB)
18/04/17 08:03:43 INFO MemoryStore: Block broadcast_14_piece0 stored as bytes 
in memory (estimated size 1522.0 B, free 64.4 MB)
18/04/17 08:03:43 INFO MemoryStore: Block broadcast_15 stored as values in 
memory (estimated size 350.9 KB, free 64.1 MB)
18/04/17 08:03:43 INFO MemoryStore: Block broadcast_15_piece0 stored as bytes 
in memory (estimated size 29.9 KB, free 64.0 MB)
18/04/17 08:03:43 INFO MemoryStore: 10 blocks selected for dropping
18/04/17 08:03:43 INFO MemoryStore: Block broadcast_16 stored as values in 
memory (estimated size 78.1 MB, free 64.7 MB)
18/04/17 08:03:43 INFO MemoryStore: Block broadcast_16_piece0 stored as bytes 
in memory (estimated size 1522.0 B, free 64.7 MB)
18/04/17 08:03:43 INFO MemoryStore: Block broadcast_1 stored as values in 
memory (estimated size 136.0 B, free 64.7 MB)
18/04/17 08:03:44 INFO MemoryStore: MemoryStore cleared
18/04/17 08:03:20 INFO MemoryStore: MemoryStore started with capacity 511.1 MB
18/04/17 08:03:44 INFO MemoryStore: MemoryStore cleared
18/04/17 08:03:57 INFO MemoryStore: MemoryStore started with capacity 511.1 MB
18/04/17 08:04:23 INFO MemoryStore: MemoryStore cleared
{code}







[jira] [Commented] (SPARK-23843) Deploy yarn meets incorrect LOCALIZED_CONF_DIR

2018-04-17 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16441930#comment-16441930
 ] 

Saisai Shao commented on SPARK-23843:
-

I think this issue is due to your "new Hadoop-compatible filesystem". I cannot 
reproduce this issue with HDFS. 

> Deploy yarn meets incorrect LOCALIZED_CONF_DIR
> --
>
> Key: SPARK-23843
> URL: https://issues.apache.org/jira/browse/SPARK-23843
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy
>Affects Versions: 2.3.0
> Environment: spark-2.3.0-bin-hadoop2.7
>Reporter: zhoutai.zt
>Priority: Major
>
> We have implemented a new Hadoop-compatible filesystem and run Spark on it. The 
> command is:
> {quote}./bin/spark-submit --class org.apache.spark.examples.SparkPi --master 
> yarn --deploy-mode cluster --executor-memory 1G --num-executors 1 
> /home/hadoop/app/spark-2.3.0-bin-hadoop2.7/examples/jars/spark-examples_2.11-2.3.0.jar
>  10
> {quote}
> The result is:
> {quote}Exception in thread "main" org.apache.spark.SparkException: 
> Application application_1522399820301_0020 finishe
> d with failed status
>  at org.apache.spark.deploy.yarn.Client.run(Client.scala:1159)
> {quote}
> We set log level to DEBUG and find:
> {quote}2018-04-02 09:36:09,603 DEBUG org.apache.spark.deploy.yarn.Client: 
> __app__.jar -> resource \{ scheme: "dfs" host: 
> "f-63a47d43wh98.cn-neimeng-env10-d01.dfs.aliyuncs.com" port: 10290 file: 
> "/user/hadoop/.sparkStaging/application_1522399820301_0006/spark-examples_2.11-2.3.0.jar"
>  } size: 1997548 timestamp: 1522632978000 type: FILE visibility: PRIVATE
> 2018-04-02 09:36:09,603 DEBUG org.apache.spark.deploy.yarn.Client: 
> __spark_libs__ -> resource \{ scheme: "dfs" host: 
> "f-63a47d43wh98.cn-neimeng-env10-d01.dfs.aliyuncs.com" port: 10290 file: 
> "/user/hadoop/.sparkStaging/application_1522399820301_0006/__spark_libs__924155631753698276.zip"
>  } size: 242801307 timestamp: 1522632977000 type: ARCHIVE visibility: PRIVATE
> 2018-04-02 09:36:09,603 DEBUG org.apache.spark.deploy.yarn.Client: 
> __spark_conf__ -> resource \{ port: -1 file: 
> "/user/hadoop/.sparkStaging/application_1522399820301_0006/__spark_conf__.zip"
>  } size: 185531 timestamp: 1522632978000 type: ARCHIVE visibility: PRIVATE
> {quote}
> As shown, the information for __app__.jar and __spark_libs__ is correct, but 
> __spark_conf__ has no scheme or port.
> Exploring the source code, addResource appears twice in Client.scala:
> {code:java}
> val destPath = copyFileToRemote(destDir, localPath, replication, symlinkCache)
> val destFs = FileSystem.get(destPath.toUri(), hadoopConf)
> distCacheMgr.addResource(
> destFs, hadoopConf, destPath, localResources, resType, linkname, statCache,
> appMasterOnly = appMasterOnly)
> {code}
> {code:java}
> val remoteConfArchivePath = new Path(destDir, LOCALIZED_CONF_ARCHIVE)
> val remoteFs = FileSystem.get(remoteConfArchivePath.toUri(), hadoopConf)
> sparkConf.set(CACHED_CONF_ARCHIVE, remoteConfArchivePath.toString())
> val localConfArchive = new Path(createConfArchive().toURI())
> copyFileToRemote(destDir, localConfArchive, replication, symlinkCache,
>   force = true, destName = Some(LOCALIZED_CONF_ARCHIVE))
> // Manually add the config archive to the cache manager so that the AM is
> // launched with the proper files set up.
> distCacheMgr.addResource(remoteFs, hadoopConf, remoteConfArchivePath,
>   localResources, LocalResourceType.ARCHIVE, LOCALIZED_CONF_DIR, statCache,
>   appMasterOnly = false)
> {code}
> As shown in the source code, the destination paths are constructed differently. 
> This is confirmed by debug logging we added ourselves:
> {quote}2018-04-02 15:18:46,357 ERROR 
> org.apache.hadoop.yarn.util.ConverterUtils: getYarnUrlFromURI 
> URI:/user/root/.sparkStaging/application_1522399820301_0020/__spark_conf__.zip
> 2018-04-02 15:18:46,357 ERROR org.apache.hadoop.yarn.util.ConverterUtils: 
> getYarnUrlFromURI URL:null; 
> null;-1;null;/user/root/.sparkStaging/application_1522399820301_0020/__spark_conf__.zip{quote}
> Log messages on YARN NM:
> {quote}2018-04-02 09:36:11,958 WARN 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl:
>  Failed to parse resource-request
> java.net.URISyntaxException: Expected scheme name at index 0: 
> :///user/hadoop/.sparkStaging/application_1522399820301_0006/__spark_conf__.zip
> {quote}






[jira] [Commented] (SPARK-23340) Upgrade Apache ORC to 1.4.3

2018-04-17 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16441926#comment-16441926
 ] 

Apache Spark commented on SPARK-23340:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/21093

> Upgrade Apache ORC to 1.4.3
> ---
>
> Key: SPARK-23340
> URL: https://issues.apache.org/jira/browse/SPARK-23340
> Project: Spark
>  Issue Type: Bug
>  Components: Build, SQL
>Affects Versions: 2.3.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 2.4.0
>
>
> This issue updates the Apache ORC dependencies to 1.4.3, released on February 9th.
> The Apache ORC 1.4.2 release removes unnecessary dependencies, and 1.4.3 has 5 
> more patches including bug fixes (https://s.apache.org/Fll8).
> In particular, ORC-285, shown below, is fixed in 1.4.3.
> {code}
> scala> val df = Seq(Array.empty[Float]).toDF()
> scala> df.write.format("orc").save("/tmp/floatarray")
> scala> spark.read.orc("/tmp/floatarray")
> res1: org.apache.spark.sql.DataFrame = [value: array<float>]
> scala> spark.read.orc("/tmp/floatarray").show()
> 18/02/12 22:09:10 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
> java.io.IOException: Error reading file: 
> file:/tmp/floatarray/part-0-9c0b461b-4df1-4c23-aac1-3e4f349ac7d6-c000.snappy.orc
>   at 
> org.apache.orc.impl.RecordReaderImpl.nextBatch(RecordReaderImpl.java:1191)
>   at 
> org.apache.orc.mapreduce.OrcMapreduceRecordReader.ensureBatch(OrcMapreduceRecordReader.java:78)
> ...
> Caused by: java.io.EOFException: Read past EOF for compressed stream Stream 
> for column 2 kind DATA position: 0 length: 0 range: 0 offset: 0 limit: 0
> {code}






[jira] [Assigned] (SPARK-23984) PySpark Bindings for K8S

2018-04-17 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23984:


Assignee: Apache Spark

> PySpark Bindings for K8S
> 
>
> Key: SPARK-23984
> URL: https://issues.apache.org/jira/browse/SPARK-23984
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes, PySpark
>Affects Versions: 2.3.0
>Reporter: Ilan Filonenko
>Assignee: Apache Spark
>Priority: Major
>
> This ticket tracks the ongoing work of moving the upstream work from 
> [https://github.com/apache-spark-on-k8s/spark], specifically the Python 
> bindings for Spark on Kubernetes. 
> The points of focus are: dependency management, increased non-JVM memory 
> overhead default values, and modified Docker images to include Python 
> support. 






[jira] [Commented] (SPARK-23984) PySpark Bindings for K8S

2018-04-17 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16441921#comment-16441921
 ] 

Apache Spark commented on SPARK-23984:
--

User 'ifilonenko' has created a pull request for this issue:
https://github.com/apache/spark/pull/21092

> PySpark Bindings for K8S
> 
>
> Key: SPARK-23984
> URL: https://issues.apache.org/jira/browse/SPARK-23984
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes, PySpark
>Affects Versions: 2.3.0
>Reporter: Ilan Filonenko
>Priority: Major
>
> This ticket tracks the ongoing work of moving the upstream work from 
> [https://github.com/apache-spark-on-k8s/spark], specifically the Python 
> bindings for Spark on Kubernetes. 
> The points of focus are: dependency management, increased non-JVM memory 
> overhead default values, and modified Docker images to include Python 
> support. 






[jira] [Assigned] (SPARK-23984) PySpark Bindings for K8S

2018-04-17 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23984:


Assignee: (was: Apache Spark)

> PySpark Bindings for K8S
> 
>
> Key: SPARK-23984
> URL: https://issues.apache.org/jira/browse/SPARK-23984
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes, PySpark
>Affects Versions: 2.3.0
>Reporter: Ilan Filonenko
>Priority: Major
>
> This ticket tracks the ongoing work of moving the upstream work from 
> [https://github.com/apache-spark-on-k8s/spark], specifically the Python 
> bindings for Spark on Kubernetes. 
> The points of focus are: dependency management, increased non-JVM memory 
> overhead default values, and modified Docker images to include Python 
> support. 






[jira] [Commented] (SPARK-23830) Spark on YARN in cluster deploy mode fail with NullPointerException when a Spark application is a Scala class not object

2018-04-17 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16441919#comment-16441919
 ] 

Saisai Shao commented on SPARK-23830:
-

What is the reason to use {{class}} instead of {{object}}? This doesn't seem to 
follow our convention for writing a Spark application. I don't think we need to 
support this.
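
For reference, a minimal example of the conventional entry point (an {{object}} 
rather than a {{class}}), whose {{main}} compiles to a static method that cluster 
deploy mode can invoke; names here are illustrative:

{code}
// Conventional Spark application skeleton: main lives on an object, so it
// compiles to a static method that spark-submit / the YARN AM can invoke.
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

object MyApp {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .config(new SparkConf().setAppName("my-test"))
      .getOrCreate()
    try {
      // application logic goes here
    } finally {
      spark.stop()
    }
  }
}
{code}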

> Spark on YARN in cluster deploy mode fail with NullPointerException when a 
> Spark application is a Scala class not object
> 
>
> Key: SPARK-23830
> URL: https://issues.apache.org/jira/browse/SPARK-23830
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 2.3.0
>Reporter: Jacek Laskowski
>Priority: Trivial
>
> As reported on StackOverflow in [Why does Spark on YARN fail with “Exception 
> in thread ”Driver“ 
> java.lang.NullPointerException”?|https://stackoverflow.com/q/49564334/1305344]
>  the following Spark application fails with {{Exception in thread "Driver" 
> java.lang.NullPointerException}} with Spark on YARN in cluster deploy mode:
> {code}
> class MyClass {
>   def main(args: Array[String]): Unit = {
> val c = new MyClass()
> c.process()
>   }
>   def process(): Unit = {
> val sparkConf = new SparkConf().setAppName("my-test")
> val sparkSession: SparkSession = 
> SparkSession.builder().config(sparkConf).getOrCreate()
> import sparkSession.implicits._
> 
>   }
>   ...
> }
> {code}
> The exception is as follows:
> {code}
> 18/03/29 20:07:52 INFO ApplicationMaster: Starting the user application in a 
> separate Thread
> 18/03/29 20:07:52 INFO ApplicationMaster: Waiting for spark context 
> initialization...
> Exception in thread "Driver" java.lang.NullPointerException
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at 
> org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:637)
> {code}
> I think the reason for the exception {{Exception in thread "Driver" 
> java.lang.NullPointerException}} is due to [the following 
> code|https://github.com/apache/spark/blob/v2.3.0/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala#L700-L701]:
> {code}
> val mainMethod = userClassLoader.loadClass(args.userClass)
>   .getMethod("main", classOf[Array[String]])
> {code}
> So when {{mainMethod}} is used in [the following 
> code|https://github.com/apache/spark/blob/v2.3.0/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala#L706]
>  it throws an NPE, because {{main}} on a class (unlike an object) is an instance 
> method and is invoked here with a {{null}} target.
> {code}
> mainMethod.invoke(null, userArgs.toArray)
> {code}
> That could easily be avoided with an extra check on {{mainMethod}}, giving the 
> user a message about what the likely reason is.






[jira] [Commented] (SPARK-24001) Multinode cluster

2018-04-17 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16441918#comment-16441918
 ] 

Saisai Shao commented on SPARK-24001:
-

Questions should go to the mailing list.

> Multinode cluster 
> --
>
> Key: SPARK-24001
> URL: https://issues.apache.org/jira/browse/SPARK-24001
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Direselign
>Priority: Major
> Attachments: Screenshot from 2018-04-17 22-47-39.png
>
>
> I was trying to configure an Apache Spark cluster on two Ubuntu 16.04 machines 
> using YARN, but it is not working when I submit a task to the cluster.






[jira] [Resolved] (SPARK-24001) Multinode cluster

2018-04-17 Thread Saisai Shao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao resolved SPARK-24001.
-
Resolution: Invalid

> Multinode cluster 
> --
>
> Key: SPARK-24001
> URL: https://issues.apache.org/jira/browse/SPARK-24001
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Direselign
>Priority: Major
> Attachments: Screenshot from 2018-04-17 22-47-39.png
>
>
> I was trying to configure an Apache Spark cluster on two Ubuntu 16.04 machines 
> using YARN, but it is not working when I submit a task to the cluster.






[jira] [Updated] (SPARK-24007) EqualNullSafe for FloatType and DoubleType might generate a wrong result by codegen.

2018-04-17 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24007?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-24007:

Labels: correctness  (was: )

> EqualNullSafe for FloatType and DoubleType might generate a wrong result by 
> codegen.
> 
>
> Key: SPARK-24007
> URL: https://issues.apache.org/jira/browse/SPARK-24007
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.2, 2.2.1, 2.3.0
>Reporter: Takuya Ueshin
>Assignee: Takuya Ueshin
>Priority: Major
>  Labels: correctness
>
> {{EqualNullSafe}} for {{FloatType}} and {{DoubleType}} might generate a wrong 
> result by codegen.
> {noformat}
> scala> val df = Seq((Some(-1.0d), None), (None, Some(-1.0d))).toDF()
> df: org.apache.spark.sql.DataFrame = [_1: double, _2: double]
> scala> df.show()
> +++
> |  _1|  _2|
> +++
> |-1.0|null|
> |null|-1.0|
> +++
> scala> df.filter("_1 <=> _2").show()
> +++
> |  _1|  _2|
> +++
> |-1.0|null|
> |null|-1.0|
> +++
> {noformat}
> The result should be empty, but two rows remain.






[jira] [Assigned] (SPARK-24007) EqualNullSafe for FloatType and DoubleType might generate a wrong result by codegen.

2018-04-17 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24007?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li reassigned SPARK-24007:
---

Assignee: Takuya Ueshin

> EqualNullSafe for FloatType and DoubleType might generate a wrong result by 
> codegen.
> 
>
> Key: SPARK-24007
> URL: https://issues.apache.org/jira/browse/SPARK-24007
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.2, 2.2.1, 2.3.0
>Reporter: Takuya Ueshin
>Assignee: Takuya Ueshin
>Priority: Major
>
> {{EqualNullSafe}} for {{FloatType}} and {{DoubleType}} might generate a wrong 
> result by codegen.
> {noformat}
> scala> val df = Seq((Some(-1.0d), None), (None, Some(-1.0d))).toDF()
> df: org.apache.spark.sql.DataFrame = [_1: double, _2: double]
> scala> df.show()
> +++
> |  _1|  _2|
> +++
> |-1.0|null|
> |null|-1.0|
> +++
> scala> df.filter("_1 <=> _2").show()
> +++
> |  _1|  _2|
> +++
> |-1.0|null|
> |null|-1.0|
> +++
> {noformat}
> The result should be empty, but two rows remain.






[jira] [Created] (SPARK-24006) ExecutorAllocationManager.onExecutorAdded is an O(n) operation

2018-04-17 Thread Xianjin YE (JIRA)
Xianjin YE created SPARK-24006:
--

 Summary: ExecutorAllocationManager.onExecutorAdded is an O(n) 
operation
 Key: SPARK-24006
 URL: https://issues.apache.org/jira/browse/SPARK-24006
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 2.3.0
Reporter: Xianjin YE


ExecutorAllocationManager.onExecutorAdded is an O(n) operation. I believe it will 
be a problem when scaling out with a large number of executors, as it effectively 
makes adding N executors O(N^2) overall.

 

I propose to invoke onExecutorIdle guarded by 
{code:java}
if (executorIds.size - executorsPendingToRemove.size >= minNumExecutors + 1) {
  // Since we only need to re-mark idle executors when near the lower bound
  executorIds.filter(listener.isExecutorIdle).foreach(onExecutorIdle)
}{code}
cc [~zsxwing]
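
For illustration, a standalone sketch (not the actual ExecutorAllocationManager 
code; all names are invented) of why O(n) work in the per-executor-added callback 
makes registering N executors O(N^2) overall:

{code}
// Standalone illustration only: if each onExecutorAdded callback touches every
// known executor, adding N executors does 1 + 2 + ... + N ~ N^2 / 2 touches.
object QuadraticAddIllustration {
  def main(args: Array[String]): Unit = {
    val executorIds = scala.collection.mutable.Set[String]()
    var touches = 0L

    def onExecutorAdded(id: String): Unit = {
      executorIds += id
      executorIds.foreach(_ => touches += 1) // O(n) work on every add
    }

    (1 to 10000).foreach(i => onExecutorAdded(s"exec-$i"))
    println(s"total touches: $touches") // 50,005,000 for N = 10000, i.e. quadratic in N
  }
}
{code}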






[jira] [Created] (SPARK-24007) EqualNullSafe for FloatType and DoubleType might generate a wrong result by codegen.

2018-04-17 Thread Takuya Ueshin (JIRA)
Takuya Ueshin created SPARK-24007:
-

 Summary: EqualNullSafe for FloatType and DoubleType might generate 
a wrong result by codegen.
 Key: SPARK-24007
 URL: https://issues.apache.org/jira/browse/SPARK-24007
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.3.0, 2.2.1, 2.1.2, 2.0.2
Reporter: Takuya Ueshin


{{EqualNullSafe}} for {{FloatType}} and {{DoubleType}} might generate a wrong 
result by codegen.

{noformat}
scala> val df = Seq((Some(-1.0d), None), (None, Some(-1.0d))).toDF()
df: org.apache.spark.sql.DataFrame = [_1: double, _2: double]

scala> df.show()
+++
|  _1|  _2|
+++
|-1.0|null|
|null|-1.0|
+++


scala> df.filter("_1 <=> _2").show()
+++
|  _1|  _2|
+++
|-1.0|null|
|null|-1.0|
+++
{noformat}

The result should be empty, but two rows remain.
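
For reference, the null-safe equality semantics that {{<=>}} should implement 
(under which the filter above must return no rows), sketched as plain Scala over 
Option values:

{code}
// Null-safe equality over Option[Double]: both null => true, exactly one null
// => false, otherwise ordinary equality. Applied to the rows above, (-1.0, null)
// and (null, -1.0) both evaluate to false, so the filtered result should be empty.
def eqNullSafe(a: Option[Double], b: Option[Double]): Boolean = (a, b) match {
  case (None, None)       => true
  case (Some(x), Some(y)) => x == y
  case _                  => false
}

assert(!eqNullSafe(Some(-1.0), None))
assert(!eqNullSafe(None, Some(-1.0)))
{code}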







[jira] [Updated] (SPARK-24006) ExecutorAllocationManager.onExecutorAdded is an O(n) operation

2018-04-17 Thread Xianjin YE (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xianjin YE updated SPARK-24006:
---
Description: 
ExecutorAllocationManager.onExecutorAdded is an O(n) operation. I believe it will 
be a problem when scaling out with a large number of executors, as it effectively 
makes adding N executors O(N^2) overall.

 

I propose to invoke onExecutorIdle guarded by 
{code:java}
if (executorIds.size - executorsPendingToRemove.size >= minNumExecutors + 1) {
  // Since we only need to re-mark idle executors when near the lower bound
  executorIds.filter(listener.isExecutorIdle).foreach(onExecutorIdle)
} else {
  onExecutorIdle(executorId)
}{code}
cc [~zsxwing]

  was:
ExecutorAllocationManager.onExecutorAdded is an O(n) operation. I believe it will 
be a problem when scaling out with a large number of executors, as it effectively 
makes adding N executors O(N^2) overall.

 

I propose to invoke onExecutorIdle guarded by 
{code:java}
if (executorIds.size - executorsPendingToRemove.size >= minNumExecutors + 1) {
  // Since we only need to re-mark idle executors when near the lower bound
  executorIds.filter(listener.isExecutorIdle).foreach(onExecutorIdle)
}{code}
cc [~zsxwing]


> ExecutorAllocationManager.onExecutorAdded is an O(n) operation
> --
>
> Key: SPARK-24006
> URL: https://issues.apache.org/jira/browse/SPARK-24006
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Xianjin YE
>Priority: Major
>
> ExecutorAllocationManager.onExecutorAdded is an O(n) operation. I believe it 
> will be a problem when scaling out with a large number of executors, as it 
> effectively makes adding N executors O(N^2) overall.
>  
> I propose to invoke onExecutorIdle guarded by 
> {code:java}
> if (executorIds.size - executorsPendingToRemove.size >= minNumExecutors + 1) {
>   // Since we only need to re-mark idle executors when near the lower bound
>   executorIds.filter(listener.isExecutorIdle).foreach(onExecutorIdle)
> } else {
>   onExecutorIdle(executorId)
> }{code}
> cc [~zsxwing]






[jira] [Commented] (SPARK-23989) When using `SortShuffleWriter`, the data will be overwritten

2018-04-17 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16441906#comment-16441906
 ] 

Saisai Shao commented on SPARK-23989:
-

Please provide a reproducible case.

Did you reuse the object in your code? I think we already handle such cases on 
the Spark side.
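
For context, the object-reuse pitfall being referred to, sketched outside Spark 
internals with an invented Holder class: buffering references handed out by an 
iterator that reuses one mutable object makes every buffered entry see the last 
value, which is why reused rows have to be copied before they are stored.

{code}
// Illustration only (not Spark's shuffle code): an iterator that reuses a single
// mutable holder. Buffering the raw reference stores the same object n times;
// copying before buffering preserves each value.
final class Holder(var value: Int) {
  def copy(): Holder = new Holder(value)
}

def reusingIterator(n: Int): Iterator[Holder] = {
  val shared = new Holder(0)
  Iterator.range(0, n).map { i => shared.value = i; shared }
}

val overwritten = reusingIterator(3).toArray.map(_.value)               // Array(2, 2, 2)
val copied      = reusingIterator(3).map(_.copy()).toArray.map(_.value) // Array(0, 1, 2)
{code}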

> When using `SortShuffleWriter`, the data will be overwritten
> 
>
> Key: SPARK-23989
> URL: https://issues.apache.org/jira/browse/SPARK-23989
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: liuxian
>Priority: Critical
>
> When using `SortShuffleWriter`, we only insert 'AnyRef' into 
> 'PartitionedAppendOnlyMap' or 'PartitionedPairBuffer'.
> For this function:
> {{override def write(records: Iterator[Product2[K, V]])}}
> the value of 'records' is `UnsafeRow`, so the value will be overwritten.






[jira] [Commented] (SPARK-23982) NoSuchMethodException: There is no startCredentialUpdater method in the object YarnSparkHadoopUtil

2018-04-17 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16441881#comment-16441881
 ] 

Saisai Shao commented on SPARK-23982:
-

This method should exist. Would you please paste your exception stack trace here 
(if you really hit this issue)?

> NoSuchMethodException: There is no startCredentialUpdater method in the 
> object YarnSparkHadoopUtil
> --
>
> Key: SPARK-23982
> URL: https://issues.apache.org/jira/browse/SPARK-23982
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: John
>Priority: Major
>
> At line 219 of the CoarseGrainedExecutorBackend class:
> Utils.classForName("org.apache.spark.deploy.yarn.YarnSparkHadoopUtil").getMethod("startCredentialUpdater",
>  classOf[SparkConf]).invoke(null, driverConf)
> But there is no startCredentialUpdater method in the object 
> YarnSparkHadoopUtil.
>  






[jira] [Commented] (SPARK-7132) Add fit with validation set to spark.ml GBT

2018-04-17 Thread Weichen Xu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16441864#comment-16441864
 ] 

Weichen Xu commented on SPARK-7132:
---

I discussed with [~josephkb] and pasted the proposal on JIRA. [~yanboliang], do 
you agree with it, or do you have other thoughts?

> Add fit with validation set to spark.ml GBT
> ---
>
> Key: SPARK-7132
> URL: https://issues.apache.org/jira/browse/SPARK-7132
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> In spark.mllib GradientBoostedTrees, we have a method runWithValidation which 
> takes a validation set.  We should add that to the spark.ml API.
> This will require a bit of thinking about how the Pipelines API should handle 
> a validation set (since Transformers and Estimators only take 1 input 
> DataFrame).  The current plan is to include an extra column in the input 
> DataFrame which indicates whether the row is for training, validation, etc.
> Goals
> A  [P0] Support efficient validation during training
> B  [P1] Support early stopping based on validation metrics
> C  [P0] Ensure validation data are preprocessed identically to training data
> D  [P1] Support complex Pipelines with multiple models using validation data
> Proposal: column with indicator for train vs validation
> Include an extra column in the input DataFrame which indicates whether the 
> row is for training or validation.  Add a Param “validationFlagCol” used to 
> specify the extra column name.
> A, B, C are easy.
> D is doable.
> Each estimator would need to have its validationFlagCol Param set to the same 
> column.
> Complication: It would be ideal if we could prevent different estimators from 
> using different validation sets.  (Joseph: There is not an obvious way IMO.  
> Maybe we can address this later by, e.g., having Pipelines take a 
> validationFlagCol Param and pass that to the sub-models in the Pipeline.  
> Let’s not worry about this for now.)






[jira] [Updated] (SPARK-7132) Add fit with validation set to spark.ml GBT

2018-04-17 Thread Weichen Xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weichen Xu updated SPARK-7132:
--
Description: 
In spark.mllib GradientBoostedTrees, we have a method runWithValidation which 
takes a validation set.  We should add that to the spark.ml API.

This will require a bit of thinking about how the Pipelines API should handle a 
validation set (since Transformers and Estimators only take 1 input DataFrame). 
 The current plan is to include an extra column in the input DataFrame which 
indicates whether the row is for training, validation, etc.

Goals
A  [P0] Support efficient validation during training
B  [P1] Support early stopping based on validation metrics
C  [P0] Ensure validation data are preprocessed identically to training data
D  [P1] Support complex Pipelines with multiple models using validation data


Proposal: column with indicator for train vs validation
Include an extra column in the input DataFrame which indicates whether the row 
is for training or validation.  Add a Param “validationFlagCol” used to specify 
the extra column name.

A, B, C are easy.
D is doable.
Each estimator would need to have its validationFlagCol Param set to the same 
column.
Complication: It would be ideal if we could prevent different estimators from 
using different validation sets.  (Joseph: There is not an obvious way IMO.  
Maybe we can address this later by, e.g., having Pipelines take a 
validationFlagCol Param and pass that to the sub-models in the Pipeline.  Let’s 
not worry about this for now.)
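
A hypothetical sketch of how the proposed Param might look from the user side; 
{{setValidationFlagCol}} does not exist yet, and the data and column names are 
invented:

{code}
// Hypothetical usage sketch of the proposed validationFlagCol Param; the setter
// is commented out because it is not part of the current API.
// trainingData is assumed to be a DataFrame with "features" and "label" columns.
import org.apache.spark.ml.classification.GBTClassifier
import org.apache.spark.sql.functions.rand

// Mark roughly 20% of the rows as validation rows in an extra boolean column.
val flagged = trainingData.withColumn("isValidation", rand(42) > 0.8)

val gbt = new GBTClassifier()
  .setFeaturesCol("features")
  .setLabelCol("label")
  // .setValidationFlagCol("isValidation")  // the proposed Param, not yet implemented

val model = gbt.fit(flagged)
{code}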


  was:
In spark.mllib GradientBoostedTrees, we have a method runWithValidation which 
takes a validation set.  We should add that to the spark.ml API.

This will require a bit of thinking about how the Pipelines API should handle a 
validation set (since Transformers and Estimators only take 1 input DataFrame). 
 The current plan is to include an extra column in the input DataFrame which 
indicates whether the row is for training, validation, etc.


> Add fit with validation set to spark.ml GBT
> ---
>
> Key: SPARK-7132
> URL: https://issues.apache.org/jira/browse/SPARK-7132
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> In spark.mllib GradientBoostedTrees, we have a method runWithValidation which 
> takes a validation set.  We should add that to the spark.ml API.
> This will require a bit of thinking about how the Pipelines API should handle 
> a validation set (since Transformers and Estimators only take 1 input 
> DataFrame).  The current plan is to include an extra column in the input 
> DataFrame which indicates whether the row is for training, validation, etc.
> Goals
> A  [P0] Support efficient validation during training
> B  [P1] Support early stopping based on validation metrics
> C  [P0] Ensure validation data are preprocessed identically to training data
> D  [P1] Support complex Pipelines with multiple models using validation data
> Proposal: column with indicator for train vs validation
> Include an extra column in the input DataFrame which indicates whether the 
> row is for training or validation.  Add a Param “validationFlagCol” used to 
> specify the extra column name.
> A, B, C are easy.
> D is doable.
> Each estimator would need to have its validationFlagCol Param set to the same 
> column.
> Complication: It would be ideal if we could prevent different estimators from 
> using different validation sets.  (Joseph: There is not an obvious way IMO.  
> Maybe we can address this later by, e.g., having Pipelines take a 
> validationFlagCol Param and pass that to the sub-models in the Pipeline.  
> Let’s not worry about this for now.)






[jira] [Assigned] (SPARK-23341) DataSourceOptions should handle path and table names to avoid confusion.

2018-04-17 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-23341:
---

Assignee: Wenchen Fan

> DataSourceOptions should handle path and table names to avoid confusion.
> 
>
> Key: SPARK-23341
> URL: https://issues.apache.org/jira/browse/SPARK-23341
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Ryan Blue
>Assignee: Wenchen Fan
>Priority: Major
> Fix For: 2.4.0
>
>
> DataSourceOptions should have getters for path and table, which should be 
> passed in when creating the options.






[jira] [Resolved] (SPARK-23341) DataSourceOptions should handle path and table names to avoid confusion.

2018-04-17 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-23341.
-
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 20535
[https://github.com/apache/spark/pull/20535]

> DataSourceOptions should handle path and table names to avoid confusion.
> 
>
> Key: SPARK-23341
> URL: https://issues.apache.org/jira/browse/SPARK-23341
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Ryan Blue
>Priority: Major
> Fix For: 2.4.0
>
>
> DataSourceOptions should have getters for path and table, which should be 
> passed in when creating the options.






[jira] [Comment Edited] (SPARK-24000) S3A: Create Table should fail on invalid AK/SK

2018-04-17 Thread Brahma Reddy Battula (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16440841#comment-16440841
 ] 

Brahma Reddy Battula edited comment on SPARK-24000 at 4/18/18 3:41 AM:
---

Discussed with [~ste...@apache.org] offline; we feel the create table code should 
do a getFileStatus() on the path to see what's there, and so fail on any 
no-permissions situation.

 Please forgive me if I am not clear. I am a newbie to Spark.

 

[~ste...@apache.org] can you please pitch in here?
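
A minimal sketch of the suggested probe, assuming it would be called from the 
create-table path; the helper and its placement are hypothetical, not the actual 
Spark code:

{code}
// Hypothetical helper sketching the suggestion above: stat the table location up
// front so invalid AK/SK surfaces as a permission error instead of a silently
// "successful" CREATE TABLE. Not the actual Spark code path.
import java.io.FileNotFoundException
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

def verifyLocationAccessible(location: String, hadoopConf: Configuration): Unit = {
  val path = new Path(location)
  val fs = path.getFileSystem(hadoopConf)
  try {
    fs.getFileStatus(path)              // fails fast when credentials are rejected
  } catch {
    case _: FileNotFoundException => () // location does not exist yet; that is fine
  }
}
{code}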


was (Author: brahmareddy):
Discussed with [~ste...@apache.org] offline; the create table code should do a 
getFileStatus() on the path to see what's there, and so fail on any no-permissions 
situation.

 

Please forgive me if I am not clear. I am a newbie to Spark.

> S3A: Create Table should fail on invalid AK/SK
> --
>
> Key: SPARK-24000
> URL: https://issues.apache.org/jira/browse/SPARK-24000
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 2.3.0
>Reporter: Brahma Reddy Battula
>Priority: Major
>
> Currently, when we pass an invalid AK/SK, *create table* will still be a *success*.
> When the S3AFileSystem is initialized, *verifyBucketExists*() is called, 
> which will return *true* for the status code 403 
> (*BUCKET_ACCESS_FORBIDDEN_STATUS_CODE*), as in the following, because the bucket 
> is treated as existing.
> {code:java}
> public boolean doesBucketExist(String bucketName)
>     throws AmazonClientException, AmazonServiceException {
>   try {
>     headBucket(new HeadBucketRequest(bucketName));
>     return true;
>   } catch (AmazonServiceException ase) {
>     // A redirect error or a forbidden error means the bucket exists. So
>     // returning true.
>     if ((ase.getStatusCode() == Constants.BUCKET_REDIRECT_STATUS_CODE)
>         || (ase.getStatusCode() == Constants.BUCKET_ACCESS_FORBIDDEN_STATUS_CODE)) {
>       return true;
>     }
>     if (ase.getStatusCode() == Constants.NO_SUCH_BUCKET_STATUS_CODE) {
>       return false;
>     }
>     throw ase;
>   }
> }{code}






[jira] [Commented] (SPARK-22676) Avoid iterating all partition paths when spark.sql.hive.verifyPartitionPath=true

2018-04-17 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16441849#comment-16441849
 ] 

Apache Spark commented on SPARK-22676:
--

User 'jinxing64' has created a pull request for this issue:
https://github.com/apache/spark/pull/21091

> Avoid iterating all partition paths when 
> spark.sql.hive.verifyPartitionPath=true
> 
>
> Key: SPARK-22676
> URL: https://issues.apache.org/jira/browse/SPARK-22676
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: jin xing
>Assignee: jin xing
>Priority: Major
> Fix For: 2.4.0
>
>
> In the current code, all partition paths are scanned when 
> spark.sql.hive.verifyPartitionPath=true.
> e.g. a table like below:
> CREATE TABLE `test`(
>   `id` int,
>   `age` int,
>   `name` string)
> PARTITIONED BY (
>   `A` string,
>   `B` string)
> load data local inpath '/tmp/data1' into table test partition(A='00', B='00')
> load data local inpath '/tmp/data1' into table test partition(A='01', B='01')
> load data local inpath '/tmp/data1' into table test partition(A='10', B='10')
> load data local inpath '/tmp/data1' into table test partition(A='11', B='11')
> If I query with SQL -- "select * from test where A='00' and B='00'", the current 
> code will scan all partition paths, including '/data/A=00/B=00', 
> '/data/A=01/B=01', '/data/A=10/B=10', '/data/A=11/B=11'. It costs much time and 
> memory. 






[jira] [Commented] (SPARK-18057) Update structured streaming kafka from 0.10.0.1 to 1.1.0

2018-04-17 Thread Jordan Moore (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16441826#comment-16441826
 ] 

Jordan Moore commented on SPARK-18057:
--

Based on my GitHub searches, it looks like it's Spark 2.1.1, possibly upgradable 
to at most 2.2.0 based on the available parent POM. 

> Update structured streaming kafka from 0.10.0.1 to 1.1.0
> 
>
> Key: SPARK-18057
> URL: https://issues.apache.org/jira/browse/SPARK-18057
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Reporter: Cody Koeninger
>Priority: Major
>
> There are a couple of relevant KIPs here, 
> https://archive.apache.org/dist/kafka/0.10.1.0/RELEASE_NOTES.html






[jira] [Commented] (SPARK-18057) Update structured streaming kafka from 0.10.0.1 to 1.1.0

2018-04-17 Thread Cody Koeninger (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16441818#comment-16441818
 ] 

Cody Koeninger commented on SPARK-18057:


OK, if you can figure out what version of Spark it is, I can help with getting a 
branch with the updated dependency.

> Update structured streaming kafka from 0.10.0.1 to 1.1.0
> 
>
> Key: SPARK-18057
> URL: https://issues.apache.org/jira/browse/SPARK-18057
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Reporter: Cody Koeninger
>Priority: Major
>
> There are a couple of relevant KIPs here, 
> https://archive.apache.org/dist/kafka/0.10.1.0/RELEASE_NOTES.html






[jira] [Comment Edited] (SPARK-18057) Update structured streaming kafka from 0.10.0.1 to 1.1.0

2018-04-17 Thread Richard Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16441802#comment-16441802
 ] 

Richard Yu edited comment on SPARK-18057 at 4/18/18 2:46 AM:
-

Kafka contributors / developers are currently still discussing 
[KIP-266|https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=75974886],
 so we don't have any working models for the hanging position() calls just yet.


was (Author: yohan123):
Kafka contributors / developers are currently still discussing over KIP-266, so 
we don't have any working models for the hanging position() calls just yet.

> Update structured streaming kafka from 0.10.0.1 to 1.1.0
> 
>
> Key: SPARK-18057
> URL: https://issues.apache.org/jira/browse/SPARK-18057
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Reporter: Cody Koeninger
>Priority: Major
>
> There are a couple of relevant KIPs here, 
> https://archive.apache.org/dist/kafka/0.10.1.0/RELEASE_NOTES.html






[jira] [Commented] (SPARK-18057) Update structured streaming kafka from 0.10.0.1 to 1.1.0

2018-04-17 Thread Richard Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16441802#comment-16441802
 ] 

Richard Yu commented on SPARK-18057:


Kafka contributors / developers are currently still discussing KIP-266, so 
we don't have any working models for the hanging position() calls just yet.

> Update structured streaming kafka from 0.10.0.1 to 1.1.0
> 
>
> Key: SPARK-18057
> URL: https://issues.apache.org/jira/browse/SPARK-18057
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Reporter: Cody Koeninger
>Priority: Major
>
> There are a couple of relevant KIPs here, 
> https://archive.apache.org/dist/kafka/0.10.1.0/RELEASE_NOTES.html






[jira] [Commented] (SPARK-18057) Update structured streaming kafka from 0.10.0.1 to 1.1.0

2018-04-17 Thread Jordan Moore (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16441803#comment-16441803
 ] 

Jordan Moore commented on SPARK-18057:
--

{quote}probably won't work
{quote}
I figured as much. 

Nope, not actively deleting topics myself. I can't control what others are 
doing, of course. 
{quote}[change] just the app assembly ... [I] can try out
{quote}
I should mention that I am not actually the one running the code, otherwise I 
would offer more logs. 
I work on the infrastructure team, and I believe the other team with the code 
is using Qubole's cloud-managed Spark environment or possibly EMR. I know where 
the assembly jar is on our HDFS cluster, but I'm not really sure how all of that 
is managed in AWS. I doubt it's running Spark 2.3.0, though. 

> Update structured streaming kafka from 0.10.0.1 to 1.1.0
> 
>
> Key: SPARK-18057
> URL: https://issues.apache.org/jira/browse/SPARK-18057
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Reporter: Cody Koeninger
>Priority: Major
>
> There are a couple of relevant KIPs here, 
> https://archive.apache.org/dist/kafka/0.10.1.0/RELEASE_NOTES.html






[jira] [Assigned] (SPARK-21479) Outer join filter pushdown in null supplying table when condition is on one of the joined columns

2018-04-17 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-21479:
---

Assignee: Maryann Xue

> Outer join filter pushdown in null supplying table when condition is on one 
> of the joined columns
> -
>
> Key: SPARK-21479
> URL: https://issues.apache.org/jira/browse/SPARK-21479
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, SQL
>Affects Versions: 2.1.0, 2.1.1, 2.2.0
>Reporter: Abhijit Bhole
>Assignee: Maryann Xue
>Priority: Major
> Fix For: 2.4.0
>
>
> Here are two different query plans - 
> {code:java}
> df1 = spark.createDataFrame([{ "a": 1, "b" : 2}, { "a": 3, "b" : 4}])
> df2 = spark.createDataFrame([{ "a": 1, "c" : 5}, { "a": 3, "c" : 6}, { "a": 
> 5, "c" : 8}])
> df1.join(df2, ['a'], 'right_outer').where("b = 2").explain()
> == Physical Plan ==
> *Project [a#16299L, b#16295L, c#16300L]
> +- *SortMergeJoin [a#16294L], [a#16299L], Inner
>:- *Sort [a#16294L ASC NULLS FIRST], false, 0
>:  +- Exchange hashpartitioning(a#16294L, 4)
>: +- *Filter ((isnotnull(b#16295L) && (b#16295L = 2)) && 
> isnotnull(a#16294L))
>:+- Scan ExistingRDD[a#16294L,b#16295L]
>+- *Sort [a#16299L ASC NULLS FIRST], false, 0
>   +- Exchange hashpartitioning(a#16299L, 4)
>  +- *Filter isnotnull(a#16299L)
> +- Scan ExistingRDD[a#16299L,c#16300L]
> df1 = spark.createDataFrame([{ "a": 1, "b" : 2}, { "a": 3, "b" : 4}])
> df2 = spark.createDataFrame([{ "a": 1, "c" : 5}, { "a": 3, "c" : 6}, { "a": 
> 5, "c" : 8}])
> df1.join(df2, ['a'], 'right_outer').where("a = 1").explain()
> == Physical Plan ==
> *Project [a#16314L, b#16310L, c#16315L]
> +- SortMergeJoin [a#16309L], [a#16314L], RightOuter
>:- *Sort [a#16309L ASC NULLS FIRST], false, 0
>:  +- Exchange hashpartitioning(a#16309L, 4)
>: +- Scan ExistingRDD[a#16309L,b#16310L]
>+- *Sort [a#16314L ASC NULLS FIRST], false, 0
>   +- Exchange hashpartitioning(a#16314L, 4)
>  +- *Filter (isnotnull(a#16314L) && (a#16314L = 1))
> +- Scan ExistingRDD[a#16314L,c#16315L]
> {code}
> If condition on b can be pushed down on df1 then why not condition on a?






[jira] [Resolved] (SPARK-21479) Outer join filter pushdown in null supplying table when condition is on one of the joined columns

2018-04-17 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-21479.
-
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 20816
[https://github.com/apache/spark/pull/20816]

> Outer join filter pushdown in null supplying table when condition is on one 
> of the joined columns
> -
>
> Key: SPARK-21479
> URL: https://issues.apache.org/jira/browse/SPARK-21479
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, SQL
>Affects Versions: 2.1.0, 2.1.1, 2.2.0
>Reporter: Abhijit Bhole
>Priority: Major
> Fix For: 2.4.0
>
>
> Here are two different query plans - 
> {code:java}
> df1 = spark.createDataFrame([{ "a": 1, "b" : 2}, { "a": 3, "b" : 4}])
> df2 = spark.createDataFrame([{ "a": 1, "c" : 5}, { "a": 3, "c" : 6}, { "a": 5, "c" : 8}])
> df1.join(df2, ['a'], 'right_outer').where("b = 2").explain()
> == Physical Plan ==
> *Project [a#16299L, b#16295L, c#16300L]
> +- *SortMergeJoin [a#16294L], [a#16299L], Inner
>    :- *Sort [a#16294L ASC NULLS FIRST], false, 0
>    :  +- Exchange hashpartitioning(a#16294L, 4)
>    :     +- *Filter ((isnotnull(b#16295L) && (b#16295L = 2)) && isnotnull(a#16294L))
>    :        +- Scan ExistingRDD[a#16294L,b#16295L]
>    +- *Sort [a#16299L ASC NULLS FIRST], false, 0
>       +- Exchange hashpartitioning(a#16299L, 4)
>          +- *Filter isnotnull(a#16299L)
>             +- Scan ExistingRDD[a#16299L,c#16300L]
> df1 = spark.createDataFrame([{ "a": 1, "b" : 2}, { "a": 3, "b" : 4}])
> df2 = spark.createDataFrame([{ "a": 1, "c" : 5}, { "a": 3, "c" : 6}, { "a": 5, "c" : 8}])
> df1.join(df2, ['a'], 'right_outer').where("a = 1").explain()
> == Physical Plan ==
> *Project [a#16314L, b#16310L, c#16315L]
> +- SortMergeJoin [a#16309L], [a#16314L], RightOuter
>    :- *Sort [a#16309L ASC NULLS FIRST], false, 0
>    :  +- Exchange hashpartitioning(a#16309L, 4)
>    :     +- Scan ExistingRDD[a#16309L,b#16310L]
>    +- *Sort [a#16314L ASC NULLS FIRST], false, 0
>       +- Exchange hashpartitioning(a#16314L, 4)
>          +- *Filter (isnotnull(a#16314L) && (a#16314L = 1))
>             +- Scan ExistingRDD[a#16314L,c#16315L]
> {code}
> If condition on b can be pushed down on df1 then why not condition on a?
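For readers who want to check the behavior themselves, here is a minimal Scala sketch that mirrors the PySpark repro above (it is not taken from the patch itself); running it on builds before and after the 2.4.0 fix makes it easy to compare where the filter on the join key ends up in the physical plan.

{code:java}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("spark-21479-repro")
  .getOrCreate()
import spark.implicits._

val df1 = Seq((1L, 2L), (3L, 4L)).toDF("a", "b")
val df2 = Seq((1L, 5L), (3L, 6L), (5L, 8L)).toDF("a", "c")

// Filter on the join key of a right outer join: left rows with a different
// key can never survive the join, so the ticket argues that `a = 1` can be
// pushed to the null-supplying (left) side as well.
df1.join(df2, Seq("a"), "right_outer").where($"a" === 1).explain(true)
{code}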



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18057) Update structured streaming kafka from 0.10.0.1 to 1.1.0

2018-04-17 Thread Cody Koeninger (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16441793#comment-16441793
 ] 

Cody Koeninger commented on SPARK-18057:


Just adding the extra dependency on 0.11 probably won't work, since there were 
some actual code changes, not just a dependency bump.  As long as you aren't 
deleting topics, the blocker KAFKA-4879 issue shouldn't affect you trying out a 
branch with the changes, and it wouldn't require changing your Spark 
deployment, just the app assembly.

[~sliebau] [~Yohan123] do you have an up to date branch of spark with the kafka 
client dependency change that Jordan can try out?  If not I can make one.

> Update structured streaming kafka from 0.10.0.1 to 1.1.0
> 
>
> Key: SPARK-18057
> URL: https://issues.apache.org/jira/browse/SPARK-18057
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Reporter: Cody Koeninger
>Priority: Major
>
> There are a couple of relevant KIPs here, 
> https://archive.apache.org/dist/kafka/0.10.1.0/RELEASE_NOTES.html



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22968) java.lang.IllegalStateException: No current assignment for partition kssh-2

2018-04-17 Thread Cody Koeninger (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cody Koeninger resolved SPARK-22968.

   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 21038
[https://github.com/apache/spark/pull/21038]

> java.lang.IllegalStateException: No current assignment for partition kssh-2
> ---
>
> Key: SPARK-22968
> URL: https://issues.apache.org/jira/browse/SPARK-22968
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.1
> Environment: Kafka:  0.10.0  (CDH5.12.0)
> Apache Spark 2.1.1
> Spark streaming+Kafka
>Reporter: Jepson
>Assignee: Saisai Shao
>Priority: Major
> Fix For: 2.4.0
>
>   Original Estimate: 96h
>  Remaining Estimate: 96h
>
> *Kafka Broker:*
> {code:java}
>message.max.bytes : 2621440  
> {code}
> *Spark Streaming+Kafka Code:*
> {code:java}
> , "max.partition.fetch.bytes" -> (5242880: java.lang.Integer) //default: 
> 1048576
> , "request.timeout.ms" -> (9: java.lang.Integer) //default: 6
> , "session.timeout.ms" -> (6: java.lang.Integer) //default: 3
> , "heartbeat.interval.ms" -> (5000: java.lang.Integer)
> , "receive.buffer.bytes" -> (10485760: java.lang.Integer)
> {code}
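For context, the sketch below shows how parameters like these are typically wired into the direct stream whose {{latestOffsets}}/{{seekToEnd}} call appears in the trace that follows. It is only an illustration; the broker address, batch interval and topic name are placeholders, not values from this report.

{code:java}
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent

val conf = new SparkConf().setMaster("local[2]").setAppName("kssh-direct-stream")
val ssc = new StreamingContext(conf, Seconds(5))

val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "broker:9092",
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "use_a_separate_group_id_for_each_stream",
  "max.partition.fetch.bytes" -> (5242880: java.lang.Integer)
)

// Every batch interval the stream asks the consumer for the latest offsets
// (DirectKafkaInputDStream.latestOffsets -> consumer.seekToEnd), which is
// where the IllegalStateException below is thrown after a rebalance.
val stream = KafkaUtils.createDirectStream[String, String](
  ssc,
  PreferConsistent,
  Subscribe[String, String](Seq("kssh"), kafkaParams)
)
{code}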
> *Error message:*
> {code:java}
> 18/01/05 09:48:27 INFO internals.ConsumerCoordinator: Revoking previously 
> assigned partitions [kssh-7, kssh-4, kssh-3, kssh-6, kssh-5, kssh-0, kssh-2, 
> kssh-1] for group use_a_separate_group_id_for_each_stream
> 18/01/05 09:48:27 INFO internals.AbstractCoordinator: (Re-)joining group 
> use_a_separate_group_id_for_each_stream
> 18/01/05 09:48:27 INFO internals.AbstractCoordinator: Successfully joined 
> group use_a_separate_group_id_for_each_stream with generation 4
> 18/01/05 09:48:27 INFO internals.ConsumerCoordinator: Setting newly assigned 
> partitions [kssh-7, kssh-4, kssh-6, kssh-5] for group 
> use_a_separate_group_id_for_each_stream
> 18/01/05 09:48:27 ERROR scheduler.JobScheduler: Error generating jobs for 
> time 1515116907000 ms
> java.lang.IllegalStateException: No current assignment for partition kssh-2
>   at 
> org.apache.kafka.clients.consumer.internals.SubscriptionState.assignedState(SubscriptionState.java:231)
>   at 
> org.apache.kafka.clients.consumer.internals.SubscriptionState.needOffsetReset(SubscriptionState.java:295)
>   at 
> org.apache.kafka.clients.consumer.KafkaConsumer.seekToEnd(KafkaConsumer.java:1169)
>   at 
> org.apache.spark.streaming.kafka010.DirectKafkaInputDStream.latestOffsets(DirectKafkaInputDStream.scala:197)
>   at 
> org.apache.spark.streaming.kafka010.DirectKafkaInputDStream.compute(DirectKafkaInputDStream.scala:214)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341)
>   at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:340)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:340)
>   at 
> org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:415)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:335)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:333)
>   at scala.Option.orElse(Option.scala:289)
>   at 
> org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:330)
>   at 
> org.apache.spark.streaming.dstream.ForEachDStream.generateJob(ForEachDStream.scala:48)
>   at 
> org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:117)
>   at 
> org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:116)
>   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
>   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>   at 
> scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
>   at scala.collection.AbstractTraversable.flatMap(Traversable.scala:104)
>   at 
> org.apache.spark.streaming.DStreamGraph.generateJobs(DStreamGraph.scala:116)
>   at 
> org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$3.apply(JobGenerator.scala:249)
>   at 
> 

[jira] [Assigned] (SPARK-22968) java.lang.IllegalStateException: No current assignment for partition kssh-2

2018-04-17 Thread Cody Koeninger (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cody Koeninger reassigned SPARK-22968:
--

Assignee: Saisai Shao

> java.lang.IllegalStateException: No current assignment for partition kssh-2
> ---
>
> Key: SPARK-22968
> URL: https://issues.apache.org/jira/browse/SPARK-22968
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.1
> Environment: Kafka:  0.10.0  (CDH5.12.0)
> Apache Spark 2.1.1
> Spark streaming+Kafka
>Reporter: Jepson
>Assignee: Saisai Shao
>Priority: Major
>   Original Estimate: 96h
>  Remaining Estimate: 96h
>
> *Kafka Broker:*
> {code:java}
>message.max.bytes : 2621440  
> {code}
> *Spark Streaming+Kafka Code:*
> {code:java}
> , "max.partition.fetch.bytes" -> (5242880: java.lang.Integer) //default: 
> 1048576
> , "request.timeout.ms" -> (9: java.lang.Integer) //default: 6
> , "session.timeout.ms" -> (6: java.lang.Integer) //default: 3
> , "heartbeat.interval.ms" -> (5000: java.lang.Integer)
> , "receive.buffer.bytes" -> (10485760: java.lang.Integer)
> {code}
> *Error message:*
> {code:java}
> 18/01/05 09:48:27 INFO internals.ConsumerCoordinator: Revoking previously 
> assigned partitions [kssh-7, kssh-4, kssh-3, kssh-6, kssh-5, kssh-0, kssh-2, 
> kssh-1] for group use_a_separate_group_id_for_each_stream
> 18/01/05 09:48:27 INFO internals.AbstractCoordinator: (Re-)joining group 
> use_a_separate_group_id_for_each_stream
> 18/01/05 09:48:27 INFO internals.AbstractCoordinator: Successfully joined 
> group use_a_separate_group_id_for_each_stream with generation 4
> 18/01/05 09:48:27 INFO internals.ConsumerCoordinator: Setting newly assigned 
> partitions [kssh-7, kssh-4, kssh-6, kssh-5] for group 
> use_a_separate_group_id_for_each_stream
> 18/01/05 09:48:27 ERROR scheduler.JobScheduler: Error generating jobs for 
> time 1515116907000 ms
> java.lang.IllegalStateException: No current assignment for partition kssh-2
>   at 
> org.apache.kafka.clients.consumer.internals.SubscriptionState.assignedState(SubscriptionState.java:231)
>   at 
> org.apache.kafka.clients.consumer.internals.SubscriptionState.needOffsetReset(SubscriptionState.java:295)
>   at 
> org.apache.kafka.clients.consumer.KafkaConsumer.seekToEnd(KafkaConsumer.java:1169)
>   at 
> org.apache.spark.streaming.kafka010.DirectKafkaInputDStream.latestOffsets(DirectKafkaInputDStream.scala:197)
>   at 
> org.apache.spark.streaming.kafka010.DirectKafkaInputDStream.compute(DirectKafkaInputDStream.scala:214)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341)
>   at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:340)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:340)
>   at 
> org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:415)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:335)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:333)
>   at scala.Option.orElse(Option.scala:289)
>   at 
> org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:330)
>   at 
> org.apache.spark.streaming.dstream.ForEachDStream.generateJob(ForEachDStream.scala:48)
>   at 
> org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:117)
>   at 
> org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:116)
>   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
>   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>   at 
> scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
>   at scala.collection.AbstractTraversable.flatMap(Traversable.scala:104)
>   at 
> org.apache.spark.streaming.DStreamGraph.generateJobs(DStreamGraph.scala:116)
>   at 
> org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$3.apply(JobGenerator.scala:249)
>   at 
> org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$3.apply(JobGenerator.scala:247)
>   at scala.util.Try$.apply(Try.scala:192)
>

[jira] [Comment Edited] (SPARK-18057) Update structured streaming kafka from 0.10.0.1 to 1.1.0

2018-04-17 Thread Jordan Moore (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16441718#comment-16441718
 ] 

Jordan Moore edited comment on SPARK-18057 at 4/18/18 12:53 AM:


Hello Cody, 

No it is not. And no other specific configurations are applied to the topic. I 
have reason to believe the problem is not Spark specific (other than maybe the 
committed offset management by a producer).

On the consumer side, it seems to be an issue in just the Java client because 
that offset output given is from a plain Java consumer. All versions of 
kafka-clients 0.10.x demonstrated the same behavior. 


was (Author: cricket007):
Hello Cody, 

No it is not. And no other specific configurations are applied to the topic. I 
have reason to believe the problem is not Spark specific. It is an issue in 
just the Java client because that offset output given is from a plain Java 
consumer. All versions of kafka-clients 0.10.x demonstrated the same behavior. 

> Update structured streaming kafka from 0.10.0.1 to 1.1.0
> 
>
> Key: SPARK-18057
> URL: https://issues.apache.org/jira/browse/SPARK-18057
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Reporter: Cody Koeninger
>Priority: Major
>
> There are a couple of relevant KIPs here, 
> https://archive.apache.org/dist/kafka/0.10.1.0/RELEASE_NOTES.html



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18057) Update structured streaming kafka from 0.10.0.1 to 1.1.0

2018-04-17 Thread Jordan Moore (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16441718#comment-16441718
 ] 

Jordan Moore commented on SPARK-18057:
--

Hello Cody, 

No it is not. And no other specific configurations are applied to the topic. I 
have reason to believe the problem is not Spark specific. It is an issue in 
just the Java client because that offset output given is from a plain Java 
consumer. All versions of kafka-clients 0.10.x demonstrated the same behavior. 

> Update structured streaming kafka from 0.10.0.1 to 1.1.0
> 
>
> Key: SPARK-18057
> URL: https://issues.apache.org/jira/browse/SPARK-18057
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Reporter: Cody Koeninger
>Priority: Major
>
> There are a couple of relevant KIPs here, 
> https://archive.apache.org/dist/kafka/0.10.1.0/RELEASE_NOTES.html



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18693) BinaryClassificationEvaluator, RegressionEvaluator, and MulticlassClassificationEvaluator should use sample weight data

2018-04-17 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16441713#comment-16441713
 ] 

Joseph K. Bradley commented on SPARK-18693:
---

[~imatiach] Would you mind creating JIRA subtasks so that we have 1 PR per 
JIRA?  That helps with tracking.  Thanks!

> BinaryClassificationEvaluator, RegressionEvaluator, and 
> MulticlassClassificationEvaluator should use sample weight data
> ---
>
> Key: SPARK-18693
> URL: https://issues.apache.org/jira/browse/SPARK-18693
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.0.2
>Reporter: Devesh Parekh
>Priority: Major
>
> The LogisticRegression and LinearRegression models support training with a 
> weight column, but the corresponding evaluators do not support computing 
> metrics using those weights. This breaks model selection using CrossValidator.
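A minimal sketch of the gap being described, with the usual ML column names assumed: the estimator accepts a weight column, but the evaluator used during model selection had (at the time of this report) no equivalent setting, so the selection metrics ignore the per-row weights.

{code:java}
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator

val lr = new LogisticRegression()
  .setLabelCol("label")
  .setFeaturesCol("features")
  .setWeightCol("weight")               // weights are respected during training

val evaluator = new BinaryClassificationEvaluator()
  .setLabelCol("label")                 // no weight column setting here, so
  .setRawPredictionCol("rawPrediction") // CrossValidator picks models on
                                        // unweighted metrics
{code}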



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23990) Instruments logging improvements - ML regression package

2018-04-17 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16441701#comment-16441701
 ] 

Joseph K. Bradley commented on SPARK-23990:
---

A complication was brought up by this PR: Some logging occurs in classes which 
are not Estimators (WeightedLeastSquares, IterativelyReweightedLeastSquares) 
and in static objects (RandomForest, GradientBoostedTrees).  These may have an 
Instrumentation instance available (when used from an Estimator) or may not 
(when used in a unit test).  Options include:
1. Make these require Instrumentation instances.  This would require slightly 
awkward changes to unit tests.
2. Create something similar to Instrumentation or Logging which can store an 
Optional Instrumentation instance.  If the Instrumentation is available, it can 
log via that; otherwise, it can call into regular Logging.
2a. This could be a trait like Logging.  This is nice in that it requires fewer 
changes to existing logging code.
2b. This could be a class like Instrumentation.  This is nice in that it 
standardizes all of MLlib around Instrumentation instead of Logging.

I'd vote for 2b to standardize what we do in MLlib.  Thoughts?
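A rough sketch of what option 2b could look like, with stand-in types; the names and signatures here are illustrative only, not the classes that exist in MLlib.

{code:java}
trait Logging {
  def logInfo(msg: => String): Unit = println(s"INFO $msg")
}

class Instrumentation {
  def logInfo(msg: => String): Unit = println(s"INSTR $msg")
}

// Wrapper that prefers the Instrumentation when one is available (e.g. when
// called from an Estimator) and falls back to plain logging otherwise (e.g.
// when called from a unit test or a static object like RandomForest).
class OptionalInstrumentation(instr: Option[Instrumentation]) extends Logging {
  def info(msg: => String): Unit = instr match {
    case Some(i) => i.logInfo(msg)
    case None    => logInfo(msg)
  }
}
{code}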

> Instruments logging improvements - ML regression package
> 
>
> Key: SPARK-23990
> URL: https://issues.apache.org/jira/browse/SPARK-23990
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 2.3.0
> Environment: Instruments logging improvements - ML regression package
>Reporter: Weichen Xu
>Priority: Major
>   Original Estimate: 120h
>  Remaining Estimate: 120h
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18057) Update structured streaming kafka from 0.10.0.1 to 1.1.0

2018-04-17 Thread Cody Koeninger (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16441606#comment-16441606
 ] 

Cody Koeninger commented on SPARK-18057:


Out of curiosity, was that a compacted topic?



> Update structured streaming kafka from 0.10.0.1 to 1.1.0
> 
>
> Key: SPARK-18057
> URL: https://issues.apache.org/jira/browse/SPARK-18057
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Reporter: Cody Koeninger
>Priority: Major
>
> There are a couple of relevant KIPs here, 
> https://archive.apache.org/dist/kafka/0.10.1.0/RELEASE_NOTES.html



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24005) Remove usage of Scala’s parallel collection

2018-04-17 Thread Xiao Li (JIRA)
Xiao Li created SPARK-24005:
---

 Summary: Remove usage of Scala’s parallel collection
 Key: SPARK-24005
 URL: https://issues.apache.org/jira/browse/SPARK-24005
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core, SQL
Affects Versions: 2.3.0
Reporter: Xiao Li


{noformat}
val par = (1 to 100).par.flatMap { i =>
  Thread.sleep(1000)
  1 to 1000
}.toSeq
{noformat}

We are unable to interrupt the execution of parallel collections. We need to 
create a common utility function to do this instead of using Scala parallel 
collections.
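A minimal sketch of what such a utility could look like: map a function over a collection on an explicit thread pool so the work can be cancelled by shutting the pool down, which {{.par}} does not allow. The name {{parmap}} and the details are illustrative, not the helper that was eventually added.

{code:java}
import java.util.concurrent.Executors
import scala.concurrent.duration.Duration
import scala.concurrent.{Await, ExecutionContext, Future}

def parmap[A, B](items: Seq[A], numThreads: Int)(f: A => B): Seq[B] = {
  val pool = Executors.newFixedThreadPool(numThreads)
  implicit val ec: ExecutionContext = ExecutionContext.fromExecutorService(pool)
  try {
    // Submit all items up front, then block for the results; unlike `.par`,
    // the underlying pool can be shut down to interrupt the execution.
    val futures = items.map(item => Future(f(item)))
    futures.map(Await.result(_, Duration.Inf))
  } finally {
    pool.shutdownNow()
  }
}

val result = parmap(1 to 100, numThreads = 8) { i =>
  Thread.sleep(1000)
  i * 1000
}
{code}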



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23948) Trigger mapstage's job listener in submitMissingTasks

2018-04-17 Thread Imran Rashid (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Imran Rashid updated SPARK-23948:
-
Fix Version/s: 2.3.1

> Trigger mapstage's job listener in submitMissingTasks
> -
>
> Key: SPARK-23948
> URL: https://issues.apache.org/jira/browse/SPARK-23948
> Project: Spark
>  Issue Type: New Feature
>  Components: Scheduler, Spark Core
>Affects Versions: 2.3.0
>Reporter: jin xing
>Assignee: jin xing
>Priority: Major
> Fix For: 2.3.1, 2.4.0
>
>
> When SparkContext submits a map stage via "submitMapStage" to the DAGScheduler, 
> "markMapStageJobAsFinished" is called only in 
> (https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L933
>  and 
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1314).
> But consider the scenario below:
> 1. stage0 and stage1 are both "ShuffleMapStage"s and stage1 depends on stage0;
> 2. We submit stage1 by "submitMapStage";
> 3. While stage1 is running, a "FetchFailed" happens, and stage0 and stage1 get 
> resubmitted as stage0_1 and stage1_1;
> 4. While stage0_1 is running, speculated tasks from the old stage1 come back as 
> succeeded, but stage1 is not inside "runningStages". So even though all splits 
> (including the speculated tasks) in stage1 succeeded, the job listener of stage1 
> will not be called;
> 5. stage0_1 finishes and stage1_1 starts running. In "submitMissingTasks" there 
> are no missing tasks, but in the current code the job listener is still not triggered.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18057) Update structured streaming kafka from 0.10.0.1 to 1.1.0

2018-04-17 Thread Jordan Moore (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16441485#comment-16441485
 ] 

Jordan Moore commented on SPARK-18057:
--

Hi all, chiming in here to point out a production issue we are currently 
seeing. 

We recently upgraded from Confluent 3.1.2 (Kafka 0.10.1.1) to Confluent 3.3.1 
(Kafka 0.11.0.1), and are seeing messages such as 
{code:java}
Tried to fetch 473151075 but the returned record offset was 473151072{code}
Full Stacktrace below

So, looking into the raw topic with a 0.10.x Java consumer, we see that there 
are duplicated offsets (see those ending in 72-74 below), however when we 
deserialize the messages, the record values are actually different. 
{code:java}
offset = 473151070, timestamp=1523743598312, data=[B@5ae9a829
offset = 473151071, timestamp=1523743598211, data=[B@6d8a00e3
offset = 473151072, timestamp=1523743598213, data=[B@548b7f67
offset = 473151073, timestamp=1523743598215, data=[B@7ac7a4e4
offset = 473151074, timestamp=1523743598423, data=[B@6d78f375
offset = 473151072, timestamp=1523743598831, data=[B@50c87b21
offset = 473151073, timestamp=1523743598837, data=[B@5f375618
offset = 473151074, timestamp=1523743599017, data=[B@1810399e
offset = 473151075, timestamp=1523743599020, data=[B@32d992b2
offset = 473151076, timestamp=1523743599710, data=[B@215be6bb
offset = 473151077, timestamp=1523743599714, data=[B@4439f31e{code}
Running the same simple consumer with at least Kafka 0.11.x libraries *fixed* 
the issue. 

So, my question here is: even if there isn't a big push towards jumping all 
the way onto 1.1.0, what about simply upgrading to at least 0.11? Or how safe 
is it to just add {{org.apache.kafka:kafka-clients:0.11.0.0}}?

 

Full Stacktrace...
{code:java}
GScheduler: ResultStage 0 (start at SparkStreamingTask.java:222) failed in 
77.546 s due to Job aborted due to stage failure: Task 86 in stage 0.0 failed 4 
times, most recent failure: Lost task 86.3 in stage 0.0 (TID 96, 
ip-10-120-12-52.ec2.internal, executor 11): java.lang.IllegalStateException: 
Tried to fetch 473151075 but the returned record offset was 473151072
at 
org.apache.spark.sql.kafka010.CachedKafkaConsumer.fetchData(CachedKafkaConsumer.scala:234)
at 
org.apache.spark.sql.kafka010.CachedKafkaConsumer.get(CachedKafkaConsumer.scala:106)
at 
org.apache.spark.sql.kafka010.KafkaSourceRDD$$anon$1.getNext(KafkaSourceRDD.scala:158)
at 
org.apache.spark.sql.kafka010.KafkaSourceRDD$$anon$1.getNext(KafkaSourceRDD.scala:149)
at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
 Source)
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.convert.Wrappers$IteratorWrapper.hasNext(Wrappers.scala:30)
at 
com.domain.spark.KafkaConnectorTask.lambda$run$97bfadb$1(KafkaConnectorTask.java:81)
at org.apache.spark.sql.Dataset$$anonfun$48.apply(Dataset.scala:2269)
at org.apache.spark.sql.Dataset$$anonfun$48.apply(Dataset.scala:2269)
at 
org.apache.spark.sql.execution.MapPartitionsExec$$anonfun$6.apply(objects.scala:196)
at 
org.apache.spark.sql.execution.MapPartitionsExec$$anonfun$6.apply(objects.scala:193)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748){code}
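For reference, a sketch of the kind of "plain Java consumer" check described above, written here in Scala against the kafka-clients API; the broker address, topic and partition are placeholders, and the starting offset is just the one quoted in the comment. With a pre-0.11 client this is where the reporter saw the duplicated offsets.

{code:java}
import java.util.{Collections, Properties}
import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.kafka.common.TopicPartition
import org.apache.kafka.common.serialization.ByteArrayDeserializer
import scala.collection.JavaConverters._

val props = new Properties()
props.put("bootstrap.servers", "broker:9092")
props.put("group.id", "offset-inspection")
props.put("key.deserializer", classOf[ByteArrayDeserializer].getName)
props.put("value.deserializer", classOf[ByteArrayDeserializer].getName)

val consumer = new KafkaConsumer[Array[Byte], Array[Byte]](props)
val tp = new TopicPartition("some-topic", 0)
consumer.assign(Collections.singletonList(tp))
consumer.seek(tp, 473151070L)

// Print offset/timestamp for a handful of records to inspect the raw topic.
for (record <- consumer.poll(5000L).asScala.take(10)) {
  println(s"offset = ${record.offset}, timestamp=${record.timestamp}, data=${record.value}")
}
consumer.close()
{code}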

> Update structured streaming kafka from 0.10.0.1 to 1.1.0
> 

[jira] [Commented] (SPARK-15784) Add Power Iteration Clustering to spark.ml

2018-04-17 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16441448#comment-16441448
 ] 

Apache Spark commented on SPARK-15784:
--

User 'jkbradley' has created a pull request for this issue:
https://github.com/apache/spark/pull/21090

> Add Power Iteration Clustering to spark.ml
> --
>
> Key: SPARK-15784
> URL: https://issues.apache.org/jira/browse/SPARK-15784
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Xinh Huynh
>Assignee: Miao Wang
>Priority: Major
>
> Adding this algorithm is required as part of SPARK-4591: Algorithm/model 
> parity for spark.ml. The review JIRA for clustering is SPARK-14380.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23963) Queries on text-based Hive tables grow disproportionately slower as the number of columns increase

2018-04-17 Thread Bruce Robbins (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16441378#comment-16441378
 ] 

Bruce Robbins commented on SPARK-23963:
---

[~Tagar] Yes, although I am a little fuzzy on the process for back porting. I 
will try to find out.

> Queries on text-based Hive tables grow disproportionately slower as the 
> number of columns increase
> --
>
> Key: SPARK-23963
> URL: https://issues.apache.org/jira/browse/SPARK-23963
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Bruce Robbins
>Assignee: Bruce Robbins
>Priority: Minor
> Fix For: 2.4.0
>
>
> TableReader gets disproportionately slower as the number of columns in the 
> query increase.
> For example, reading a table with 6000 columns is 4 times more expensive per 
> record than reading a table with 3000 columns, rather than twice as expensive.
> The increase in processing time is due to several Lists (fieldRefs, 
> fieldOrdinals, and unwrappers), each of which the reader accesses by column 
> number for each column in a record. Because each List has O\(n\) time for 
> lookup by column number, these lookups grow increasingly expensive as the 
> column count increases.
> When I patched the code to change those 3 Lists to Arrays, the query times 
> became proportional.
>  
>  
>  
>  
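A standalone illustration (not the Spark code itself) of why the change matters: indexed access is O(n) on a linked List but O(1) on an Array, so per-record, per-column lookups grow disproportionately with the column count.

{code:java}
def time[A](label: String)(body: => A): A = {
  val start = System.nanoTime()
  val result = body
  println(f"$label%-6s ${(System.nanoTime() - start) / 1e6}%.1f ms")
  result
}

val cols = 6000
val asList: List[Int] = List.tabulate(cols)(identity)
val asArray: Array[Int] = Array.tabulate(cols)(identity)

// Simulate reading every column of 1000 records by positional lookup.
time("List") {
  var sum = 0L
  for (_ <- 0 until 1000; i <- 0 until cols) sum += asList(i)
  sum
}
time("Array") {
  var sum = 0L
  for (_ <- 0 until 1000; i <- 0 until cols) sum += asArray(i)
  sum
}
{code}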



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24004) Tests of from_json for MapType

2018-04-17 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16441369#comment-16441369
 ] 

Apache Spark commented on SPARK-24004:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/21089

> Tests of from_json for MapType
> --
>
> Key: SPARK-24004
> URL: https://issues.apache.org/jira/browse/SPARK-24004
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Maxim Gekk
>Priority: Trivial
>
> There are no tests for *from_json* that check *MapType* as a value type of 
> struct fields. MapType should be supported as a non-root type according to 
> the current implementation of JacksonParser, but the functionality is not checked.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24004) Tests of from_json for MapType

2018-04-17 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24004:


Assignee: Apache Spark

> Tests of from_json for MapType
> --
>
> Key: SPARK-24004
> URL: https://issues.apache.org/jira/browse/SPARK-24004
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Maxim Gekk
>Assignee: Apache Spark
>Priority: Trivial
>
> There are no tests for *from_json* that check *MapType* as a value type of 
> struct fields. MapType should be supported as a non-root type according to 
> the current implementation of JacksonParser, but the functionality is not checked.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24004) Tests of from_json for MapType

2018-04-17 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24004:


Assignee: (was: Apache Spark)

> Tests of from_json for MapType
> --
>
> Key: SPARK-24004
> URL: https://issues.apache.org/jira/browse/SPARK-24004
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Maxim Gekk
>Priority: Trivial
>
> There are no tests for *from_json* that check *MapType* as a value type of 
> struct fields. MapType should be supported as a non-root type according to 
> the current implementation of JacksonParser, but the functionality is not checked.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24004) Tests of from_json for MapType

2018-04-17 Thread Maxim Gekk (JIRA)
Maxim Gekk created SPARK-24004:
--

 Summary: Tests of from_json for MapType
 Key: SPARK-24004
 URL: https://issues.apache.org/jira/browse/SPARK-24004
 Project: Spark
  Issue Type: Test
  Components: SQL
Affects Versions: 2.3.0
Reporter: Maxim Gekk


There are no tests for *from_json* that check *MapType* as a value type of 
struct fields. MapType should be supported as a non-root type according to 
the current implementation of JacksonParser, but the functionality is not checked.
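A sketch of the kind of test the ticket asks for: parse JSON with a struct schema whose field is a MapType and read values out of the resulting map. This is illustrative only, not the test that was added in the pull request.

{code:java}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types._

val spark = SparkSession.builder().master("local[*]").appName("from-json-map").getOrCreate()
import spark.implicits._

// Struct schema with a MapType field, i.e. MapType used as a non-root type.
val schema = new StructType().add("attrs", MapType(StringType, IntegerType))

val df = Seq("""{"attrs": {"a": 1, "b": 2}}""").toDF("json")
df.select(from_json($"json", schema).as("parsed"))
  .select($"parsed.attrs".getItem("a").as("a"))
  .show()
{code}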



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15784) Add Power Iteration Clustering to spark.ml

2018-04-17 Thread Miao Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16441353#comment-16441353
 ] 

Miao Wang commented on SPARK-15784:
---

[~josephkb] You can start the new PR now. :)

> Add Power Iteration Clustering to spark.ml
> --
>
> Key: SPARK-15784
> URL: https://issues.apache.org/jira/browse/SPARK-15784
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Xinh Huynh
>Assignee: Miao Wang
>Priority: Major
>
> Adding this algorithm is required as part of SPARK-4591: Algorithm/model 
> parity for spark.ml. The review JIRA for clustering is SPARK-14380.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24003) Add support to provide spark.executor.extraJavaOptions in terms of App Id and/or Executor Id's

2018-04-17 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24003:


Assignee: (was: Apache Spark)

> Add support to provide spark.executor.extraJavaOptions in terms of App Id 
> and/or Executor Id's
> --
>
> Key: SPARK-24003
> URL: https://issues.apache.org/jira/browse/SPARK-24003
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos, Spark Core, YARN
>Affects Versions: 2.3.0
>Reporter: Devaraj K
>Priority: Major
>
> Users may want to enable gc logging or heap dumps for the executors, but there 
> is a chance that other executors overwrite the output since the paths cannot be 
> expressed dynamically. This improvement would make it possible to express the 
> spark.executor.extraJavaOptions paths in terms of the App Id and/or Executor Id 
> so that they are not overwritten by other executors.
> There was a discussion about this in SPARK-3767, but it was never fixed.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24003) Add support to provide spark.executor.extraJavaOptions in terms of App Id and/or Executor Id's

2018-04-17 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24003:


Assignee: Apache Spark

> Add support to provide spark.executor.extraJavaOptions in terms of App Id 
> and/or Executor Id's
> --
>
> Key: SPARK-24003
> URL: https://issues.apache.org/jira/browse/SPARK-24003
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos, Spark Core, YARN
>Affects Versions: 2.3.0
>Reporter: Devaraj K
>Assignee: Apache Spark
>Priority: Major
>
> Users may want to enable gc logging or heap dumps for the executors, but there 
> is a chance that other executors overwrite the output since the paths cannot be 
> expressed dynamically. This improvement would make it possible to express the 
> spark.executor.extraJavaOptions paths in terms of the App Id and/or Executor Id 
> so that they are not overwritten by other executors.
> There was a discussion about this in SPARK-3767, but it was never fixed.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24003) Add support to provide spark.executor.extraJavaOptions in terms of App Id and/or Executor Id's

2018-04-17 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16441328#comment-16441328
 ] 

Apache Spark commented on SPARK-24003:
--

User 'devaraj-kavali' has created a pull request for this issue:
https://github.com/apache/spark/pull/21088

> Add support to provide spark.executor.extraJavaOptions in terms of App Id 
> and/or Executor Id's
> --
>
> Key: SPARK-24003
> URL: https://issues.apache.org/jira/browse/SPARK-24003
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos, Spark Core, YARN
>Affects Versions: 2.3.0
>Reporter: Devaraj K
>Priority: Major
>
> Users may want to enable gc logging or heap dump for the executors, but there 
> is a chance of overwriting it by other executors since the paths cannot be 
> expressed dynamically. This improvement would enable to express the 
> spark.executor.extraJavaOptions paths in terms of App Id and Executor Id's to 
> avoid the overwriting by other executors.
> There was a discussion about this in SPARK-3767, but it never fixed.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22884) ML test for StructuredStreaming: spark.ml.clustering

2018-04-17 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-22884:
--
Shepherd: Joseph K. Bradley

> ML test for StructuredStreaming: spark.ml.clustering
> 
>
> Key: SPARK-22884
> URL: https://issues.apache.org/jira/browse/SPARK-22884
> Project: Spark
>  Issue Type: Test
>  Components: ML, Tests
>Affects Versions: 2.3.0
>Reporter: Joseph K. Bradley
>Priority: Major
>
> Task for adding Structured Streaming tests for all Models/Transformers in a 
> sub-module in spark.ml
> For an example, see LinearRegressionSuite.scala in 
> https://github.com/apache/spark/pull/19843



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24003) Add support to provide spark.executor.extraJavaOptions in terms of App Id and/or Executor Id's

2018-04-17 Thread Devaraj K (JIRA)
Devaraj K created SPARK-24003:
-

 Summary: Add support to provide spark.executor.extraJavaOptions in 
terms of App Id and/or Executor Id's
 Key: SPARK-24003
 URL: https://issues.apache.org/jira/browse/SPARK-24003
 Project: Spark
  Issue Type: Improvement
  Components: Mesos, Spark Core, YARN
Affects Versions: 2.3.0
Reporter: Devaraj K


Users may want to enable gc logging or heap dumps for the executors, but there 
is a chance that other executors overwrite the output since the paths cannot be 
expressed dynamically. This improvement would make it possible to express the 
spark.executor.extraJavaOptions paths in terms of the App Id and/or Executor Id 
so that they are not overwritten by other executors.

There was a discussion about this in SPARK-3767, but it was never fixed.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23933) High-order function: map(array, array) → map<K,V>

2018-04-17 Thread Kazuaki Ishizaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16441268#comment-16441268
 ] 

Kazuaki Ishizaki commented on SPARK-23933:
--

ping [~smilegator]

> High-order function: map(array, array) → map<K,V>
> ---
>
> Key: SPARK-23933
> URL: https://issues.apache.org/jira/browse/SPARK-23933
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Priority: Major
>
> Ref: https://prestodb.io/docs/current/functions/map.html
> Returns a map created using the given key/value arrays.
> {noformat}
> SELECT map(ARRAY[1,3], ARRAY[2,4]); -- {1 -> 2, 3 -> 4}
> {noformat}
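As a plain-Scala illustration of the expected semantics only (not the SQL implementation), the keys array is zipped positionally with the values array:

{code:java}
val keys = Seq(1, 3)
val values = Seq(2, 4)

// map(ARRAY[1,3], ARRAY[2,4]) pairs keys with values positionally.
val result: Map[Int, Int] = keys.zip(values).toMap   // Map(1 -> 2, 3 -> 4)
{code}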



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21063) Spark return an empty result from remote hadoop cluster

2018-04-17 Thread Carlos Bribiescas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16441257#comment-16441257
 ] 

Carlos Bribiescas commented on SPARK-21063:
---

Any update or workarounds for this?

> Spark return an empty result from remote hadoop cluster
> ---
>
> Key: SPARK-21063
> URL: https://issues.apache.org/jira/browse/SPARK-21063
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.1.0, 2.1.1
>Reporter: Peter Bykov
>Priority: Major
>
> Spark returns an empty result when querying a remote hadoop cluster.
> All firewall settings have been removed.
> Querying using JDBC works properly with the hive-jdbc driver version 1.1.1.
> The code snippet is:
> {code:java}
> val spark = SparkSession.builder
> .appName("RemoteSparkTest")
> .master("local")
> .getOrCreate()
> val df = spark.read
>   .option("url", "jdbc:hive2://remote.hive.local:1/default")
>   .option("user", "user")
>   .option("password", "pass")
>   .option("dbtable", "test_table")
>   .option("driver", "org.apache.hive.jdbc.HiveDriver")
>   .format("jdbc")
>   .load()
>  
> df.show()
> {code}
> Result:
> {noformat}
> +---+
> |test_table.test_col|
> +---+
> +---+
> {noformat}
> All manipulations like: 
> {code:java}
> df.select(*).show()
> {code}
> returns empty result too.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8799) OneVsRestModel should extend ClassificationModel

2018-04-17 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-8799:
-
Shepherd: Joseph K. Bradley

> OneVsRestModel should extend ClassificationModel
> 
>
> Key: SPARK-8799
> URL: https://issues.apache.org/jira/browse/SPARK-8799
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Feynman Liang
>Priority: Minor
>
> Many parts of `OneVsRestModel` can be generalized to `ClassificationModel`. 
> For example:
>  * `accColName` can be used to populate `ClassificationModel#predictRaw` and 
> share implementations of `transform`
>  * SPARK-8092 adds `setFeaturesCol` and `setPredictionCol` which could be 
> gotten for free through subclassing
> `ClassificationModel` is the correct supertype (e.g. not `PredictionModel`) 
> because the labels for a `OneVsRest` will always be discrete and finite.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8799) OneVsRestModel should extend ClassificationModel

2018-04-17 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16441207#comment-16441207
 ] 

Joseph K. Bradley commented on SPARK-8799:
--

The missing functionality was added in [SPARK-9312], but we cannot fix this 
JIRA until 3.0.0 since it will require breaking APIs (changing OneVsRest's 
inheritance structure and supported FeatureTypes).  Let's target this fix for 
3.0.0, for which I'll recommend:
* Rename current OneVsRest to GenericOneVsRest or something like that.  Have it 
inherit from Classifier and take a type parameter for FeaturesType.
* Add a specialization of GenericOneVsRest with fixed FeaturesType = VectorUDT, 
and call this new one OneVsRest.
I _think_ that will avoid breaking most user code (but I have not thought it 
through carefully).

> OneVsRestModel should extend ClassificationModel
> 
>
> Key: SPARK-8799
> URL: https://issues.apache.org/jira/browse/SPARK-8799
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Feynman Liang
>Priority: Minor
>
> Many parts of `OneVsRestModel` can be generalized to `ClassificationModel`. 
> For example:
>  * `accColName` can be used to populate `ClassificationModel#predictRaw` and 
> share implementations of `transform`
>  * SPARK-8092 adds `setFeaturesCol` and `setPredictionCol` which could be 
> gotten for free through subclassing
> `ClassificationModel` is the correct supertype (e.g. not `PredictionModel`) 
> because the labels for a `OneVsRest` will always be discrete and finite.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8799) OneVsRestModel should extend ClassificationModel

2018-04-17 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-8799:
-
Target Version/s: 3.0.0

> OneVsRestModel should extend ClassificationModel
> 
>
> Key: SPARK-8799
> URL: https://issues.apache.org/jira/browse/SPARK-8799
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Feynman Liang
>Priority: Minor
>
> Many parts of `OneVsRestModel` can be generalized to `ClassificationModel`. 
> For example:
>  * `accColName` can be used to populate `ClassificationModel#predictRaw` and 
> share implementations of `transform`
>  * SPARK-8092 adds `setFeaturesCol` and `setPredictionCol` which could be 
> gotten for free through subclassing
> `ClassificationModel` is the correct supertype (e.g. not `PredictionModel`) 
> because the labels for a `OneVsRest` will always be discrete and finite.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21741) Python API for DataFrame-based multivariate summarizer

2018-04-17 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley reassigned SPARK-21741:
-

Assignee: Weichen Xu

> Python API for DataFrame-based multivariate summarizer
> --
>
> Key: SPARK-21741
> URL: https://issues.apache.org/jira/browse/SPARK-21741
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 2.2.0
>Reporter: Yanbo Liang
>Assignee: Weichen Xu
>Priority: Major
> Fix For: 2.4.0
>
>
> We support multivariate summarizer for DataFrame API at SPARK-19634, we 
> should also make PySpark support it. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21741) Python API for DataFrame-based multivariate summarizer

2018-04-17 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-21741.
---
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 20695
[https://github.com/apache/spark/pull/20695]

> Python API for DataFrame-based multivariate summarizer
> --
>
> Key: SPARK-21741
> URL: https://issues.apache.org/jira/browse/SPARK-21741
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 2.2.0
>Reporter: Yanbo Liang
>Assignee: Weichen Xu
>Priority: Major
> Fix For: 2.4.0
>
>
> We support multivariate summarizer for DataFrame API at SPARK-19634, we 
> should also make PySpark support it. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23997) Configurable max number of buckets

2018-04-17 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16441146#comment-16441146
 ] 

Apache Spark commented on SPARK-23997:
--

User 'ferdonline' has created a pull request for this issue:
https://github.com/apache/spark/pull/21087

> Configurable max number of buckets
> --
>
> Key: SPARK-23997
> URL: https://issues.apache.org/jira/browse/SPARK-23997
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output, SQL
>Affects Versions: 2.2.1, 2.3.0
>Reporter: Fernando Pereira
>Priority: Major
>
> When exporting data as a table the user can choose to split data in buckets 
> by choosing the columns and the number of buckets. Currently there is a 
> hard-coded limit of 99'999 buckets.
> However, for heavy workloads this limit might be too restrictive, a situation 
> that will eventually become more common as workloads grow.
> As per the comments in SPARK-19618 this limit could be made configurable.
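A minimal sketch of the operation the ticket is about, assuming an existing DataFrame {{df}} with a {{user_id}} column: asking for more buckets than the hard-coded ceiling is rejected, and the proposal is to make that ceiling a configuration.

{code:java}
df.write
  .bucketBy(150000, "user_id")   // more than 99999 buckets: currently rejected
  .sortBy("user_id")
  .saveAsTable("events_bucketed")
{code}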



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23997) Configurable max number of buckets

2018-04-17 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23997?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23997:


Assignee: Apache Spark

> Configurable max number of buckets
> --
>
> Key: SPARK-23997
> URL: https://issues.apache.org/jira/browse/SPARK-23997
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output, SQL
>Affects Versions: 2.2.1, 2.3.0
>Reporter: Fernando Pereira
>Assignee: Apache Spark
>Priority: Major
>
> When exporting data as a table the user can choose to split data in buckets 
> by choosing the columns and the number of buckets. Currently there is a 
> hard-coded limit of 99'999 buckets.
> However, for heavy workloads this limit might be too restrictive, a situation 
> that will eventually become more common as workloads grow.
> As per the comments in SPARK-19618 this limit could be made configurable.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23997) Configurable max number of buckets

2018-04-17 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23997?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23997:


Assignee: (was: Apache Spark)

> Configurable max number of buckets
> --
>
> Key: SPARK-23997
> URL: https://issues.apache.org/jira/browse/SPARK-23997
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output, SQL
>Affects Versions: 2.2.1, 2.3.0
>Reporter: Fernando Pereira
>Priority: Major
>
> When exporting data as a table the user can choose to split data in buckets 
> by choosing the columns and the number of buckets. Currently there is a 
> hard-coded limit of 99'999 buckets.
> However, for heavy workloads this limit might be too restrictive, a situation 
> that will eventually become more common as workloads grow.
> As per the comments in SPARK-19618 this limit could be made configurable.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23986) CompileException when using too many avg aggregation after joining

2018-04-17 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23986?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-23986:
---

Assignee: Marco Gaido

> CompileException when using too many avg aggregation after joining
> --
>
> Key: SPARK-23986
> URL: https://issues.apache.org/jira/browse/SPARK-23986
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Michel Davit
>Assignee: Marco Gaido
>Priority: Major
> Fix For: 2.3.1, 2.4.0
>
> Attachments: spark-generated.java
>
>
> Considering the following code:
> {code:java}
> val df1: DataFrame = sparkSession.sparkContext
>   .makeRDD(Seq((0, 1, 2, 3, 4, 5, 6)))
>   .toDF("key", "col1", "col2", "col3", "col4", "col5", "col6")
> val df2: DataFrame = sparkSession.sparkContext
>   .makeRDD(Seq((0, "val1", "val2")))
>   .toDF("key", "dummy1", "dummy2")
> val agg = df1
>   .join(df2, df1("key") === df2("key"), "leftouter")
>   .groupBy(df1("key"))
>   .agg(
> avg("col2").as("avg2"),
> avg("col3").as("avg3"),
> avg("col4").as("avg4"),
> avg("col1").as("avg1"),
> avg("col5").as("avg5"),
> avg("col6").as("avg6")
>   )
> val head = agg.take(1)
> {code}
> This logs the following exception:
> {code:java}
> ERROR CodeGenerator: failed to compile: 
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 467, Column 28: Redefinition of parameter "agg_expr_11"
> {code}
> I am not a Spark expert, but after investigation I realized that the 
> generated {{doConsume}} method is responsible for the exception.
> Indeed, {{avg}} calls 
> {{org.apache.spark.sql.execution.CodegenSupport.constructDoConsumeFunction}} 
> several times: the first time with the 'avg' Expr and a second time for the 
> base aggregation Expr (count and sum).
> The problem comes from the generation of parameters in CodeGenerator:
> {code:java}
>   /**
>* Returns a term name that is unique within this instance of a 
> `CodegenContext`.
>*/
>   def freshName(name: String): String = synchronized {
> val fullName = if (freshNamePrefix == "") {
>   name
> } else {
>   s"${freshNamePrefix}_$name"
> }
> if (freshNameIds.contains(fullName)) {
>   val id = freshNameIds(fullName)
>   freshNameIds(fullName) = id + 1
>   s"$fullName$id"
> } else {
>   freshNameIds += fullName -> 1
>   fullName
> }
>   }
> {code}
> The {{freshNameIds}} already contains {{agg_expr_[1..6]}} from the 1st call.
>  The second call is made with {{agg_expr_[1..12]}} and generates the 
> following names:
>  {{agg_expr_[11|21|31|41|51|61|11|12]}}. We then have a parameter name 
> conflict in the generated code: {{agg_expr_11}}.
> Appending the 'id' in s"$fullName$id" to generate a unique term name is the 
> source of the conflict. Maybe simply using an underscore can solve this issue: 
> $fullName_$id"
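A standalone illustration of the collision (the map and method mirror the snippet above, trimmed of the prefix handling): appending the numeric suffix without a separator lets two distinct base names produce the same generated identifier.

{code:java}
import scala.collection.mutable

val freshNameIds = mutable.HashMap.empty[String, Int]

def freshName(name: String): String = {
  if (freshNameIds.contains(name)) {
    val id = freshNameIds(name)
    freshNameIds(name) = id + 1
    s"$name$id"            // second request for "agg_expr_1" yields "agg_expr_11"
  } else {
    freshNameIds += name -> 1
    name                   // first request for "agg_expr_11" also yields "agg_expr_11"
  }
}

println(freshName("agg_expr_1"))   // agg_expr_1
println(freshName("agg_expr_1"))   // agg_expr_11  <-- collides with the next line
println(freshName("agg_expr_11"))  // agg_expr_11
{code}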



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23986) CompileException when using too many avg aggregation after joining

2018-04-17 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23986?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-23986.
-
   Resolution: Fixed
Fix Version/s: 2.3.1
   2.4.0

Issue resolved by pull request 21080
[https://github.com/apache/spark/pull/21080]

> CompileException when using too many avg aggregation after joining
> --
>
> Key: SPARK-23986
> URL: https://issues.apache.org/jira/browse/SPARK-23986
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Michel Davit
>Assignee: Marco Gaido
>Priority: Major
> Fix For: 2.4.0, 2.3.1
>
> Attachments: spark-generated.java
>
>
> Considering the following code:
> {code:java}
> val df1: DataFrame = sparkSession.sparkContext
>   .makeRDD(Seq((0, 1, 2, 3, 4, 5, 6)))
>   .toDF("key", "col1", "col2", "col3", "col4", "col5", "col6")
> val df2: DataFrame = sparkSession.sparkContext
>   .makeRDD(Seq((0, "val1", "val2")))
>   .toDF("key", "dummy1", "dummy2")
> val agg = df1
>   .join(df2, df1("key") === df2("key"), "leftouter")
>   .groupBy(df1("key"))
>   .agg(
> avg("col2").as("avg2"),
> avg("col3").as("avg3"),
> avg("col4").as("avg4"),
> avg("col1").as("avg1"),
> avg("col5").as("avg5"),
> avg("col6").as("avg6")
>   )
> val head = agg.take(1)
> {code}
> This logs the following exception:
> {code:java}
> ERROR CodeGenerator: failed to compile: 
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 467, Column 28: Redefinition of parameter "agg_expr_11"
> {code}
> I am not a Spark expert, but after investigation I realized that the 
> generated {{doConsume}} method is responsible for the exception.
> Indeed, {{avg}} calls 
> {{org.apache.spark.sql.execution.CodegenSupport.constructDoConsumeFunction}} 
> several times: the first time with the 'avg' Expr, and a second time for the 
> base aggregation Expr (count and sum).
> The problem comes from the generation of parameters in CodeGenerator:
> {code:java}
>   /**
>* Returns a term name that is unique within this instance of a 
> `CodegenContext`.
>*/
>   def freshName(name: String): String = synchronized {
> val fullName = if (freshNamePrefix == "") {
>   name
> } else {
>   s"${freshNamePrefix}_$name"
> }
> if (freshNameIds.contains(fullName)) {
>   val id = freshNameIds(fullName)
>   freshNameIds(fullName) = id + 1
>   s"$fullName$id"
> } else {
>   freshNameIds += fullName -> 1
>   fullName
> }
>   }
> {code}
> The {{freshNameIds}} map already contains {{agg_expr_[1..6]}} from the 1st call.
> The second call is made with {{agg_expr_[1..12]}} and generates the 
> following names:
> {{agg_expr_[11|21|31|41|51|61|11|12]}}. We then have a parameter name 
> conflict in the generated code: {{agg_expr_11}}.
> Appending the 'id' in {{s"$fullName$id"}} to generate a unique term name is the 
> source of the conflict. Maybe simply using an underscore would solve this 
> issue: {{s"${fullName}_$id"}}.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23999) Spark SQL shell is a Stable one ? Can we use Spark SQL shell in our production environment?

2018-04-17 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-23999.

   Resolution: Invalid
Fix Version/s: (was: 2.3.0)
   (was: 3.0.0)

Please use the mailing lists for asking questions.

http://spark.apache.org/community.html

> Spark SQL shell is a Stable one ? Can we use Spark SQL shell in our 
> production environment?
> ---
>
> Key: SPARK-23999
> URL: https://issues.apache.org/jira/browse/SPARK-23999
> Project: Spark
>  Issue Type: Question
>  Components: Spark Core, Spark Shell, Spark Submit
>Affects Versions: 2.3.0
>Reporter: Prabhu Bentick
>Priority: Major
>  Labels: features
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> Hi Team,
> I understand from one of the developers working on Apache Spark that the Spark 
> SQL shell is not stable and that it is not recommended to use the Spark SQL 
> shell in a production environment. Is this true? Is the Apache Spark 3.0 
> release going to provide a stable version?
> Can anyone please clarify?
> Regards
> Prabhu Bentick



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24002) Task not serializable caused by org.apache.parquet.io.api.Binary$ByteBufferBackedBinary.getBytes

2018-04-17 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24002?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24002:


Assignee: Xiao Li  (was: Apache Spark)

> Task not serializable caused by 
> org.apache.parquet.io.api.Binary$ByteBufferBackedBinary.getBytes
> 
>
> Key: SPARK-24002
> URL: https://issues.apache.org/jira/browse/SPARK-24002
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>Priority: Major
>
> We have two queries: one is a 1000-line SQL query and the other a 3000-line SQL 
> query. They need to run for at least one hour with a heavy write workload to 
> reproduce the issue once. 
> {code}
> Py4JJavaError: An error occurred while calling o153.sql.
> : org.apache.spark.SparkException: Job aborted.
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:223)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:189)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79)
>   at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:190)
>   at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:190)
>   at org.apache.spark.sql.Dataset$$anonfun$59.apply(Dataset.scala:3021)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withCustomExecutionEnv(SQLExecution.scala:89)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:127)
>   at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3020)
>   at org.apache.spark.sql.Dataset.(Dataset.scala:190)
>   at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:74)
>   at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:646)
>   at sun.reflect.GeneratedMethodAccessor153.invoke(Unknown Source)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
>   at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:380)
>   at py4j.Gateway.invoke(Gateway.java:293)
>   at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
>   at py4j.commands.CallCommand.execute(CallCommand.java:79)
>   at py4j.GatewayConnection.run(GatewayConnection.java:226)
>   at java.lang.Thread.run(Thread.java:748)
> Caused by: org.apache.spark.SparkException: Exception thrown in Future.get: 
>   at 
> org.apache.spark.sql.execution.exchange.BroadcastExchangeExec.doExecuteBroadcast(BroadcastExchangeExec.scala:190)
>   at 
> org.apache.spark.sql.execution.InputAdapter.doExecuteBroadcast(WholeStageCodegenExec.scala:267)
>   at 
> org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec.doConsume(BroadcastNestedLoopJoinExec.scala:530)
>   at 
> org.apache.spark.sql.execution.CodegenSupport$class.consume(WholeStageCodegenExec.scala:155)
>   at 
> org.apache.spark.sql.execution.ProjectExec.consume(basicPhysicalOperators.scala:37)
>   at 
> org.apache.spark.sql.execution.ProjectExec.doConsume(basicPhysicalOperators.scala:69)
>   at 
> org.apache.spark.sql.execution.CodegenSupport$class.consume(WholeStageCodegenExec.scala:155)
>   at 
> org.apache.spark.sql.execution.FilterExec.consume(basicPhysicalOperators.scala:144)
>   ...
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:190)
>   ... 23 more
> Caused by: java.util.concurrent.ExecutionException: 
> org.apache.spark.SparkException: Task not serializable
>   at java.util.concurrent.FutureTask.report(FutureTask.java:122)
>   at java.util.concurrent.FutureTask.get(FutureTask.java:206)
>   at 
> org.apache.spark.sql.execution.exchange.BroadcastExchangeExec.doExecuteBroadcast(BroadcastExchangeExec.scala:179)
>   ... 276 more
> Caused by: org.apache.spark.SparkException: Task not serializable
>   at 
> org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:340)
>   at 
> org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:330)
>   at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:156)
>   at org.apache.spark.SparkContext.clean(SparkContext.scala:2380)
>   at 
> 

[jira] [Assigned] (SPARK-24002) Task not serializable caused by org.apache.parquet.io.api.Binary$ByteBufferBackedBinary.getBytes

2018-04-17 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24002?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24002:


Assignee: Apache Spark  (was: Xiao Li)

> Task not serializable caused by 
> org.apache.parquet.io.api.Binary$ByteBufferBackedBinary.getBytes
> 
>
> Key: SPARK-24002
> URL: https://issues.apache.org/jira/browse/SPARK-24002
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Apache Spark
>Priority: Major
>
> We have two queries: one is a 1000-line SQL query and the other a 3000-line SQL 
> query. They need to run for at least one hour with a heavy write workload to 
> reproduce the issue once. 
> {code}
> Py4JJavaError: An error occurred while calling o153.sql.
> : org.apache.spark.SparkException: Job aborted.
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:223)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:189)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79)
>   at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:190)
>   at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:190)
>   at org.apache.spark.sql.Dataset$$anonfun$59.apply(Dataset.scala:3021)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withCustomExecutionEnv(SQLExecution.scala:89)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:127)
>   at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3020)
>   at org.apache.spark.sql.Dataset.(Dataset.scala:190)
>   at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:74)
>   at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:646)
>   at sun.reflect.GeneratedMethodAccessor153.invoke(Unknown Source)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
>   at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:380)
>   at py4j.Gateway.invoke(Gateway.java:293)
>   at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
>   at py4j.commands.CallCommand.execute(CallCommand.java:79)
>   at py4j.GatewayConnection.run(GatewayConnection.java:226)
>   at java.lang.Thread.run(Thread.java:748)
> Caused by: org.apache.spark.SparkException: Exception thrown in Future.get: 
>   at 
> org.apache.spark.sql.execution.exchange.BroadcastExchangeExec.doExecuteBroadcast(BroadcastExchangeExec.scala:190)
>   at 
> org.apache.spark.sql.execution.InputAdapter.doExecuteBroadcast(WholeStageCodegenExec.scala:267)
>   at 
> org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec.doConsume(BroadcastNestedLoopJoinExec.scala:530)
>   at 
> org.apache.spark.sql.execution.CodegenSupport$class.consume(WholeStageCodegenExec.scala:155)
>   at 
> org.apache.spark.sql.execution.ProjectExec.consume(basicPhysicalOperators.scala:37)
>   at 
> org.apache.spark.sql.execution.ProjectExec.doConsume(basicPhysicalOperators.scala:69)
>   at 
> org.apache.spark.sql.execution.CodegenSupport$class.consume(WholeStageCodegenExec.scala:155)
>   at 
> org.apache.spark.sql.execution.FilterExec.consume(basicPhysicalOperators.scala:144)
>   ...
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:190)
>   ... 23 more
> Caused by: java.util.concurrent.ExecutionException: 
> org.apache.spark.SparkException: Task not serializable
>   at java.util.concurrent.FutureTask.report(FutureTask.java:122)
>   at java.util.concurrent.FutureTask.get(FutureTask.java:206)
>   at 
> org.apache.spark.sql.execution.exchange.BroadcastExchangeExec.doExecuteBroadcast(BroadcastExchangeExec.scala:179)
>   ... 276 more
> Caused by: org.apache.spark.SparkException: Task not serializable
>   at 
> org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:340)
>   at 
> org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:330)
>   at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:156)
>   at org.apache.spark.SparkContext.clean(SparkContext.scala:2380)
>   at 
> 

[jira] [Commented] (SPARK-24002) Task not serializable caused by org.apache.parquet.io.api.Binary$ByteBufferBackedBinary.getBytes

2018-04-17 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16441071#comment-16441071
 ] 

Apache Spark commented on SPARK-24002:
--

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/21086

> Task not serializable caused by 
> org.apache.parquet.io.api.Binary$ByteBufferBackedBinary.getBytes
> 
>
> Key: SPARK-24002
> URL: https://issues.apache.org/jira/browse/SPARK-24002
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>Priority: Major
>
> We have two queries: one is a 1000-line SQL query and the other a 3000-line SQL 
> query. They need to run for at least one hour with a heavy write workload to 
> reproduce the issue once. 
> {code}
> Py4JJavaError: An error occurred while calling o153.sql.
> : org.apache.spark.SparkException: Job aborted.
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:223)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:189)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79)
>   at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:190)
>   at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:190)
>   at org.apache.spark.sql.Dataset$$anonfun$59.apply(Dataset.scala:3021)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withCustomExecutionEnv(SQLExecution.scala:89)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:127)
>   at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3020)
>   at org.apache.spark.sql.Dataset.(Dataset.scala:190)
>   at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:74)
>   at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:646)
>   at sun.reflect.GeneratedMethodAccessor153.invoke(Unknown Source)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
>   at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:380)
>   at py4j.Gateway.invoke(Gateway.java:293)
>   at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
>   at py4j.commands.CallCommand.execute(CallCommand.java:79)
>   at py4j.GatewayConnection.run(GatewayConnection.java:226)
>   at java.lang.Thread.run(Thread.java:748)
> Caused by: org.apache.spark.SparkException: Exception thrown in Future.get: 
>   at 
> org.apache.spark.sql.execution.exchange.BroadcastExchangeExec.doExecuteBroadcast(BroadcastExchangeExec.scala:190)
>   at 
> org.apache.spark.sql.execution.InputAdapter.doExecuteBroadcast(WholeStageCodegenExec.scala:267)
>   at 
> org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec.doConsume(BroadcastNestedLoopJoinExec.scala:530)
>   at 
> org.apache.spark.sql.execution.CodegenSupport$class.consume(WholeStageCodegenExec.scala:155)
>   at 
> org.apache.spark.sql.execution.ProjectExec.consume(basicPhysicalOperators.scala:37)
>   at 
> org.apache.spark.sql.execution.ProjectExec.doConsume(basicPhysicalOperators.scala:69)
>   at 
> org.apache.spark.sql.execution.CodegenSupport$class.consume(WholeStageCodegenExec.scala:155)
>   at 
> org.apache.spark.sql.execution.FilterExec.consume(basicPhysicalOperators.scala:144)
>   ...
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:190)
>   ... 23 more
> Caused by: java.util.concurrent.ExecutionException: 
> org.apache.spark.SparkException: Task not serializable
>   at java.util.concurrent.FutureTask.report(FutureTask.java:122)
>   at java.util.concurrent.FutureTask.get(FutureTask.java:206)
>   at 
> org.apache.spark.sql.execution.exchange.BroadcastExchangeExec.doExecuteBroadcast(BroadcastExchangeExec.scala:179)
>   ... 276 more
> Caused by: org.apache.spark.SparkException: Task not serializable
>   at 
> org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:340)
>   at 
> org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:330)
>   at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:156)
>   at 

[jira] [Updated] (SPARK-24002) Task not serializable caused by org.apache.parquet.io.api.Binary$ByteBufferBackedBinary.getBytes

2018-04-17 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24002?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-24002:

Description: 
We have two queries: one is a 1000-line SQL query and the other a 3000-line SQL query. 
They need to run for at least one hour with a heavy write workload to reproduce the 
issue once. 

{code}
Py4JJavaError: An error occurred while calling o153.sql.
: org.apache.spark.SparkException: Job aborted.
at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:223)
at 
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:189)
at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79)
at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:190)
at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:190)
at org.apache.spark.sql.Dataset$$anonfun$59.apply(Dataset.scala:3021)
at 
org.apache.spark.sql.execution.SQLExecution$.withCustomExecutionEnv(SQLExecution.scala:89)
at 
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:127)
at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3020)
at org.apache.spark.sql.Dataset.(Dataset.scala:190)
at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:74)
at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:646)
at sun.reflect.GeneratedMethodAccessor153.invoke(Unknown Source)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:380)
at py4j.Gateway.invoke(Gateway.java:293)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:226)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.spark.SparkException: Exception thrown in Future.get: 
at 
org.apache.spark.sql.execution.exchange.BroadcastExchangeExec.doExecuteBroadcast(BroadcastExchangeExec.scala:190)
at 
org.apache.spark.sql.execution.InputAdapter.doExecuteBroadcast(WholeStageCodegenExec.scala:267)
at 
org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec.doConsume(BroadcastNestedLoopJoinExec.scala:530)
at 
org.apache.spark.sql.execution.CodegenSupport$class.consume(WholeStageCodegenExec.scala:155)
at 
org.apache.spark.sql.execution.ProjectExec.consume(basicPhysicalOperators.scala:37)
at 
org.apache.spark.sql.execution.ProjectExec.doConsume(basicPhysicalOperators.scala:69)
at 
org.apache.spark.sql.execution.CodegenSupport$class.consume(WholeStageCodegenExec.scala:155)
at 
org.apache.spark.sql.execution.FilterExec.consume(basicPhysicalOperators.scala:144)
...
at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:190)
... 23 more
Caused by: java.util.concurrent.ExecutionException: 
org.apache.spark.SparkException: Task not serializable
at java.util.concurrent.FutureTask.report(FutureTask.java:122)
at java.util.concurrent.FutureTask.get(FutureTask.java:206)
at 
org.apache.spark.sql.execution.exchange.BroadcastExchangeExec.doExecuteBroadcast(BroadcastExchangeExec.scala:179)
... 276 more
Caused by: org.apache.spark.SparkException: Task not serializable
at 
org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:340)
at 
org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:330)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:156)
at org.apache.spark.SparkContext.clean(SparkContext.scala:2380)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1.apply(RDD.scala:850)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1.apply(RDD.scala:849)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:371)
at org.apache.spark.rdd.RDD.mapPartitionsWithIndex(RDD.scala:849)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:417)
at 

[jira] [Updated] (SPARK-24002) Task not serializable caused by org.apache.parquet.io.api.Binary$ByteBufferBackedBinary.getBytes

2018-04-17 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24002?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-24002:

Description: 
{code}
java.lang.IllegalArgumentException
at java.nio.Buffer.position(Buffer.java:244)
at java.nio.HeapByteBuffer.get(HeapByteBuffer.java:153)
at java.nio.ByteBuffer.get(ByteBuffer.java:715)
at 
org.apache.parquet.io.api.Binary$ByteBufferBackedBinary.getBytes(Binary.java:405)
at 
org.apache.parquet.io.api.Binary$ByteBufferBackedBinary.getBytesUnsafe(Binary.java:414)
at 
org.apache.parquet.io.api.Binary$ByteBufferBackedBinary.writeObject(Binary.java:484)
at sun.reflect.GeneratedMethodAccessor190.invoke(Unknown Source)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1128)
at 
java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496)
at 
java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
{code}

  was:
{nocode}
java.lang.IllegalArgumentException
at java.nio.Buffer.position(Buffer.java:244)
at java.nio.HeapByteBuffer.get(HeapByteBuffer.java:153)
at java.nio.ByteBuffer.get(ByteBuffer.java:715)
at 
org.apache.parquet.io.api.Binary$ByteBufferBackedBinary.getBytes(Binary.java:405)
at 
org.apache.parquet.io.api.Binary$ByteBufferBackedBinary.getBytesUnsafe(Binary.java:414)
at 
org.apache.parquet.io.api.Binary$ByteBufferBackedBinary.writeObject(Binary.java:484)
at sun.reflect.GeneratedMethodAccessor190.invoke(Unknown Source)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1128)
at 
java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496)
at 
java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
{nocode}


> Task not serializable caused by 
> org.apache.parquet.io.api.Binary$ByteBufferBackedBinary.getBytes
> 
>
> Key: SPARK-24002
> URL: https://issues.apache.org/jira/browse/SPARK-24002
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>Priority: Major
>
> {code}
> java.lang.IllegalArgumentException
>   at java.nio.Buffer.position(Buffer.java:244)
>   at java.nio.HeapByteBuffer.get(HeapByteBuffer.java:153)
>   at java.nio.ByteBuffer.get(ByteBuffer.java:715)
>   at 
> org.apache.parquet.io.api.Binary$ByteBufferBackedBinary.getBytes(Binary.java:405)
>   at 
> org.apache.parquet.io.api.Binary$ByteBufferBackedBinary.getBytesUnsafe(Binary.java:414)
>   at 
> org.apache.parquet.io.api.Binary$ByteBufferBackedBinary.writeObject(Binary.java:484)
>   at sun.reflect.GeneratedMethodAccessor190.invoke(Unknown Source)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1128)
>   at 
> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496)
>   at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24002) Task not serializable caused by org.apache.parquet.io.api.Binary$ByteBufferBackedBinary.getBytes

2018-04-17 Thread Xiao Li (JIRA)
Xiao Li created SPARK-24002:
---

 Summary: Task not serializable caused by 
org.apache.parquet.io.api.Binary$ByteBufferBackedBinary.getBytes
 Key: SPARK-24002
 URL: https://issues.apache.org/jira/browse/SPARK-24002
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.3.0
Reporter: Xiao Li
Assignee: Xiao Li


{nocode}
java.lang.IllegalArgumentException
at java.nio.Buffer.position(Buffer.java:244)
at java.nio.HeapByteBuffer.get(HeapByteBuffer.java:153)
at java.nio.ByteBuffer.get(ByteBuffer.java:715)
at 
org.apache.parquet.io.api.Binary$ByteBufferBackedBinary.getBytes(Binary.java:405)
at 
org.apache.parquet.io.api.Binary$ByteBufferBackedBinary.getBytesUnsafe(Binary.java:414)
at 
org.apache.parquet.io.api.Binary$ByteBufferBackedBinary.writeObject(Binary.java:484)
at sun.reflect.GeneratedMethodAccessor190.invoke(Unknown Source)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1128)
at 
java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496)
at 
java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
{nocode}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23206) Additional Memory Tuning Metrics

2018-04-17 Thread Edwina Lu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16441006#comment-16441006
 ] 

Edwina Lu commented on SPARK-23206:
---

[~assia6], could you please try the new link, 
[https://docs.google.com/document/d/1fIL2XMHPnqs6kaeHr822iTvs08uuYnjP5roSGZfejyA/edit?usp=sharing]

> Additional Memory Tuning Metrics
> 
>
> Key: SPARK-23206
> URL: https://issues.apache.org/jira/browse/SPARK-23206
> Project: Spark
>  Issue Type: Umbrella
>  Components: Spark Core
>Affects Versions: 2.2.1
>Reporter: Edwina Lu
>Priority: Major
> Attachments: ExecutorsTab.png, ExecutorsTab2.png, 
> MemoryTuningMetricsDesignDoc.pdf, SPARK-23206 Design Doc.pdf, StageTab.png
>
>
> At LinkedIn, we have multiple clusters, running thousands of Spark 
> applications, and these numbers are growing rapidly. We need to ensure that 
> these Spark applications are well tuned – cluster resources, including 
> memory, should be used efficiently so that the cluster can support running 
> more applications concurrently, and applications should run quickly and 
> reliably.
> Currently there is limited visibility into how much memory executors are 
> using, and users are guessing numbers for executor and driver memory sizing. 
> These estimates are often much larger than needed, leading to memory wastage. 
> Examining the metrics for one cluster for a month, the average percentage of 
> used executor memory (max JVM used memory across executors /  
> spark.executor.memory) is 35%, leading to an average of 591GB unused memory 
> per application (number of executors * (spark.executor.memory - max JVM used 
> memory)). Spark has multiple memory regions (user memory, execution memory, 
> storage memory, and overhead memory), and to understand how memory is being 
> used and fine-tune allocation between regions, it would be useful to have 
> information about how much memory is being used for the different regions.
> To improve visibility into memory usage for the driver and executors and 
> different memory regions, the following additional memory metrics can be 
> tracked for each executor and driver:
>  * JVM used memory: the JVM heap size for the executor/driver.
>  * Execution memory: memory used for computation in shuffles, joins, sorts 
> and aggregations.
>  * Storage memory: memory used for caching and propagating internal data across 
> the cluster.
>  * Unified memory: sum of execution and storage memory.
> The peak values for each memory metric can be tracked for each executor, and 
> also per stage. This information can be shown in the Spark UI and the REST 
> APIs. Information for peak JVM used memory can help with determining 
> appropriate values for spark.executor.memory and spark.driver.memory, and 
> information about the unified memory region can help with determining 
> appropriate values for spark.memory.fraction and 
> spark.memory.storageFraction. Stage memory information can help identify 
> which stages are most memory intensive, and users can look into the relevant 
> code to determine if it can be optimized.
> The memory metrics can be gathered by adding the current JVM used memory, 
> execution memory and storage memory to the heartbeat. SparkListeners are 
> modified to collect the new metrics for the executors, stages and Spark 
> history log. Only interesting values (peak values per stage per executor) are 
> recorded in the Spark history log, to minimize the amount of additional 
> logging.
> We have attached our design documentation with this ticket and would like to 
> receive feedback from the community for this proposal.
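
As a back-of-the-envelope illustration of the unused-memory estimate described
above (all numbers below are hypothetical, chosen only to show the arithmetic,
not measurements from these clusters):

{code:scala}
// Hypothetical figures, only to illustrate the formula:
// unused = numExecutors * (spark.executor.memory - max JVM used memory)
val executorMemoryGb = 8.0    // spark.executor.memory
val maxJvmUsedGb     = 2.8    // peak JVM used memory observed across executors
val numExecutors     = 100

val usedFraction = maxJvmUsedGb / executorMemoryGb                  // 0.35
val unusedGb = numExecutors * (executorMemoryGb - maxJvmUsedGb)     // 520 GB idle

println(f"used: ${usedFraction * 100}%.0f%%, unused per application: $unusedGb%.0f GB")
{code}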



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23888) speculative task should not run on a given host where another attempt is already running on

2018-04-17 Thread wuyi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wuyi updated SPARK-23888:
-
Description: 
 

There's a bug in:
{code:java}
/** Check whether a task is currently running an attempt on a given host */
 private def hasAttemptOnHost(taskIndex: Int, host: String): Boolean = {
   taskAttempts(taskIndex).exists(_.host == host)
 }
{code}
This will ignore hosts which have finished attempts, so we should check whether 
the attempt is currently running on the given host. 

And it is possible for a speculative task to run on a host where another 
attempt failed here before.

Assume we have only two machines: host1, host2.  We first run task0.0 on host1. 
Then, due to  a long time waiting for task0.0, we launch a speculative task0.1 
on host2. And, task0.1 finally failed on host1, but it can not re-run since 
there's already  a copy running on host2. After another long time, we launch a 
new  speculative task0.2. And, now, we can run task0.2 on host1 again, since 
there's no more running attempt on host1.

**

After discussion in the PR, we simply make the comment consistent with the 
method's behavior. See details in PR#20998.

 

  was:
There's a bug in:
{code:java}
/** Check whether a task is currently running an attempt on a given host */
 private def hasAttemptOnHost(taskIndex: Int, host: String): Boolean = {
   taskAttempts(taskIndex).exists(_.host == host)
 }
{code}
This will ignore hosts which have finished attempts, so we should check whether 
the attempt is currently running on the given host. 

And it is possible for a speculative task to run on a host where another 
attempt failed here before.

Assume we have only two machines: host1, host2.  We first run task0.0 on host1. 
Then, due to  a long time waiting for task0.0, we launch a speculative task0.1 
on host2. And, task0.1 finally failed on host1, but it can not re-run since 
there's already  a copy running on host2. After another long time, we launch a 
new  speculative task0.2. And, now, we can run task0.2 on host1 again, since 
there's no more running attempt on host1.


> speculative task should not run on a given host where another attempt is 
> already running on
> ---
>
> Key: SPARK-23888
> URL: https://issues.apache.org/jira/browse/SPARK-23888
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler, Spark Core
>Affects Versions: 2.3.0
>Reporter: wuyi
>Priority: Major
>  Labels: speculation
> Fix For: 2.3.0
>
>
>  
> There's a bug in:
> {code:java}
> /** Check whether a task is currently running an attempt on a given host */
>  private def hasAttemptOnHost(taskIndex: Int, host: String): Boolean = {
>taskAttempts(taskIndex).exists(_.host == host)
>  }
> {code}
> This will ignore hosts which have finished attempts, so we should check 
> whether the attempt is currently running on the given host. 
> And it is possible for a speculative task to run on a host where another 
> attempt failed here before.
> Assume we have only two machines: host1, host2.  We first run task0.0 on 
> host1. Then, due to  a long time waiting for task0.0, we launch a speculative 
> task0.1 on host2. And, task0.1 finally failed on host1, but it can not re-run 
> since there's already  a copy running on host2. After another long time, we 
> launch a new  speculative task0.2. And, now, we can run task0.2 on host1 
> again, since there's no more running attempt on host1.
> **
> After discussion in the PR, we simply make the comment consistent with the 
> method's behavior. See details in PR#20998.
>  
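
For illustration only (the merged change only touched the comment, as noted
above), here is a standalone sketch contrasting the two possible checks. It uses
a hypothetical TaskAttempt stand-in rather than Spark's TaskInfo:

{code:scala}
// Hypothetical stand-in for the scheduler's per-task attempt bookkeeping.
case class TaskAttempt(host: String, finished: Boolean)

// What the code above does: any attempt on the host counts, finished or not.
def hasAttemptOnHost(attempts: Seq[TaskAttempt], host: String): Boolean =
  attempts.exists(_.host == host)

// What the original comment suggests: only attempts still running count.
def hasRunningAttemptOnHost(attempts: Seq[TaskAttempt], host: String): Boolean =
  attempts.exists(a => a.host == host && !a.finished)

val attempts = Seq(
  TaskAttempt("host1", finished = true),   // earlier attempt already done
  TaskAttempt("host2", finished = false))  // speculative copy still running

assert(hasAttemptOnHost(attempts, "host1"))          // true: counts the finished attempt
assert(!hasRunningAttemptOnHost(attempts, "host1"))  // false: nothing running there anymore
{code}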



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24001) Multinode cluster

2018-04-17 Thread Direselign (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Direselign updated SPARK-24001:
---
Attachment: Screenshot from 2018-04-17 22-47-39.png

> Multinode cluster 
> --
>
> Key: SPARK-24001
> URL: https://issues.apache.org/jira/browse/SPARK-24001
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Direselign
>Priority: Major
> Attachments: Screenshot from 2018-04-17 22-47-39.png
>
>
> I was trying to configure an Apache Spark cluster on two Ubuntu 16.04 machines 
> using YARN, but it is not working when I submit a task to the cluster.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23963) Queries on text-based Hive tables grow disproportionately slower as the number of columns increase

2018-04-17 Thread Ruslan Dautkhanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16440942#comment-16440942
 ] 

Ruslan Dautkhanov commented on SPARK-23963:
---

Thanks a lot [~bersprockets] 

Would it be possible to backport this to Spark 2.3 as well? Thanks!

> Queries on text-based Hive tables grow disproportionately slower as the 
> number of columns increase
> --
>
> Key: SPARK-23963
> URL: https://issues.apache.org/jira/browse/SPARK-23963
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Bruce Robbins
>Assignee: Bruce Robbins
>Priority: Minor
> Fix For: 2.4.0
>
>
> TableReader gets disproportionately slower as the number of columns in the 
> query increases.
> For example, reading a table with 6000 columns is 4 times more expensive per 
> record than reading a table with 3000 columns, rather than twice as expensive.
> The increase in processing time is due to several Lists (fieldRefs, 
> fieldOrdinals, and unwrappers), each of which the reader accesses by column 
> number for each column in a record. Because each List has O\(n\) time for 
> lookup by column number, these lookups grow increasingly expensive as the 
> column count increases.
> When I patched the code to change those 3 Lists to Arrays, the query times 
> became proportional.
>  
>  
>  
>  
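
A minimal sketch of the kind of change described above (build the column-indexed
lookup tables as arrays once, outside the per-record loop). The names below are
illustrative only, not the actual TableReader fields:

{code:scala}
// Illustrative only: a fieldRefs-style lookup table built as a Scala List has
// O(n) random access, so indexing it once per column per record makes each
// record O(columns^2).
val fieldNamesList: List[String] = (0 until 6000).map(i => s"col$i").toList

// Converting to an Array once makes each per-record lookup O(1).
val fieldNamesArr: Array[String] = fieldNamesList.toArray

def readRecord(columnOrdinals: Array[Int]): Array[String] =
  columnOrdinals.map(i => fieldNamesArr(i))   // constant-time lookups in the hot loop
{code}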



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24001) Multinode cluster

2018-04-17 Thread Direselign (JIRA)
Direselign created SPARK-24001:
--

 Summary: Multinode cluster 
 Key: SPARK-24001
 URL: https://issues.apache.org/jira/browse/SPARK-24001
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 2.3.0
Reporter: Direselign


I was trying to configure an Apache Spark cluster on two Ubuntu 16.04 machines 
using YARN, but it is not working when I submit a task to the cluster.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-22371) dag-scheduler-event-loop thread stopped with error Attempted to access garbage collected accumulator 5605982

2018-04-17 Thread Ben Doerr (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16402271#comment-16402271
 ] 

Ben Doerr edited comment on SPARK-22371 at 4/17/18 2:04 PM:


We've seen this for the first time on 2.3.0. 

The scenario that [~mayank.agarwal2305] described of "jobs running in parallel 
on datasets and union of all datasets and full gc triggered" sounds exactly 
like our scenario. We've been unable to upgrade because of this issue.


was (Author: craftsman):
We've seen this for the first time on 2.3.0. 

> dag-scheduler-event-loop thread stopped with error  Attempted to access 
> garbage collected accumulator 5605982
> -
>
> Key: SPARK-22371
> URL: https://issues.apache.org/jira/browse/SPARK-22371
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Mayank Agarwal
>Priority: Major
> Attachments: Helper.scala, ShuffleIssue.java, 
> driver-thread-dump-spark2.1.txt, sampledata
>
>
> Our Spark Jobs are getting stuck on DagScheduler.runJob as dagscheduler 
> thread is stopped because of *Attempted to access garbage collected 
> accumulator 5605982*.
> From our investigation it looks like the accumulator is cleaned by GC first, and 
> the same accumulator is then used for merging the results from the executor on 
> the task completion event.
> As java.lang.IllegalAccessError is a LinkageError, which is treated as a fatal 
> error, the dag-scheduler event loop finishes with the exception below.
> ---ERROR stack trace --
> Exception in thread "dag-scheduler-event-loop" java.lang.IllegalAccessError: 
> Attempted to access garbage collected accumulator 5605982
>   at 
> org.apache.spark.util.AccumulatorContext$$anonfun$get$1.apply(AccumulatorV2.scala:253)
>   at 
> org.apache.spark.util.AccumulatorContext$$anonfun$get$1.apply(AccumulatorV2.scala:249)
>   at scala.Option.map(Option.scala:146)
>   at 
> org.apache.spark.util.AccumulatorContext$.get(AccumulatorV2.scala:249)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$updateAccumulators$1.apply(DAGScheduler.scala:1083)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$updateAccumulators$1.apply(DAGScheduler.scala:1080)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>   at 
> org.apache.spark.scheduler.DAGScheduler.updateAccumulators(DAGScheduler.scala:1080)
>   at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1183)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1647)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1605)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1594)
> at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
> I am attaching the thread dump of the driver as well.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23948) Trigger mapstage's job listener in submitMissingTasks

2018-04-17 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23948?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16440887#comment-16440887
 ] 

Apache Spark commented on SPARK-23948:
--

User 'squito' has created a pull request for this issue:
https://github.com/apache/spark/pull/21085

> Trigger mapstage's job listener in submitMissingTasks
> -
>
> Key: SPARK-23948
> URL: https://issues.apache.org/jira/browse/SPARK-23948
> Project: Spark
>  Issue Type: New Feature
>  Components: Scheduler, Spark Core
>Affects Versions: 2.3.0
>Reporter: jin xing
>Assignee: jin xing
>Priority: Major
> Fix For: 2.4.0
>
>
> When SparkContext submits a map stage via "submitMapStage" to the DAGScheduler, 
> "markMapStageJobAsFinished" is called only in 
> (https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L933
>  and   
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1314);
> But consider the scenario below:
> 1. stage0 and stage1 are both "ShuffleMapStage"s, and stage1 depends on stage0;
> 2. We submit stage1 via "submitMapStage";
> 3. While stage1 is running, a "FetchFailed" happens, and stage0 and stage1 get 
> resubmitted as stage0_1 and stage1_1;
> 4. While stage0_1 is running, speculated tasks from the old stage1 come back as 
> succeeded, but stage1 is no longer inside "runningStages". So even though all 
> splits (including the speculated tasks) in stage1 succeeded, the job listener 
> of stage1 will not be called;
> 5. stage0_1 finishes and stage1_1 starts running. In "submitMissingTasks" there 
> are no missing tasks, but in the current code the job listener is not triggered.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23948) Trigger mapstage's job listener in submitMissingTasks

2018-04-17 Thread Imran Rashid (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Imran Rashid reassigned SPARK-23948:


Assignee: jin xing

> Trigger mapstage's job listener in submitMissingTasks
> -
>
> Key: SPARK-23948
> URL: https://issues.apache.org/jira/browse/SPARK-23948
> Project: Spark
>  Issue Type: New Feature
>  Components: Scheduler, Spark Core
>Affects Versions: 2.3.0
>Reporter: jin xing
>Assignee: jin xing
>Priority: Major
> Fix For: 2.4.0
>
>
> When SparkContext submits a map stage via "submitMapStage" to the DAGScheduler, 
> "markMapStageJobAsFinished" is called only in 
> (https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L933
>  and   
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1314);
> But consider the scenario below:
> 1. stage0 and stage1 are both "ShuffleMapStage"s, and stage1 depends on stage0;
> 2. We submit stage1 via "submitMapStage";
> 3. While stage1 is running, a "FetchFailed" happens, and stage0 and stage1 get 
> resubmitted as stage0_1 and stage1_1;
> 4. While stage0_1 is running, speculated tasks from the old stage1 come back as 
> succeeded, but stage1 is no longer inside "runningStages". So even though all 
> splits (including the speculated tasks) in stage1 succeeded, the job listener 
> of stage1 will not be called;
> 5. stage0_1 finishes and stage1_1 starts running. In "submitMissingTasks" there 
> are no missing tasks, but in the current code the job listener is not triggered.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23948) Trigger mapstage's job listener in submitMissingTasks

2018-04-17 Thread Imran Rashid (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Imran Rashid resolved SPARK-23948.
--
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 21019
[https://github.com/apache/spark/pull/21019]

> Trigger mapstage's job listener in submitMissingTasks
> -
>
> Key: SPARK-23948
> URL: https://issues.apache.org/jira/browse/SPARK-23948
> Project: Spark
>  Issue Type: New Feature
>  Components: Scheduler, Spark Core
>Affects Versions: 2.3.0
>Reporter: jin xing
>Assignee: jin xing
>Priority: Major
> Fix For: 2.4.0
>
>
> When SparkContext submits a map stage via "submitMapStage" to the DAGScheduler, 
> "markMapStageJobAsFinished" is called only in 
> (https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L933
>  and   
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1314);
> But consider the scenario below:
> 1. stage0 and stage1 are both "ShuffleMapStage"s, and stage1 depends on stage0;
> 2. We submit stage1 via "submitMapStage";
> 3. While stage1 is running, a "FetchFailed" happens, and stage0 and stage1 get 
> resubmitted as stage0_1 and stage1_1;
> 4. While stage0_1 is running, speculated tasks from the old stage1 come back as 
> succeeded, but stage1 is no longer inside "runningStages". So even though all 
> splits (including the speculated tasks) in stage1 succeeded, the job listener 
> of stage1 will not be called;
> 5. stage0_1 finishes and stage1_1 starts running. In "submitMissingTasks" there 
> are no missing tasks, but in the current code the job listener is not triggered.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22676) Avoid iterating all partition paths when spark.sql.hive.verifyPartitionPath=true

2018-04-17 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22676?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-22676.
-
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 19868
[https://github.com/apache/spark/pull/19868]

> Avoid iterating all partition paths when 
> spark.sql.hive.verifyPartitionPath=true
> 
>
> Key: SPARK-22676
> URL: https://issues.apache.org/jira/browse/SPARK-22676
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: jin xing
>Assignee: jin xing
>Priority: Major
> Fix For: 2.4.0
>
>
> In the current code, all partition paths are scanned when 
> spark.sql.hive.verifyPartitionPath=true, e.g. for a table like the one below:
> CREATE TABLE `test`(
>   `id` int,
>   `age` int,
>   `name` string)
> PARTITIONED BY (
>   `A` string,
>   `B` string)
> load data local inpath '/tmp/data1' into table test partition(A='00', B='00')
> load data local inpath '/tmp/data1' into table test partition(A='01', B='01')
> load data local inpath '/tmp/data1' into table test partition(A='10', B='10')
> load data local inpath '/tmp/data1' into table test partition(A='11', B='11')
> If I query with SQL -- "select * from test where year=2017 and month=12 and 
> day=03" -- the current code scans all partition paths, including 
> '/data/A=00/B=00', '/data/A=00/B=00', '/data/A=01/B=01', '/data/A=10/B=10', 
> '/data/A=11/B=11'. This costs a lot of time and memory. 
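
A rough sketch of the idea behind the fix (derive only the path that matches the
partition spec from the query, instead of listing every partition directory).
The helper below is hypothetical and not the code that was merged:

{code:scala}
// Hypothetical illustration: given the partition spec pushed down from the query,
// build only the matching path suffix instead of walking every A=*/B=* directory.
def partitionPathSuffix(spec: Map[String, String], partitionCols: Seq[String]): String =
  partitionCols.map(c => s"$c=${spec(c)}").mkString("/")

val spec = Map("A" -> "00", "B" -> "00")
val suffix = partitionPathSuffix(spec, Seq("A", "B"))   // "A=00/B=00"
// Only <table location>/A=00/B=00 would then be checked for existence,
// rather than every partition path of the table.
{code}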



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22676) Avoid iterating all partition paths when spark.sql.hive.verifyPartitionPath=true

2018-04-17 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22676?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-22676:
---

Assignee: jin xing

> Avoid iterating all partition paths when 
> spark.sql.hive.verifyPartitionPath=true
> 
>
> Key: SPARK-22676
> URL: https://issues.apache.org/jira/browse/SPARK-22676
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: jin xing
>Assignee: jin xing
>Priority: Major
> Fix For: 2.4.0
>
>
> In the current code, all partition paths are scanned when 
> spark.sql.hive.verifyPartitionPath=true, e.g. for a table like the one below:
> CREATE TABLE `test`(
>   `id` int,
>   `age` int,
>   `name` string)
> PARTITIONED BY (
>   `A` string,
>   `B` string)
> load data local inpath '/tmp/data1' into table test partition(A='00', B='00')
> load data local inpath '/tmp/data1' into table test partition(A='01', B='01')
> load data local inpath '/tmp/data1' into table test partition(A='10', B='10')
> load data local inpath '/tmp/data1' into table test partition(A='11', B='11')
> If I query with SQL -- "select * from test where year=2017 and month=12 and 
> day=03" -- the current code scans all partition paths, including 
> '/data/A=00/B=00', '/data/A=00/B=00', '/data/A=01/B=01', '/data/A=10/B=10', 
> '/data/A=11/B=11'. This costs a lot of time and memory. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23835) When Dataset.as converts column from nullable to non-nullable type, null Doubles are converted silently to -1

2018-04-17 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-23835.
-
   Resolution: Fixed
 Assignee: Marco Gaido
Fix Version/s: 2.4.0
   2.3.1

> When Dataset.as converts column from nullable to non-nullable type, null 
> Doubles are converted silently to -1
> -
>
> Key: SPARK-23835
> URL: https://issues.apache.org/jira/browse/SPARK-23835
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Joseph K. Bradley
>Assignee: Marco Gaido
>Priority: Major
> Fix For: 2.3.1, 2.4.0
>
>
> I constructed a DataFrame with a nullable java.lang.Double column (and an 
> extra Double column).  I then converted it to a Dataset using {{as[(Double, 
> Double)]}}.  When the Dataset is shown, it has a null.  When it is collected 
> and printed, the null is silently converted to a -1.
> Code snippet to reproduce this:
> {code}
> val localSpark = spark
> import localSpark.implicits._
> val df = Seq[(java.lang.Double, Double)](
>   (1.0, 2.0),
>   (3.0, 4.0),
>   (Double.NaN, 5.0),
>   (null, 6.0)
> ).toDF("a", "b")
> df.show()  // OUTPUT 1: has null
> df.printSchema()
> val data = df.as[(Double, Double)]
> data.show()  // OUTPUT 2: has null
> data.collect().foreach(println)  // OUTPUT 3: has -1
> {code}
> OUTPUT 1 and 2:
> {code}
> +----+---+
> |   a|  b|
> +----+---+
> | 1.0|2.0|
> | 3.0|4.0|
> | NaN|5.0|
> |null|6.0|
> +----+---+
> {code}
> OUTPUT 3:
> {code}
> (1.0,2.0)
> (3.0,4.0)
> (NaN,5.0)
> (-1.0,6.0)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15703) Make ListenerBus event queue size configurable

2018-04-17 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16440855#comment-16440855
 ] 

Thomas Graves commented on SPARK-15703:
---

This JIRA is purely about making the size of the event queue configurable, which 
allows you to increase it as long as you have sufficient driver memory.  There 
is no current fix for it dropping events. There is a fix that went into 2.3 
that ensures the critical services aren't affected:

https://issues.apache.org/jira/browse/SPARK-18838
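
For reference, a minimal sketch of raising the knob this JIRA added. The key
shown is the one introduced here; newer releases rename it to
spark.scheduler.listenerbus.eventqueue.capacity, so check the version you run:

{code:scala}
import org.apache.spark.SparkConf

// Sketch: raising the listener-bus event queue size (needs enough driver memory).
val conf = new SparkConf()
  .set("spark.scheduler.listenerbus.eventqueue.size", "100000")
// Pass this conf to the SparkContext / SparkSession builder as usual.
{code}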

> Make ListenerBus event queue size configurable
> --
>
> Key: SPARK-15703
> URL: https://issues.apache.org/jira/browse/SPARK-15703
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler, Web UI
>Affects Versions: 2.0.0
>Reporter: Thomas Graves
>Assignee: Dhruve Ashar
>Priority: Minor
> Fix For: 2.0.1, 2.1.0
>
> Attachments: Screen Shot 2016-06-01 at 11.21.32 AM.png, Screen Shot 
> 2016-06-01 at 11.23.48 AM.png, SparkListenerBus .png, 
> spark-dynamic-executor-allocation.png
>
>
> The Spark UI doesn't seem to be showing all the tasks and metrics.
> I ran a job with 10 tasks but the Detail stage page says it completed 93029:
> Summary Metrics for 93029 Completed Tasks
> The "Stages for all jobs" page lists that only 89519/10 tasks finished, but 
> it's completed.  The metrics for shuffled write and input are also incorrect.
> I will attach screen shots.
> I checked the logs and it does show that all the tasks actually finished.
> 16/06/01 16:15:42 INFO TaskSetManager: Finished task 59880.0 in stage 2.0 
> (TID 54038) in 265309 ms on 10.213.45.51 (10/10)
> 16/06/01 16:15:42 INFO YarnClusterScheduler: Removed TaskSet 2.0, whose tasks 
> have all completed, from pool



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24000) S3A: Create Table should fail on invalid AK/SK

2018-04-17 Thread Brahma Reddy Battula (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16440841#comment-16440841
 ] 

Brahma Reddy Battula commented on SPARK-24000:
--

Discussed with [~ste...@apache.org] offline: the create-table code should do a 
getFileStatus() on the path to see what's there, and so fail in any 
no-permissions situation.

 

Please forgive me if I am not clear. I am a newbie to Spark.
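
A minimal sketch of the getFileStatus() probe suggested above, using the Hadoop
FileSystem API; this is not the actual create-table code path:

{code:scala}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

// Sketch: probe the table location up front so that bad credentials surface as an
// access-denied error (or a missing-path error) instead of a silently "successful"
// create table.
def verifyLocationReadable(location: String, hadoopConf: Configuration): Unit = {
  val path = new Path(location)
  val fs = path.getFileSystem(hadoopConf)
  fs.getFileStatus(path)   // throws if the path is missing or the credentials cannot access it
}
{code}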

> S3A: Create Table should fail on invalid AK/SK
> --
>
> Key: SPARK-24000
> URL: https://issues.apache.org/jira/browse/SPARK-24000
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 2.3.0
>Reporter: Brahma Reddy Battula
>Priority: Major
>
> Currently, when we pass an invalid AK/SK, *create table* will still be a 
> *success*.
> When the S3AFileSystem is initialized, *verifyBucketExists()* is called, 
> which returns *true* because the status code 403 
> (*BUCKET_ACCESS_FORBIDDEN_STATUS_CODE*) is treated as "bucket exists" in the 
> following code:
> {code:java}
> public boolean doesBucketExist(String bucketName)
>     throws AmazonClientException, AmazonServiceException {
>   try {
>     headBucket(new HeadBucketRequest(bucketName));
>     return true;
>   } catch (AmazonServiceException ase) {
>     // A redirect error or a forbidden error means the bucket exists. So
>     // returning true.
>     if ((ase.getStatusCode() == Constants.BUCKET_REDIRECT_STATUS_CODE)
>         || (ase.getStatusCode() == Constants.BUCKET_ACCESS_FORBIDDEN_STATUS_CODE)) {
>       return true;
>     }
>     if (ase.getStatusCode() == Constants.NO_SUCH_BUCKET_STATUS_CODE) {
>       return false;
>     }
>     throw ase;
>   }
> }
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23875) Create IndexedSeq wrapper for ArrayData

2018-04-17 Thread Herman van Hovell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23875?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hovell resolved SPARK-23875.
---
   Resolution: Fixed
 Assignee: Liang-Chi Hsieh
Fix Version/s: 2.4.0

> Create IndexedSeq wrapper for ArrayData
> ---
>
> Key: SPARK-23875
> URL: https://issues.apache.org/jira/browse/SPARK-23875
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Liang-Chi Hsieh
>Assignee: Liang-Chi Hsieh
>Priority: Major
> Fix For: 2.4.0
>
>
> We don't have a good way to sequentially access {{UnsafeArrayData}} with a 
> common interface such as Seq. An example is {{MapObject}} where we need to 
> access several sequence collection types together. But {{UnsafeArrayData}} 
> doesn't implement {{ArrayData.array}}. Calling {{toArray}} will copy the 
> entire array. We can provide an {{IndexedSeq}} wrapper for {{ArrayData}}, so 
> we can avoid copying the entire array.
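
A minimal sketch of the idea (an illustration only, not necessarily the merged 
implementation): wrap ArrayData in an IndexedSeq whose element access delegates 
to the existing getters, so the backing array is never copied.

{code:scala}
// Sketch of an IndexedSeq view over ArrayData; assumes the catalyst ArrayData API
// (numElements() and get(ordinal, dataType)). Illustration only.
import org.apache.spark.sql.catalyst.util.ArrayData
import org.apache.spark.sql.types.DataType

class ArrayDataSeqView(data: ArrayData, elementType: DataType) extends IndexedSeq[Any] {

  override def length: Int = data.numElements()

  override def apply(idx: Int): Any = {
    if (idx < 0 || idx >= data.numElements()) {
      throw new IndexOutOfBoundsException(s"index $idx out of range [0, ${data.numElements()})")
    }
    // Delegate to the specialized getter; the underlying (possibly Unsafe) array
    // is read in place rather than copied via toArray.
    data.get(idx, elementType)
  }
}
{code}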



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24000) S3A: Create Table should fail on invalid AK/SK

2018-04-17 Thread Brahma Reddy Battula (JIRA)
Brahma Reddy Battula created SPARK-24000:


 Summary: S3A: Create Table should fail on invalid AK/SK
 Key: SPARK-24000
 URL: https://issues.apache.org/jira/browse/SPARK-24000
 Project: Spark
  Issue Type: Bug
  Components: Spark Shell
Affects Versions: 2.3.0
Reporter: Brahma Reddy Battula


Currently, when we pass an invalid AK/SK, *create table* still *succeeds*.

When the S3AFileSystem is initialized, *verifyBucketExists()* is called, which 
returns *true* because status code 403 (*BUCKET_ACCESS_FORBIDDEN_STATUS_CODE*) 
is treated in the following code as "the bucket exists":
{code:java}
public boolean doesBucketExist(String bucketName)
    throws AmazonClientException, AmazonServiceException {
  try {
    headBucket(new HeadBucketRequest(bucketName));
    return true;
  } catch (AmazonServiceException ase) {
    // A redirect error or a forbidden error means the bucket exists. So
    // returning true.
    if ((ase.getStatusCode() == Constants.BUCKET_REDIRECT_STATUS_CODE)
        || (ase.getStatusCode() == Constants.BUCKET_ACCESS_FORBIDDEN_STATUS_CODE)) {
      return true;
    }
    if (ase.getStatusCode() == Constants.NO_SUCH_BUCKET_STATUS_CODE) {
      return false;
    }
    throw ase;
  }
}{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23999) Spark SQL shell is a Stable one ? Can we use Spark SQL shell in our production environment?

2018-04-17 Thread Prabhu Bentick (JIRA)
Prabhu Bentick created SPARK-23999:
--

 Summary: Spark SQL shell is a Stable one ? Can we use Spark SQL 
shell in our production environment?
 Key: SPARK-23999
 URL: https://issues.apache.org/jira/browse/SPARK-23999
 Project: Spark
  Issue Type: Question
  Components: Spark Core, Spark Shell, Spark Submit
Affects Versions: 2.3.0
Reporter: Prabhu Bentick
 Fix For: 3.0.0, 2.3.0


Hi Team,

I understand from one of the developers working on Apache Spark that the Spark 
SQL shell is not stable and that it is not recommended to use the Spark SQL 
shell in a production environment. Is that true? Is a stable version going to 
be released with Apache Spark 3.0?

Can anyone please clarify.

Regards

Prabhu Bentick



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23747) Add EpochCoordinator unit tests

2018-04-17 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das reassigned SPARK-23747:
-

Assignee: Jose Torres

> Add EpochCoordinator unit tests
> ---
>
> Key: SPARK-23747
> URL: https://issues.apache.org/jira/browse/SPARK-23747
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Jose Torres
>Assignee: Jose Torres
>Priority: Major
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23747) Add EpochCoordinator unit tests

2018-04-17 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das resolved SPARK-23747.
---
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 20983
[https://github.com/apache/spark/pull/20983]

> Add EpochCoordinator unit tests
> ---
>
> Key: SPARK-23747
> URL: https://issues.apache.org/jira/browse/SPARK-23747
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Jose Torres
>Assignee: Jose Torres
>Priority: Major
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-22371) dag-scheduler-event-loop thread stopped with error Attempted to access garbage collected accumulator 5605982

2018-04-17 Thread Andrew Clegg (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16440683#comment-16440683
 ] 

Andrew Clegg edited comment on SPARK-22371 at 4/17/18 9:53 AM:
---

Another data point -- I've seen this happen (in 2.3.0) during cleanup after a 
task failure:

{code:none}
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
Exception while getting task result: java.lang.IllegalStateException: Attempted 
to access garbage collected accumulator 365
at 
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1599)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1587)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1586)
at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at 
org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1586)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
at scala.Option.foreach(Option.scala:257)
at 
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:831)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1820)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1769)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1758)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at 
org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:642)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2027)
at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:194)
... 31 more
{code}
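
For context, a simplified sketch (an assumption, loosely modeled on 
AccumulatorContext in AccumulatorV2.scala, which appears in the trace above) of 
the weak-reference registry that produces this error: once the owning 
accumulator has been garbage collected, a late lookup by id fails.

{code:scala}
// Simplified illustration, not the real AccumulatorContext: accumulators are
// tracked by id through weak references, so a lookup after GC has cleared the
// reference raises the "garbage collected accumulator" error seen above.
import java.lang.ref.WeakReference
import java.util.concurrent.ConcurrentHashMap

object MiniAccumulatorRegistry {
  private val originals = new ConcurrentHashMap[Long, WeakReference[AnyRef]]()

  def register(id: Long, acc: AnyRef): Unit =
    originals.put(id, new WeakReference[AnyRef](acc))

  def get(id: Long): Option[AnyRef] =
    Option(originals.get(id)).map { ref =>
      val acc = ref.get()
      if (acc == null) {
        // The weak reference was cleared by GC before this (late) event arrived.
        throw new IllegalStateException(
          s"Attempted to access garbage collected accumulator $id")
      }
      acc
    }
}
{code}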



> dag-scheduler-event-loop thread stopped with error  Attempted to access 
> garbage collected accumulator 5605982
> -
>
> Key: SPARK-22371
> URL: https://issues.apache.org/jira/browse/SPARK-22371
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Mayank Agarwal
>Priority: Major
> Attachments: Helper.scala, ShuffleIssue.java, 
> driver-thread-dump-spark2.1.txt, sampledata
>
>
> Our Spark jobs are getting stuck in DagScheduler.runJob because the 
> dag-scheduler thread stopped with *Attempted to access garbage collected 
> accumulator 5605982*.
> from our 

[jira] [Commented] (SPARK-22371) dag-scheduler-event-loop thread stopped with error Attempted to access garbage collected accumulator 5605982

2018-04-17 Thread Andrew Clegg (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16440683#comment-16440683
 ] 

Andrew Clegg commented on SPARK-22371:
--

Another data point -- I've seen this happen (in 2.3.0) during cleanup after a 
task failure:

{code:none}
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
Exception while getting task result: java.lang.IllegalStateException: Attempted 
to access garbage collected accumulator 365
at 
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1599)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1587)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1586)
at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at 
org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1586)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
at scala.Option.foreach(Option.scala:257)
at 
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:831)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1820)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1769)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1758)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at 
org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:642)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2027)
at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:194)
... 31 more
{code}

 

> dag-scheduler-event-loop thread stopped with error  Attempted to access 
> garbage collected accumulator 5605982
> -
>
> Key: SPARK-22371
> URL: https://issues.apache.org/jira/browse/SPARK-22371
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Mayank Agarwal
>Priority: Major
> Attachments: Helper.scala, ShuffleIssue.java, 
> driver-thread-dump-spark2.1.txt, sampledata
>
>
> Our Spark jobs are getting stuck in DagScheduler.runJob because the 
> dag-scheduler thread stopped with *Attempted to access garbage collected 
> accumulator 5605982*.
> From our investigation it looks like the accumulator is cleaned up by GC 
> first, and the same accumulator is then used for merging the results from the 
> executor on the task-completion event.
> Since java.lang.IllegalAccessError is a LinkageError, which is treated as a 
> fatal error, the dag-scheduler event loop terminates with the exception below.
> ---ERROR stack trace --
> Exception in thread "dag-scheduler-event-loop" java.lang.IllegalAccessError: 
> Attempted to access garbage collected accumulator 5605982
>   at 
> org.apache.spark.util.AccumulatorContext$$anonfun$get$1.apply(AccumulatorV2.scala:253)
>   at 
> org.apache.spark.util.AccumulatorContext$$anonfun$get$1.apply(AccumulatorV2.scala:249)
>   at scala.Option.map(Option.scala:146)
>   at 
> org.apache.spark.util.AccumulatorContext$.get(AccumulatorV2.scala:249)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$updateAccumulators$1.apply(DAGScheduler.scala:1083)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$updateAccumulators$1.apply(DAGScheduler.scala:1080)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>   at 
> org.apache.spark.scheduler.DAGScheduler.updateAccumulators(DAGScheduler.scala:1080)
>   at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1183)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1647)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1605)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1594)
> at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
> I am attaching the thread dump of driver as well 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For 

[jira] [Resolved] (SPARK-23687) Add MemoryStream

2018-04-17 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das resolved SPARK-23687.
---
   Resolution: Done
Fix Version/s: 2.4.0

> Add MemoryStream
> 
>
> Key: SPARK-23687
> URL: https://issues.apache.org/jira/browse/SPARK-23687
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Jose Torres
>Assignee: Jose Torres
>Priority: Major
> Fix For: 2.4.0
>
>
> We need a MemoryStream for continuous processing, both in order to write less 
> fragile tests and to eventually use existing stream tests to verify 
> functional equivalence.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23687) Add MemoryStream

2018-04-17 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das reassigned SPARK-23687:
-

Assignee: Jose Torres

> Add MemoryStream
> 
>
> Key: SPARK-23687
> URL: https://issues.apache.org/jira/browse/SPARK-23687
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Jose Torres
>Assignee: Jose Torres
>Priority: Major
> Fix For: 2.4.0
>
>
> We need a MemoryStream for continuous processing, both in order to write less 
> fragile tests and to eventually use existing stream tests to verify 
> functional equivalence.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23918) High-order function: array_min(x) → x

2018-04-17 Thread Takuya Ueshin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takuya Ueshin resolved SPARK-23918.
---
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 21025
[https://github.com/apache/spark/pull/21025]

> High-order function: array_min(x) → x
> -
>
> Key: SPARK-23918
> URL: https://issues.apache.org/jira/browse/SPARK-23918
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Marco Gaido
>Priority: Major
> Fix For: 2.4.0
>
>
> Ref: https://prestodb.io/docs/current/functions/array.html
> Returns the minimum value of input array.
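
A quick usage sketch in Scala (assuming a Spark build that already exposes the 
function as array_min in the SQL API, mirroring the Presto name):

{code:scala}
// Usage sketch; assumes array_min is available in the running Spark version.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("array-min-demo").getOrCreate()

// array_min returns the smallest element of the input array.
val minValue = spark.sql("SELECT array_min(array(3, 1, 2)) AS m").first().getInt(0)
println(minValue)  // prints 1
{code}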



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23918) High-order function: array_min(x) → x

2018-04-17 Thread Takuya Ueshin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takuya Ueshin reassigned SPARK-23918:
-

Assignee: Marco Gaido

> High-order function: array_min(x) → x
> -
>
> Key: SPARK-23918
> URL: https://issues.apache.org/jira/browse/SPARK-23918
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Marco Gaido
>Priority: Major
>
> Ref: https://prestodb.io/docs/current/functions/array.html
> Returns the minimum value of input array.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23998) It may be better to add @transient to field 'taskMemoryManager' in class Task, for it is only be set and used in executor side

2018-04-17 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23998?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23998:


Assignee: Apache Spark

> It may be better to add @transient to field 'taskMemoryManager' in class 
> Task, for it is only be set and used in executor side
> --
>
> Key: SPARK-23998
> URL: https://issues.apache.org/jira/browse/SPARK-23998
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: eaton
>Assignee: Apache Spark
>Priority: Minor
>
> It may be better to add @transient to the field 'taskMemoryManager' in class 
> Task, since it is only set and used on the executor side.
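
As a generic illustration of the proposal (a sketch only, not the actual 
org.apache.spark.scheduler.Task class; TaskLike is an invented stand-in), 
marking such a field @transient keeps it out of the serialized task object 
shipped from the driver:

{code:scala}
// Generic sketch of the suggested change; TaskLike is an invented stand-in for Task.
class TaskLike(val stageId: Int, val partitionId: Int) extends Serializable {

  // Only assigned and read on the executor, so it never needs to be serialized
  // and sent from the driver; @transient excludes it from Java serialization.
  @transient private var taskMemoryManager: AnyRef = _

  def setTaskMemoryManager(manager: AnyRef): Unit = {
    taskMemoryManager = manager
  }
}
{code}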



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23998) It may be better to add @transient to field 'taskMemoryManager' in class Task, for it is only be set and used in executor side

2018-04-17 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23998?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23998:


Assignee: (was: Apache Spark)

> It may be better to add @transient to field 'taskMemoryManager' in class 
> Task, for it is only be set and used in executor side
> --
>
> Key: SPARK-23998
> URL: https://issues.apache.org/jira/browse/SPARK-23998
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: eaton
>Priority: Minor
>
> It may be better to add @transient to the field 'taskMemoryManager' in class 
> Task, since it is only set and used on the executor side.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23998) It may be better to add @transient to field 'taskMemoryManager' in class Task, for it is only be set and used in executor side

2018-04-17 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16440598#comment-16440598
 ] 

Apache Spark commented on SPARK-23998:
--

User 'eatoncys' has created a pull request for this issue:
https://github.com/apache/spark/pull/21084

> It may be better to add @transient to field 'taskMemoryManager' in class 
> Task, for it is only be set and used in executor side
> --
>
> Key: SPARK-23998
> URL: https://issues.apache.org/jira/browse/SPARK-23998
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: eaton
>Priority: Minor
>
> It may be better to add @transient to the field 'taskMemoryManager' in class 
> Task, since it is only set and used on the executor side.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


