[jira] [Updated] (SPARK-8166) Consistently failing test: o.a.s.sql.sources.FilterScanSuite
[ https://issues.apache.org/jira/browse/SPARK-8166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrew Or updated SPARK-8166:
-----------------------------
    Description:
See screenshot. It's been OOM'ing on Master SBT tests specifically with the Hadoop 2.3 profile.
https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-SBT/2563/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.3,label=centos/
https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-SBT/2582/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.3,label=centos/
https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-SBT/2573/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.3,label=centos/
... 10+ others

  was:
It's been OOM'ing on Master SBT tests specifically with the Hadoop 2.3 profile.
https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-SBT/2563/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.3,label=centos/
https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-SBT/2582/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.3,label=centos/
https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-SBT/2573/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.3,label=centos/
... 10+ others

Consistently failing test: o.a.s.sql.sources.FilterScanSuite
------------------------------------------------------------
Key: SPARK-8166
URL: https://issues.apache.org/jira/browse/SPARK-8166
Project: Spark
Issue Type: Bug
Components: SQL, Tests
Affects Versions: 1.4.1
Reporter: Andrew Or
Assignee: Yin Huai
Priority: Blocker
Attachments: master SBT.png

See screenshot. It's been OOM'ing on Master SBT tests specifically with the Hadoop 2.3 profile.
https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-SBT/2563/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.3,label=centos/
https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-SBT/2582/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.3,label=centos/
https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-SBT/2573/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.3,label=centos/
... 10+ others
[jira] [Updated] (SPARK-8166) Consistently failing test: o.a.s.sql.sources.FilterScanSuite
[ https://issues.apache.org/jira/browse/SPARK-8166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrew Or updated SPARK-8166:
-----------------------------
    Attachment: master SBT.png

Consistently failing test: o.a.s.sql.sources.FilterScanSuite
------------------------------------------------------------
Key: SPARK-8166
URL: https://issues.apache.org/jira/browse/SPARK-8166
Project: Spark
Issue Type: Bug
Components: SQL, Tests
Affects Versions: 1.4.1
Reporter: Andrew Or
Assignee: Yin Huai
Priority: Blocker
Attachments: master SBT.png

It's been OOM'ing on Master SBT tests specifically with the Hadoop 2.3 profile.
https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-SBT/2563/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.3,label=centos/
https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-SBT/2582/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.3,label=centos/
https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-SBT/2573/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.3,label=centos/
... 10+ others
[jira] [Commented] (SPARK-8168) Add Python friendly constructor to PipelineModel
[ https://issues.apache.org/jira/browse/SPARK-8168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14577748#comment-14577748 ]

Apache Spark commented on SPARK-8168:
-------------------------------------
User 'mengxr' has created a pull request for this issue:
https://github.com/apache/spark/pull/6709

Add Python friendly constructor to PipelineModel
------------------------------------------------
Key: SPARK-8168
URL: https://issues.apache.org/jira/browse/SPARK-8168
Project: Spark
Issue Type: Improvement
Components: ML
Affects Versions: 1.4.0
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng

We are trying to migrate all Python implementations of Pipeline components to Scala. As part of this effort, PipelineModel should have a Python-friendly constructor.
[jira] [Assigned] (SPARK-8168) Add Python friendly constructor to PipelineModel
[ https://issues.apache.org/jira/browse/SPARK-8168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-8168:
-----------------------------------
    Assignee: Xiangrui Meng  (was: Apache Spark)

Add Python friendly constructor to PipelineModel
------------------------------------------------
Key: SPARK-8168
URL: https://issues.apache.org/jira/browse/SPARK-8168
Project: Spark
Issue Type: Improvement
Components: ML
Affects Versions: 1.4.0
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng

We are trying to migrate all Python implementations of Pipeline components to Scala. As part of this effort, PipelineModel should have a Python-friendly constructor.
[jira] [Updated] (SPARK-8166) Consistently failing test: o.a.s.sql.sources.FilteredScanSuite
[ https://issues.apache.org/jira/browse/SPARK-8166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrew Or updated SPARK-8166:
-----------------------------
    Affects Version/s:     (was: 1.4.1)

Consistently failing test: o.a.s.sql.sources.FilteredScanSuite
--------------------------------------------------------------
Key: SPARK-8166
URL: https://issues.apache.org/jira/browse/SPARK-8166
Project: Spark
Issue Type: Bug
Components: SQL, Tests
Affects Versions: 1.5.0
Reporter: Andrew Or
Assignee: Yin Huai
Priority: Blocker
Attachments: master SBT.png

See screenshot. It's been OOM'ing on Master SBT tests specifically with the Hadoop 2.3 profile.
https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-SBT/2563/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.3,label=centos/
https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-SBT/2582/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.3,label=centos/
https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-SBT/2573/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.3,label=centos/
... 10+ others
[jira] [Updated] (SPARK-8166) Consistently failing test: o.a.s.sql.sources.FilteredScanSuite
[ https://issues.apache.org/jira/browse/SPARK-8166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrew Or updated SPARK-8166:
-----------------------------
    Target Version/s: 1.5.0  (was: 1.4.1, 1.5.0)

Consistently failing test: o.a.s.sql.sources.FilteredScanSuite
--------------------------------------------------------------
Key: SPARK-8166
URL: https://issues.apache.org/jira/browse/SPARK-8166
Project: Spark
Issue Type: Bug
Components: SQL, Tests
Affects Versions: 1.5.0
Reporter: Andrew Or
Assignee: Yin Huai
Priority: Blocker
Attachments: master SBT.png

See screenshot. It's been OOM'ing on Master SBT tests specifically with the Hadoop 2.3 profile.
https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-SBT/2563/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.3,label=centos/
https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-SBT/2582/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.3,label=centos/
https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-SBT/2573/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.3,label=centos/
... 10+ others
[jira] [Assigned] (SPARK-8168) Add Python friendly constructor to PipelineModel
[ https://issues.apache.org/jira/browse/SPARK-8168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-8168:
-----------------------------------
    Assignee: Apache Spark  (was: Xiangrui Meng)

Add Python friendly constructor to PipelineModel
------------------------------------------------
Key: SPARK-8168
URL: https://issues.apache.org/jira/browse/SPARK-8168
Project: Spark
Issue Type: Improvement
Components: ML
Affects Versions: 1.4.0
Reporter: Xiangrui Meng
Assignee: Apache Spark

We are trying to migrate all Python implementations of Pipeline components to Scala. As part of this effort, PipelineModel should have a Python-friendly constructor.
[jira] [Created] (SPARK-8167) Tasks that fail due to YARN preemption can cause job failure
Patrick Woody created SPARK-8167:
------------------------------------
Summary: Tasks that fail due to YARN preemption can cause job failure
Key: SPARK-8167
URL: https://issues.apache.org/jira/browse/SPARK-8167
Project: Spark
Issue Type: Bug
Components: Scheduler, YARN
Affects Versions: 1.3.1
Reporter: Patrick Woody

Tasks that are running on preempted executors will count as FAILED with an ExecutorLostFailure. Unfortunately, this can quickly spiral out of control if a large resource shift is occurring, and the tasks get scheduled to executors that immediately get preempted as well.

The current workaround is to increase spark.task.maxFailures very high, but that can cause delays in true failures. We should ideally differentiate these task statuses so that they don't count towards the failure limit.
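For context, a minimal sketch of that workaround (illustrative app name and value; {{spark.task.maxFailures}} is the real setting and defaults to 4):

{code}
import org.apache.spark.{SparkConf, SparkContext}

// Raise the per-task failure budget so preemption-driven ExecutorLostFailures
// are less likely to kill the job. The trade-off noted in the issue applies:
// genuine failures now take far longer to surface.
val conf = new SparkConf()
  .setAppName("preemption-tolerant-job")   // hypothetical app name
  .set("spark.task.maxFailures", "64")     // default is 4; value is illustrative
val sc = new SparkContext(conf)
{code}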
[jira] [Issue Comment Deleted] (SPARK-8062) NullPointerException in SparkHadoopUtil.getFileSystemThreadStatistics
[ https://issues.apache.org/jira/browse/SPARK-8062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Josh Rosen updated SPARK-8062:
------------------------------
    Comment: was deleted

(was: While working to try to reproduce this bug, I noticed something rather curious: In {{InputOutputMetricsSuite}}, the output metrics tests are guarded by {{if}} statements that check whether the bytesWrittenOnThreadCallback is defined:

{code}
test("output metrics when writing text file") {
  val fs = FileSystem.getLocal(new Configuration())
  val outPath = new Path(fs.getWorkingDirectory, "outdir")
  if (SparkHadoopUtil.get.getFSBytesWrittenOnThreadCallback(outPath, fs.getConf).isDefined) {
    // ... Body of test case ...
  }
}
{code}

AFAIK this test was introduced in order to prevent this test's assertions from failing under pre-Hadoop-2.5 versions of Hadoop. Now, take a look at the regression test that I added to try to reproduce this bug:

{code}
test("exceptions while getting IO thread statistics should not fail tasks / jobs (SPARK-8062)") {
  FileSystem.getStatistics(null, classOf[FileSystem])
  val fs = FileSystem.getLocal(new Configuration())
  val outPath = new Path(fs.getWorkingDirectory, "outdir")
  // This test passes unless the following line is commented out. The following line therefore
  // has some side-effects that are impacting the system under test:
  SparkHadoopUtil.get.getFSBytesWrittenOnThreadCallback(outPath, fs.getConf).isDefined
  val rdd = sc.parallelize(Array("a", "b", "c", "d"), 2)
  try {
    rdd.saveAsTextFile(outPath.toString)
  } finally {
    fs.delete(outPath, true)
  }
}
{code}

In this test, I try to pollute the global FileSystem statistics registry by storing a statistics entry for a filesystem with a null URI. For this test, all I care about is Spark not crashing, so I didn't add the {{if}} check (I don't need to worry about the assertions failing on pre-Hadoop-2.5 versions here since there aren't any assertions that check the metrics for this test). Surprisingly, though, my test was unable to fail until I added a

{code}
SparkHadoopUtil.get.getFSBytesWrittenOnThreadCallback(outPath, fs.getConf).isDefined
{code}

check outside of an {{if}} statement. This implies that this method has side effects which influence whether other metrics retrieval code is called. I worry that this may imply that our other InputOutputMetrics code could be broken for real production jobs. I'd like to investigate this and fix this issue, while also hardening this code: I think that we should be performing significantly more null checks for the input and output of Hadoop methods and should be using a pure function to determine whether our Hadoop version supports these metrics rather than calling a method that might have side-effects (I think we can do this purely via reflection without actually creating any objects / calling any methods). Since this JIRA is somewhat time sensitive, though, I'm going to work on a patch just for the null checks here, then will open a followup to investigate further hardening of the input output metrics code.)

NullPointerException in SparkHadoopUtil.getFileSystemThreadStatistics
---------------------------------------------------------------------
Key: SPARK-8062
URL: https://issues.apache.org/jira/browse/SPARK-8062
Project: Spark
Issue Type: Bug
Components: Spark Core
Affects Versions: 1.2.1
Environment: MapR 4.0.1, Hadoop 2.4.1, Yarn
Reporter: Josh Rosen
Assignee: Josh Rosen
Fix For: 1.2.3

I received the following error report from a user:

While running a Spark Streaming job that reads from MapRfs and writes to HBase using Spark 1.2.1, the job intermittently experiences a total job failure due to the following errors:

{code}
15/05/28 10:35:50 ERROR executor.Executor: Exception in task 1.1 in stage 6.0 (TID 24)
java.lang.NullPointerException
	at org.apache.spark.deploy.SparkHadoopUtil$$anonfun$4.apply(SparkHadoopUtil.scala:178)
	at org.apache.spark.deploy.SparkHadoopUtil$$anonfun$4.apply(SparkHadoopUtil.scala:178)
	at scala.collection.TraversableLike$$anonfun$filter$1.apply(TraversableLike.scala:264)
	at scala.collection.Iterator$class.foreach(Iterator.scala:727)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
	at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
	at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
	at scala.collection.TraversableLike$class.filter(TraversableLike.scala:263)
	at scala.collection.AbstractTraversable.filter(Traversable.scala:105)
	at org.apache.spark.deploy.SparkHadoopUtil.getFileSystemThreadStatistics(SparkHadoopUtil.scala:178)
	at org.apache.spark.deploy.SparkHadoopUtil.getFSBytesReadOnThreadCallback(SparkHadoopUtil.scala:139)
	at
[jira] [Resolved] (SPARK-8062) NullPointerException in SparkHadoopUtil.getFileSystemThreadStatistics
[ https://issues.apache.org/jira/browse/SPARK-8062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Josh Rosen resolved SPARK-8062.
-------------------------------
    Resolution: Fixed
    Fix Version/s: 1.2.3

Issue resolved by pull request 6618
[https://github.com/apache/spark/pull/6618]

NullPointerException in SparkHadoopUtil.getFileSystemThreadStatistics
---------------------------------------------------------------------
Key: SPARK-8062
URL: https://issues.apache.org/jira/browse/SPARK-8062
Project: Spark
Issue Type: Bug
Components: Spark Core
Affects Versions: 1.2.1
Environment: MapR 4.0.1, Hadoop 2.4.1, Yarn
Reporter: Josh Rosen
Assignee: Josh Rosen
Fix For: 1.2.3

I received the following error report from a user:

While running a Spark Streaming job that reads from MapRfs and writes to HBase using Spark 1.2.1, the job intermittently experiences a total job failure due to the following errors:

{code}
15/05/28 10:35:50 ERROR executor.Executor: Exception in task 1.1 in stage 6.0 (TID 24)
java.lang.NullPointerException
	at org.apache.spark.deploy.SparkHadoopUtil$$anonfun$4.apply(SparkHadoopUtil.scala:178)
	at org.apache.spark.deploy.SparkHadoopUtil$$anonfun$4.apply(SparkHadoopUtil.scala:178)
	at scala.collection.TraversableLike$$anonfun$filter$1.apply(TraversableLike.scala:264)
	at scala.collection.Iterator$class.foreach(Iterator.scala:727)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
	at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
	at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
	at scala.collection.TraversableLike$class.filter(TraversableLike.scala:263)
	at scala.collection.AbstractTraversable.filter(Traversable.scala:105)
	at org.apache.spark.deploy.SparkHadoopUtil.getFileSystemThreadStatistics(SparkHadoopUtil.scala:178)
	at org.apache.spark.deploy.SparkHadoopUtil.getFSBytesReadOnThreadCallback(SparkHadoopUtil.scala:139)
	at org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:116)
	at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:107)
	at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:69)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
	at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
	at org.apache.spark.rdd.FlatMappedRDD.compute(FlatMappedRDD.scala:33)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
	at org.apache.spark.rdd.FilteredRDD.compute(FilteredRDD.scala:34)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
	at org.apache.spark.scheduler.Task.run(Task.scala:56)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:200)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
	at java.lang.Thread.run(Thread.java:744)
15/05/28 10:35:50 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 25
15/05/28 10:35:50 INFO executor.Executor: Running task 2.1 in stage 6.0 (TID 25)
15/05/28 10:35:50 INFO rdd.NewHadoopRDD: Input split: hdfs:/[REDACTED]
15/05/28 10:35:50 ERROR executor.Executor: Exception in task 2.1 in stage 6.0 (TID 25)
java.lang.NullPointerException
	at org.apache.spark.deploy.SparkHadoopUtil$$anonfun$4.apply(SparkHadoopUtil.scala:178)
	at org.apache.spark.deploy.SparkHadoopUtil$$anonfun$4.apply(SparkHadoopUtil.scala:178)
	at scala.collection.TraversableLike$$anonfun$filter$1.apply(TraversableLike.scala:264)
	at scala.collection.Iterator$class.foreach(Iterator.scala:727)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
	at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
{code}

Diving into the code here: The NPE is occurring on this line of SparkHadoopUtil (in 1.2.1.):
https://github.com/apache/spark/blob/v1.2.1/core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala#L178

Here's that block of code from 1.2.1 (it's the same in 1.2.2):

{code}
private def getFileSystemThreadStatistics(path: Path, conf: Configuration): Seq[AnyRef] = {
  val qualifiedPath = path.getFileSystem(conf).makeQualified(path)
  val scheme = qualifiedPath.toUri().getScheme()
  val stats = FileSystem.getAllStatistics().filter(_.getScheme().equals(scheme))
  //
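The {{filter}} line above NPEs when a registered filesystem reports a null scheme, which is exactly what the regression test earlier in this thread provokes with a null URI. A null-safe variant could look like this sketch (illustrative only, not the actual change from pull request 6618):

{code}
import scala.collection.JavaConverters._
import org.apache.hadoop.fs.FileSystem

// Skip statistics entries whose scheme is null instead of calling
// .equals on a possibly-null value.
def statisticsForScheme(scheme: String): Seq[FileSystem.Statistics] =
  FileSystem.getAllStatistics.asScala.filter { stats =>
    stats.getScheme != null && stats.getScheme == scheme
  }
{code}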
[jira] [Issue Comment Deleted] (SPARK-8062) NullPointerException in SparkHadoopUtil.getFileSystemThreadStatistics
[ https://issues.apache.org/jira/browse/SPARK-8062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Josh Rosen updated SPARK-8062:
------------------------------
    Comment: was deleted

(was: Alright, I've filed https://issues.apache.org/jira/browse/SPARK-8086 to follow up on hardening InputOutputMetricsSuite.)

NullPointerException in SparkHadoopUtil.getFileSystemThreadStatistics
---------------------------------------------------------------------
Key: SPARK-8062
URL: https://issues.apache.org/jira/browse/SPARK-8062
Project: Spark
Issue Type: Bug
Components: Spark Core
Affects Versions: 1.2.1
Environment: MapR 4.0.1, Hadoop 2.4.1, Yarn
Reporter: Josh Rosen
Assignee: Josh Rosen
Fix For: 1.2.3

I received the following error report from a user:

While running a Spark Streaming job that reads from MapRfs and writes to HBase using Spark 1.2.1, the job intermittently experiences a total job failure due to the following errors:

{code}
15/05/28 10:35:50 ERROR executor.Executor: Exception in task 1.1 in stage 6.0 (TID 24)
java.lang.NullPointerException
	at org.apache.spark.deploy.SparkHadoopUtil$$anonfun$4.apply(SparkHadoopUtil.scala:178)
	at org.apache.spark.deploy.SparkHadoopUtil$$anonfun$4.apply(SparkHadoopUtil.scala:178)
	at scala.collection.TraversableLike$$anonfun$filter$1.apply(TraversableLike.scala:264)
	at scala.collection.Iterator$class.foreach(Iterator.scala:727)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
	at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
	at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
	at scala.collection.TraversableLike$class.filter(TraversableLike.scala:263)
	at scala.collection.AbstractTraversable.filter(Traversable.scala:105)
	at org.apache.spark.deploy.SparkHadoopUtil.getFileSystemThreadStatistics(SparkHadoopUtil.scala:178)
	at org.apache.spark.deploy.SparkHadoopUtil.getFSBytesReadOnThreadCallback(SparkHadoopUtil.scala:139)
	at org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:116)
	at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:107)
	at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:69)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
	at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
	at org.apache.spark.rdd.FlatMappedRDD.compute(FlatMappedRDD.scala:33)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
	at org.apache.spark.rdd.FilteredRDD.compute(FilteredRDD.scala:34)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
	at org.apache.spark.scheduler.Task.run(Task.scala:56)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:200)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
	at java.lang.Thread.run(Thread.java:744)
15/05/28 10:35:50 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 25
15/05/28 10:35:50 INFO executor.Executor: Running task 2.1 in stage 6.0 (TID 25)
15/05/28 10:35:50 INFO rdd.NewHadoopRDD: Input split: hdfs:/[REDACTED]
15/05/28 10:35:50 ERROR executor.Executor: Exception in task 2.1 in stage 6.0 (TID 25)
java.lang.NullPointerException
	at org.apache.spark.deploy.SparkHadoopUtil$$anonfun$4.apply(SparkHadoopUtil.scala:178)
	at org.apache.spark.deploy.SparkHadoopUtil$$anonfun$4.apply(SparkHadoopUtil.scala:178)
	at scala.collection.TraversableLike$$anonfun$filter$1.apply(TraversableLike.scala:264)
	at scala.collection.Iterator$class.foreach(Iterator.scala:727)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
	at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
{code}

Diving into the code here: The NPE is occurring on this line of SparkHadoopUtil (in 1.2.1.):
https://github.com/apache/spark/blob/v1.2.1/core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala#L178

Here's that block of code from 1.2.1 (it's the same in 1.2.2):

{code}
private def getFileSystemThreadStatistics(path: Path, conf: Configuration): Seq[AnyRef] = {
  val qualifiedPath = path.getFileSystem(conf).makeQualified(path)
  val scheme = qualifiedPath.toUri().getScheme()
  val stats =
[jira] [Commented] (SPARK-2344) Add Fuzzy C-Means algorithm to MLlib
[ https://issues.apache.org/jira/browse/SPARK-2344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14577599#comment-14577599 ]

Alex commented on SPARK-2344:
-----------------------------
Hi guys,

What is the status of this issue? Beniamino, are you planning to submit your version of the algorithm?

Add Fuzzy C-Means algorithm to MLlib
------------------------------------
Key: SPARK-2344
URL: https://issues.apache.org/jira/browse/SPARK-2344
Project: Spark
Issue Type: New Feature
Components: MLlib
Reporter: Alex
Priority: Minor
Labels: clustering
Original Estimate: 1m
Remaining Estimate: 1m

I would like to add an FCM (Fuzzy C-Means) algorithm to MLlib. FCM is very similar to K-Means, which is already implemented; they differ only in the degree of membership each point has in each cluster (in FCM the membership is a value in the range [0..1], whereas in K-Means it is 0/1).

As part of the implementation I would like to:
- create a base class for K-Means and FCM
- implement the membership computation for each algorithm differently (in its own class)

I'd like this to be assigned to me.
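For reference, the textbook FCM membership update behind the [0..1] degrees described above (standard formula, not taken from this issue; m > 1 is the fuzzifier, c_k the cluster centers, C the number of clusters):

{code}
u_{ij} = \frac{1}{\sum_{k=1}^{C} \left( \frac{\lVert x_i - c_j \rVert}{\lVert x_i - c_k \rVert} \right)^{2/(m-1)}}
{code}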
[jira] [Resolved] (SPARK-8116) sc.range() doesn't match python range()
[ https://issues.apache.org/jira/browse/SPARK-8116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Reynold Xin resolved SPARK-8116.
--------------------------------
    Resolution: Fixed
    Fix Version/s: 1.5.0
                   1.4.1
    Assignee: Ted Blackman

sc.range() doesn't match python range()
---------------------------------------
Key: SPARK-8116
URL: https://issues.apache.org/jira/browse/SPARK-8116
Project: Spark
Issue Type: Bug
Components: PySpark
Affects Versions: 1.4.0, 1.4.1
Reporter: Ted Blackman
Assignee: Ted Blackman
Priority: Minor
Labels: easyfix
Fix For: 1.4.1, 1.5.0

Python's built-in range() and xrange() functions can take 1, 2, or 3 arguments. Ranges with just 1 argument are probably used the most frequently, e.g. {{for i in range(len(myList)): ...}}

However, in pyspark, the SparkContext range() method throws an error when called with a single argument, due to the way its arguments get passed into python's range function. There's no good reason that I can think of not to support the same syntax as the built-in function.

To fix this, we can set the default of the sc.range() method's `stop` argument to None, and then inside the method, if it is None, replace `stop` with `start` and set `start` to 0, which is what the C implementation of range() does:
https://github.com/python/cpython/blob/master/Objects/rangeobject.c#L87
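A sketch in Scala of the argument normalization the description proposes (illustrative shape only; the actual fix lives in PySpark's sc.range, whose single-argument form mirrors Python's built-in range()):

{code}
// With one argument, treat it as the exclusive end and start from 0;
// with two, behave as before.
def range(start: Long, stop: Option[Long] = None, step: Long = 1): Seq[Long] = {
  val (begin, end) = stop match {
    case Some(s) => (start, s)  // range(start, stop)
    case None    => (0L, start) // range(stop): start defaults to 0
  }
  begin until end by step
}

// range(4) == Seq(0, 1, 2, 3); range(2, Some(8), 2) == Seq(2, 4, 6)
{code}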
[jira] [Commented] (SPARK-7426) spark.ml AttributeFactory.fromStructField should allow other NumericTypes
[ https://issues.apache.org/jira/browse/SPARK-7426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14577648#comment-14577648 ]

Joseph K. Bradley commented on SPARK-7426:
------------------------------------------
Hi, I'm sorry for the slow review! I'll try to look soon...

spark.ml AttributeFactory.fromStructField should allow other NumericTypes
--------------------------------------------------------------------------
Key: SPARK-7426
URL: https://issues.apache.org/jira/browse/SPARK-7426
Project: Spark
Issue Type: Improvement
Components: ML
Reporter: Joseph K. Bradley
Priority: Minor

It currently only supports DoubleType, but it should support others, at least for fromStructField (importing into ML attribute format, rather than exporting).
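A sketch of the relaxation under review (hypothetical shape, not the actual patch): match on any NumericType when importing, instead of requiring DoubleType.

{code}
import org.apache.spark.ml.attribute.{Attribute, NumericAttribute}
import org.apache.spark.sql.types.{NumericType, StructField}

def fromStructField(field: StructField): Attribute = field.dataType match {
  // Any numeric column (IntegerType, FloatType, ...) imports as a numeric attribute.
  case _: NumericType => NumericAttribute.defaultAttr.withName(field.name)
  case other => throw new IllegalArgumentException(s"Unsupported type: $other")
}
{code}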
[jira] [Created] (SPARK-8166) Failing test: o.a.s.sql.sources.FilterScanSuite
Andrew Or created SPARK-8166:
------------------------------------
Summary: Failing test: o.a.s.sql.sources.FilterScanSuite
Key: SPARK-8166
URL: https://issues.apache.org/jira/browse/SPARK-8166
Project: Spark
Issue Type: Bug
Components: SQL, Tests
Affects Versions: 1.4.1
Reporter: Andrew Or
Assignee: Yin Huai
Priority: Blocker

It's been OOM'ing on Master SBT tests specifically with the Hadoop 2.3 profile.
https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-SBT/2563/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.3,label=centos/
https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-SBT/2582/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.3,label=centos/
https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-SBT/2573/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.3,label=centos/
... 10+ others
[jira] [Updated] (SPARK-8166) Consistently failing test: o.a.s.sql.sources.FilterScanSuite
[ https://issues.apache.org/jira/browse/SPARK-8166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrew Or updated SPARK-8166:
-----------------------------
    Summary: Consistently failing test: o.a.s.sql.sources.FilterScanSuite  (was: Failing test: o.a.s.sql.sources.FilterScanSuite)

Consistently failing test: o.a.s.sql.sources.FilterScanSuite
------------------------------------------------------------
Key: SPARK-8166
URL: https://issues.apache.org/jira/browse/SPARK-8166
Project: Spark
Issue Type: Bug
Components: SQL, Tests
Affects Versions: 1.4.1
Reporter: Andrew Or
Assignee: Yin Huai
Priority: Blocker

It's been OOM'ing on Master SBT tests specifically with the Hadoop 2.3 profile.
https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-SBT/2563/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.3,label=centos/
https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-SBT/2582/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.3,label=centos/
https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-SBT/2573/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.3,label=centos/
... 10+ others
[jira] [Updated] (SPARK-8166) Consistently failing test: o.a.s.sql.sources.FilteredScanSuite
[ https://issues.apache.org/jira/browse/SPARK-8166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrew Or updated SPARK-8166:
-----------------------------
    Summary: Consistently failing test: o.a.s.sql.sources.FilteredScanSuite  (was: Consistently failing test: o.a.s.sql.sources.FilterScanSuite)

Consistently failing test: o.a.s.sql.sources.FilteredScanSuite
--------------------------------------------------------------
Key: SPARK-8166
URL: https://issues.apache.org/jira/browse/SPARK-8166
Project: Spark
Issue Type: Bug
Components: SQL, Tests
Affects Versions: 1.4.1, 1.5.0
Reporter: Andrew Or
Assignee: Yin Huai
Priority: Blocker
Attachments: master SBT.png

See screenshot. It's been OOM'ing on Master SBT tests specifically with the Hadoop 2.3 profile.
https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-SBT/2563/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.3,label=centos/
https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-SBT/2582/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.3,label=centos/
https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-SBT/2573/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.3,label=centos/
... 10+ others
[jira] [Commented] (SPARK-8102) Big performance difference when joining 3 tables in different order
[ https://issues.apache.org/jira/browse/SPARK-8102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14577602#comment-14577602 ]

Yin Huai commented on SPARK-8102:
---------------------------------
For the first query,

{code}
-- snippet 1
SELECT g.period, c.categoryName, z.regionName, action, list_id, cnt
FROM t_category c, t_zipcode z, click_meter_site_grouped g
WHERE c.refCategoryID = g.category AND z.regionCode = g.region
{code}

I guess the analyzer got confused because the first two tables are c and z, but there is no join condition between them, so a {{CartesianProduct}} was used. Your tables may be small, but the result of a {{CartesianProduct}} can be huge. I guess we can do something in the analyzer to get rid of the {{CartesianProduct}}.

Big performance difference when joining 3 tables in different order
--------------------------------------------------------------------
Key: SPARK-8102
URL: https://issues.apache.org/jira/browse/SPARK-8102
Project: Spark
Issue Type: Improvement
Components: SQL
Affects Versions: 1.3.1
Environment: spark in local mode
Reporter: Hao Ren
Attachments: query2job.png, query3job.png

Given 3 tables loaded from CSV files (table name = size):
*click_meter_site_grouped* = 10 687 455 bytes
*t_zipcode* = 2 738 954 bytes
*t_category* = 2 182 bytes

When joining the 3 tables, I notice a large performance difference if they are joined in a different order. Here are the SQL queries to compare:

{code}
-- snippet 1
SELECT g.period, c.categoryName, z.regionName, action, list_id, cnt
FROM t_category c, t_zipcode z, click_meter_site_grouped g
WHERE c.refCategoryID = g.category AND z.regionCode = g.region
{code}

{code}
-- snippet 2
SELECT g.period, c.categoryName, z.regionName, action, list_id, cnt
FROM t_category c, click_meter_site_grouped g, t_zipcode z
WHERE c.refCategoryID = g.category AND z.regionCode = g.region
{code}

As you can see, the largest table, *click_meter_site_grouped*, is the last table in the FROM clause in the first snippet, and in the middle of the table list in the second. Snippet 2 runs three times faster than snippet 1 (8 seconds vs. 24 seconds). Since this data is just a sample of a much larger data set, running the same query on the original data would normally turn this into a serious performance issue.

After checking the log, we found something strange. In snippet 1's log:

{noformat}
15/06/04 15:32:03 INFO HadoopRDD: Input split: file:/home/invkrh/workspace/java/data_spark_etl/data-sample/bconf/zipcodes_4.txt:0+2738954
15/06/04 15:32:04 INFO HadoopRDD: Input split: file:/home/invkrh/workspace/java/data_spark_etl/data-sample/bconf/zipcodes_4.txt:0+2738954
15/06/04 15:32:04 INFO HadoopRDD: Input split: file:/home/invkrh/workspace/java/data_spark_etl/data-sample/bconf/zipcodes_4.txt:0+2738954
15/06/04 15:32:05 INFO HadoopRDD: Input split: file:/home/invkrh/workspace/java/data_spark_etl/data-sample/bconf/zipcodes_4.txt:0+2738954
15/06/04 15:32:05 INFO HadoopRDD: Input split: file:/home/invkrh/workspace/java/data_spark_etl/data-sample/bconf/zipcodes_4.txt:0+2738954
15/06/04 15:32:05 INFO HadoopRDD: Input split: file:/home/invkrh/workspace/java/data_spark_etl/data-sample/bconf/zipcodes_4.txt:0+2738954
15/06/04 15:32:05 INFO HadoopRDD: Input split: file:/home/invkrh/workspace/java/data_spark_etl/data-sample/bconf/zipcodes_4.txt:0+2738954
15/06/04 15:32:06 INFO HadoopRDD: Input split: file:/home/invkrh/workspace/java/data_spark_etl/data-sample/bconf/zipcodes_4.txt:0+2738954
15/06/04 15:32:06 INFO HadoopRDD: Input split: file:/home/invkrh/workspace/java/data_spark_etl/data-sample/bconf/zipcodes_4.txt:0+2738954
15/06/04 15:32:06 INFO HadoopRDD: Input split: file:/home/invkrh/workspace/java/data_spark_etl/data-sample/bconf/zipcodes_4.txt:0+2738954
15/06/04 15:32:07 INFO HadoopRDD: Input split: file:/home/invkrh/workspace/java/data_spark_etl/data-sample/bconf/zipcodes_4.txt:0+2738954
15/06/04 15:32:07 INFO HadoopRDD: Input split: file:/home/invkrh/workspace/java/data_spark_etl/data-sample/bconf/zipcodes_4.txt:0+2738954
15/06/04 15:32:07 INFO HadoopRDD: Input split: file:/home/invkrh/workspace/java/data_spark_etl/data-sample/bconf/zipcodes_4.txt:0+2738954
15/06/04 15:32:07 INFO HadoopRDD: Input split: file:/home/invkrh/workspace/java/data_spark_etl/data-sample/bconf/zipcodes_4.txt:0+2738954
15/06/04 15:32:08 INFO HadoopRDD: Input split: file:/home/invkrh/workspace/java/data_spark_etl/data-sample/bconf/zipcodes_4.txt:0+2738954
15/06/04 15:32:08 INFO HadoopRDD: Input split: file:/home/invkrh/workspace/java/data_spark_etl/data-sample/bconf/zipcodes_4.txt:0+2738954
15/06/04 15:32:08 INFO HadoopRDD: Input split: file:/home/invkrh/workspace/java/data_spark_etl/data-sample/bconf/zipcodes_4.txt:0+2738954
15/06/04 15:32:09 INFO HadoopRDD: Input split:
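One way to sidestep the {{CartesianProduct}} Yin describes, regardless of table order, is to give the planner an explicit join condition for every pair of tables it combines. A sketch with the DataFrame API (assuming {{c}}, {{z}}, {{g}} are the three tables loaded as DataFrames; only a few output columns shown):

{code}
// Each join carries a predicate, so no two tables are ever combined without
// one and the planner has no reason to emit a CartesianProduct.
val result = g
  .join(c, g("category") === c("refCategoryID"))
  .join(z, g("region") === z("regionCode"))
  .select(g("period"), c("categoryName"), z("regionName"))
{code}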
[jira] [Updated] (SPARK-8166) Consistently failing test: o.a.s.sql.sources.FilteredScanSuite
[ https://issues.apache.org/jira/browse/SPARK-8166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrew Or updated SPARK-8166:
-----------------------------
    Affects Version/s: 1.5.0

Consistently failing test: o.a.s.sql.sources.FilteredScanSuite
--------------------------------------------------------------
Key: SPARK-8166
URL: https://issues.apache.org/jira/browse/SPARK-8166
Project: Spark
Issue Type: Bug
Components: SQL, Tests
Affects Versions: 1.4.1, 1.5.0
Reporter: Andrew Or
Assignee: Yin Huai
Priority: Blocker
Attachments: master SBT.png

See screenshot. It's been OOM'ing on Master SBT tests specifically with the Hadoop 2.3 profile.
https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-SBT/2563/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.3,label=centos/
https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-SBT/2582/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.3,label=centos/
https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-SBT/2573/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.3,label=centos/
... 10+ others
[jira] [Commented] (SPARK-7426) spark.ml AttributeFactory.fromStructField should allow other NumericTypes
[ https://issues.apache.org/jira/browse/SPARK-7426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14577629#comment-14577629 ]

Mike Dusenberry commented on SPARK-7426:
----------------------------------------
[~josephkb] Do you have any thoughts on this PR?

spark.ml AttributeFactory.fromStructField should allow other NumericTypes
--------------------------------------------------------------------------
Key: SPARK-7426
URL: https://issues.apache.org/jira/browse/SPARK-7426
Project: Spark
Issue Type: Improvement
Components: ML
Reporter: Joseph K. Bradley
Priority: Minor

It currently only supports DoubleType, but it should support others, at least for fromStructField (importing into ML attribute format, rather than exporting).
[jira] [Commented] (SPARK-8167) Tasks that fail due to YARN preemption can cause job failure
[ https://issues.apache.org/jira/browse/SPARK-8167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14577727#comment-14577727 ]

Matt Cheah commented on SPARK-8167:
-----------------------------------
To be clear, this is independent of SPARK-7451. SPARK-7451 helps for the case where executors die too many times from preemption, but it does not help if the exact same task gets preempted many times.

Tasks that fail due to YARN preemption can cause job failure
------------------------------------------------------------
Key: SPARK-8167
URL: https://issues.apache.org/jira/browse/SPARK-8167
Project: Spark
Issue Type: Bug
Components: Scheduler, YARN
Affects Versions: 1.3.1
Reporter: Patrick Woody

Tasks that are running on preempted executors will count as FAILED with an ExecutorLostFailure. Unfortunately, this can quickly spiral out of control if a large resource shift is occurring, and the tasks get scheduled to executors that immediately get preempted as well.

The current workaround is to increase spark.task.maxFailures very high, but that can cause delays in true failures. We should ideally differentiate these task statuses so that they don't count towards the failure limit.
[jira] [Commented] (SPARK-1403) Spark on Mesos does not set Thread's context class loader
[ https://issues.apache.org/jira/browse/SPARK-1403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14577552#comment-14577552 ]

Kannan Rajah commented on SPARK-1403:
-------------------------------------
A similar problem has been reported while running Spark on Mesos using the MapR distribution. The current thread's context class loader is NULL inside the executor process, causing an NPE in MapR code. Refer to this discussion:
http://answers.mapr.com/questions/163353/spark-from-apache-downloads-site-for-mapr.html#answer-163484

Spark on Mesos does not set Thread's context class loader
----------------------------------------------------------
Key: SPARK-1403
URL: https://issues.apache.org/jira/browse/SPARK-1403
Project: Spark
Issue Type: Bug
Components: Spark Core
Affects Versions: 1.0.0
Environment: ubuntu 12.04 on vagrant
Reporter: Bharath Bhushan
Priority: Blocker
Fix For: 1.0.0

I can run spark 0.9.0 on mesos but not spark 1.0.0. This is because the spark executor on the mesos slave throws a java.lang.ClassNotFoundException for org.apache.spark.serializer.JavaSerializer. The lengthy discussion is here:
http://apache-spark-user-list.1001560.n3.nabble.com/java-lang-ClassNotFoundException-spark-on-mesos-td3510.html#a3513
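A sketch of the usual defensive pattern for the null context class loader these reports describe (illustrative only, not the actual Spark or MapR fix):

{code}
object ClassLoaderGuard {
  // Libraries commonly load classes via the thread's context class loader,
  // which fails downstream when the launcher never set one. Fall back to the
  // loader that loaded this class.
  def ensureContextClassLoader(): Unit = {
    val loader = Option(Thread.currentThread().getContextClassLoader)
      .getOrElse(getClass.getClassLoader)
    Thread.currentThread().setContextClassLoader(loader)
  }
}
{code}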
[jira] [Resolved] (SPARK-8121) When using with Hadoop 1.x, spark.sql.parquet.output.committer.class is overridden by spark.sql.sources.outputCommitterClass
[ https://issues.apache.org/jira/browse/SPARK-8121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yin Huai resolved SPARK-8121.
-----------------------------
    Resolution: Fixed
    Fix Version/s: 1.4.1

Issue resolved by pull request 6705
[https://github.com/apache/spark/pull/6705]

When using with Hadoop 1.x, spark.sql.parquet.output.committer.class is overridden by spark.sql.sources.outputCommitterClass
-----------------------------------------------------------------------------------------------------------------------------
Key: SPARK-8121
URL: https://issues.apache.org/jira/browse/SPARK-8121
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 1.4.0
Reporter: Cheng Lian
Assignee: Cheng Lian
Fix For: 1.4.1

When using Spark with Hadoop 1.x (the version I tested is 1.2.0) and {{spark.sql.sources.outputCommitterClass}} is configured, {{spark.sql.parquet.output.committer.class}} will be overridden. For example, if {{spark.sql.parquet.output.committer.class}} is set to {{FileOutputCommitter}}, while {{spark.sql.sources.outputCommitterClass}} is set to {{DirectParquetOutputCommitter}}, neither {{_metadata}} nor {{_common_metadata}} will be written because {{FileOutputCommitter}} overrides {{DirectParquetOutputCommitter}}.

The reason is that {{InsertIntoHadoopFsRelation}} initializes the {{TaskAttemptContext}} before calling {{ParquetRelation2.prepareForWriteJob()}}, which sets up the Parquet output committer class. Meanwhile, in Hadoop 1.x, the {{TaskAttemptContext}} constructor clones the job configuration, and thus doesn't share the job configuration passed to {{ParquetRelation2.prepareForWriteJob()}}.

This issue can be fixed by simply [switching these two lines|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/sources/commands.scala#L285-L286].

Here is a Spark shell snippet for reproducing this issue:

{code}
import sqlContext._

sc.hadoopConfiguration.set(
  "spark.sql.sources.outputCommitterClass",
  "org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter")

sc.hadoopConfiguration.set(
  "spark.sql.parquet.output.committer.class",
  "org.apache.spark.sql.parquet.DirectParquetOutputCommitter")

range(0, 1).write.mode("overwrite").parquet("file:///tmp/foo")
{code}

Then check {{/tmp/foo}}, Parquet summary files are missing:

{noformat}
/tmp/foo
├── _SUCCESS
├── part-r-1.gz.parquet
├── part-r-2.gz.parquet
├── part-r-3.gz.parquet
├── part-r-4.gz.parquet
├── part-r-5.gz.parquet
├── part-r-6.gz.parquet
├── part-r-7.gz.parquet
└── part-r-8.gz.parquet
{noformat}
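The Hadoop 1.x hazard in the description can be shown in isolation: a {{Configuration}} cloned before a key is set never sees that key, which is what happens when the {{TaskAttemptContext}} is created before {{ParquetRelation2.prepareForWriteJob()}} runs. A minimal, runnable illustration (hypothetical object name; requires hadoop-common on the classpath):

{code}
import org.apache.hadoop.conf.Configuration

object CloneBeforeSet {
  def main(args: Array[String]): Unit = {
    val jobConf = new Configuration()
    // Analogous to constructing the TaskAttemptContext first:
    // Hadoop 1.x copies the configuration at this point.
    val cloned = new Configuration(jobConf)
    jobConf.set("spark.sql.parquet.output.committer.class",
      "org.apache.spark.sql.parquet.DirectParquetOutputCommitter")
    // Prints null: the clone never sees the committer class set afterwards.
    println(cloned.get("spark.sql.parquet.output.committer.class"))
  }
}
{code}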
[jira] [Created] (SPARK-8168) Add Python friendly constructor to PipelineModel
Xiangrui Meng created SPARK-8168:
------------------------------------
Summary: Add Python friendly constructor to PipelineModel
Key: SPARK-8168
URL: https://issues.apache.org/jira/browse/SPARK-8168
Project: Spark
Issue Type: Improvement
Components: ML
Affects Versions: 1.4.0
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng

We are trying to migrate all Python implementations of Pipeline components to Scala. As part of this effort, PipelineModel should have a Python-friendly constructor.
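As a rough illustration of what a Python-friendly constructor can mean here (hypothetical shape, not the actual patch): PySpark reaches the JVM through Py4J, which passes Java collections, so accepting a plain Java list spares the Python wrapper a manual conversion.

{code}
import scala.collection.JavaConverters._

// Stand-in for org.apache.spark.ml.Transformer, to keep the sketch self-contained.
trait Transformer

class PipelineModel(val stages: Array[Transformer]) {
  // Python-friendly auxiliary constructor: Py4J hands over java.util.List,
  // converted once here rather than in the Python wrapper.
  def this(stages: java.util.List[Transformer]) = this(stages.asScala.toArray)
}
{code}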
[jira] [Resolved] (SPARK-8158) HiveShim improvement
[ https://issues.apache.org/jira/browse/SPARK-8158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Reynold Xin resolved SPARK-8158.
--------------------------------
    Resolution: Fixed
    Fix Version/s: 1.5.0
    Assignee: Adrian Wang

HiveShim improvement
--------------------
Key: SPARK-8158
URL: https://issues.apache.org/jira/browse/SPARK-8158
Project: Spark
Issue Type: Improvement
Components: SQL
Reporter: Adrian Wang
Assignee: Adrian Wang
Fix For: 1.5.0

1. explicitly import implicit conversion support
2. use .nonEmpty instead of .size > 0
3. use val instead of var
4. fix comment indentation
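Illustrative examples of the four cleanups listed above (generic code, not the actual HiveShim diff):

{code}
import scala.collection.JavaConverters._ // 1. explicitly import implicit-conversion support

object StyleExamples {
  def describe(parts: java.util.List[String]): String = {
    val names = parts.asScala // 3. val instead of var: never reassigned
    if (names.nonEmpty) {     // 2. .nonEmpty instead of .size > 0
      names.mkString(", ")
    } else {
      "empty"                 // 4. comments indented to match the code
    }
  }
}
{code}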
[jira] [Updated] (SPARK-7917) Spark doesn't clean up Application Directories (local dirs)
[ https://issues.apache.org/jira/browse/SPARK-7917?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen updated SPARK-7917:
-----------------------------
    Priority: Minor  (was: Major)

Spark doesn't clean up Application Directories (local dirs)
------------------------------------------------------------
Key: SPARK-7917
URL: https://issues.apache.org/jira/browse/SPARK-7917
Project: Spark
Issue Type: Bug
Components: Spark Core
Affects Versions: 1.3.0
Reporter: Zach Fry
Priority: Minor

Similar to SPARK-4834. Spark does clean up the cache and lock files in the local dirs; however, it doesn't clean up the actual directories. We have to write custom scripts to go back through the local dirs, find directories that don't contain any files, and clear those out.

It's a pretty simple repro: run a job that does some shuffling, wait for the shuffle files to get cleaned up, then go and look on disk at spark.local.dir and notice that the directory(s) are still there, but there are no files in them.
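A sketch of that repro, runnable in spark-shell where {{sc}} is predefined (illustrative job; any shuffle will do):

{code}
// reduceByKey forces shuffle files to be written under spark.local.dir.
val counts = sc.parallelize(1 to 1000000)
  .map(i => (i % 100, 1))
  .reduceByKey(_ + _)
counts.count()
// Once the shuffle files are cleaned up, the per-application directories
// under spark.local.dir remain on disk, now empty.
{code}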
[jira] [Resolved] (SPARK-8148) Do not use FloatType in partition column inference
[ https://issues.apache.org/jira/browse/SPARK-8148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Reynold Xin resolved SPARK-8148.
--------------------------------
    Resolution: Fixed
    Fix Version/s: (was: 1.4.0)
                   1.5.0

Do not use FloatType in partition column inference
--------------------------------------------------
Key: SPARK-8148
URL: https://issues.apache.org/jira/browse/SPARK-8148
Project: Spark
Issue Type: Sub-task
Components: SQL
Reporter: Reynold Xin
Assignee: Reynold Xin
Fix For: 1.5.0

Always use DoubleType to be more stable and less error prone.
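An illustrative consequence (hypothetical path and schema; behavior as the one-line description implies):

{code}
// With a layout like /data/events/ratio=1.5/part-*.parquet, the fractional
// value in the directory name is now inferred as DoubleType.
val df = sqlContext.read.parquet("/data/events")
df.printSchema()
// root
//  |-- ratio: double (nullable = true)   <- DoubleType, never FloatType
{code}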
[jira] [Updated] (SPARK-8167) Tasks that fail due to YARN preemption can cause job failure
[ https://issues.apache.org/jira/browse/SPARK-8167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Patrick Woody updated SPARK-8167:
---------------------------------
    Priority: Critical  (was: Major)

Tasks that fail due to YARN preemption can cause job failure
------------------------------------------------------------
Key: SPARK-8167
URL: https://issues.apache.org/jira/browse/SPARK-8167
Project: Spark
Issue Type: Bug
Components: Scheduler, YARN
Affects Versions: 1.3.1
Reporter: Patrick Woody
Priority: Critical

Tasks that are running on preempted executors will count as FAILED with an ExecutorLostFailure. Unfortunately, this can quickly spiral out of control if a large resource shift is occurring, and the tasks get scheduled to executors that immediately get preempted as well.

The current workaround is to increase spark.task.maxFailures very high, but that can cause delays in true failures. We should ideally differentiate these task statuses so that they don't count towards the failure limit.
[jira] [Updated] (SPARK-8126) Use temp directory under build dir for unit tests
[ https://issues.apache.org/jira/browse/SPARK-8126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen updated SPARK-8126:
-----------------------------
    Fix Version/s: 1.4.1

Use temp directory under build dir for unit tests
--------------------------------------------------
Key: SPARK-8126
URL: https://issues.apache.org/jira/browse/SPARK-8126
Project: Spark
Issue Type: Improvement
Components: Build
Reporter: Marcelo Vanzin
Assignee: Marcelo Vanzin
Priority: Minor
Labels: backport-needed
Fix For: 1.4.1, 1.5.0

Spark's unit tests leave a lot of garbage in /tmp after a run, making it hard to clean things up. Let's place those files under the build dir so that mvn|sbt|git clean can do their job.
[jira] [Created] (SPARK-8169) Add StopWordsRemover as a transformer
Xiangrui Meng created SPARK-8169:
------------------------------------
Summary: Add StopWordsRemover as a transformer
Key: SPARK-8169
URL: https://issues.apache.org/jira/browse/SPARK-8169
Project: Spark
Issue Type: New Feature
Components: ML
Affects Versions: 1.5.0
Reporter: Xiangrui Meng

StopWordsRemover takes a string array column and outputs a string array column with all defined stop words removed. The transformer should also come with a standard set of stop words as default.

{code}
val stopWords = new StopWordsRemover()
  .setInputCol("words")
  .setOutputCol("cleanWords")
  .setStopWords(Array(...)) // optional
val output = stopWords.transform(df)
{code}
[jira] [Created] (SPARK-8172) Driver UI should enable viewing of dead executors' logs
Josh Rosen created SPARK-8172:
---------------------------------
Summary: Driver UI should enable viewing of dead executors' logs
Key: SPARK-8172
URL: https://issues.apache.org/jira/browse/SPARK-8172
Project: Spark
Issue Type: New Feature
Components: Web UI
Reporter: Josh Rosen

If possible, the Spark driver UI's executor page should include a list of dead executors (perhaps of bounded size) and should have log viewer links for viewing those dead executors' logs.
[jira] [Commented] (SPARK-8162) Running spark-shell causes NullPointerException
[ https://issues.apache.org/jira/browse/SPARK-8162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14578141#comment-14578141 ]

Apache Spark commented on SPARK-8162:
-------------------------------------
User 'andrewor14' has created a pull request for this issue:
https://github.com/apache/spark/pull/6711

Running spark-shell causes NullPointerException
-----------------------------------------------
Key: SPARK-8162
URL: https://issues.apache.org/jira/browse/SPARK-8162
Project: Spark
Issue Type: Bug
Components: Build, Spark Shell
Affects Versions: 1.5.0
Reporter: Weizhong

Running spark-shell on the latest master branch fails. Details:

{noformat}
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.5.0-SNAPSHOT
      /_/

Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_40)
Type in expressions to have them evaluated.
Type :help for more information.
error: error while loading JobProgressListener, Missing dependency 'bad symbolic reference. A signature in JobProgressListener.class refers to term annotations in package com.google.common which is not available. It may be completely missing from the current classpath, or the version on the classpath might be incompatible with the version used when compiling JobProgressListener.class.', required by /opt/apache/spark/lib/spark-assembly-1.5.0-SNAPSHOT-hadoop2.7.0.jar(org/apache/spark/ui/jobs/JobProgressListener.class)
java.lang.NullPointerException
	at org.apache.spark.sql.SQLContext.<init>(SQLContext.scala:193)
	at org.apache.spark.sql.hive.HiveContext.<init>(HiveContext.scala:68)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
	at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
	at org.apache.spark.repl.SparkILoop.createSQLContext(SparkILoop.scala:1028)
	at $iwC$$iwC.<init>(<console>:9)
	at $iwC.<init>(<console>:18)
	at <init>(<console>:20)
	at .<init>(<console>:24)
	at .<clinit>(<console>)
	at .<init>(<console>:7)
	at .<clinit>(<console>)
	at $print(<console>)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:497)
	at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
	at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1338)
	at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
	at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
	at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
	at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:857)
	at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902)
	at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:814)
	at org.apache.spark.repl.SparkILoopInit$$anonfun$initializeSpark$1.apply(SparkILoopInit.scala:130)
	at org.apache.spark.repl.SparkILoopInit$$anonfun$initializeSpark$1.apply(SparkILoopInit.scala:122)
	at org.apache.spark.repl.SparkIMain.beQuietDuring(SparkIMain.scala:324)
	at org.apache.spark.repl.SparkILoopInit$class.initializeSpark(SparkILoopInit.scala:122)
	at org.apache.spark.repl.SparkILoop.initializeSpark(SparkILoop.scala:64)
	at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1$$anonfun$apply$mcZ$sp$5.apply$mcV$sp(SparkILoop.scala:974)
	at org.apache.spark.repl.SparkILoopInit$class.runThunks(SparkILoopInit.scala:157)
	at org.apache.spark.repl.SparkILoop.runThunks(SparkILoop.scala:64)
	at org.apache.spark.repl.SparkILoopInit$class.postInitialization(SparkILoopInit.scala:106)
	at org.apache.spark.repl.SparkILoop.postInitialization(SparkILoop.scala:64)
	at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply$mcZ$sp(SparkILoop.scala:991)
	at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
	at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
	at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
	at
[jira] [Updated] (SPARK-8162) Running spark-shell causes NullPointerException
[ https://issues.apache.org/jira/browse/SPARK-8162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrew Or updated SPARK-8162:
-----------------------------
    Affects Version/s: 1.5.0

Running spark-shell causes NullPointerException
-----------------------------------------------
Key: SPARK-8162
URL: https://issues.apache.org/jira/browse/SPARK-8162
Project: Spark
Issue Type: Bug
Components: Build, Spark Shell
Affects Versions: 1.5.0
Reporter: Weizhong

Running spark-shell on the latest master branch fails. Details:

{noformat}
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.5.0-SNAPSHOT
      /_/

Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_40)
Type in expressions to have them evaluated.
Type :help for more information.
error: error while loading JobProgressListener, Missing dependency 'bad symbolic reference. A signature in JobProgressListener.class refers to term annotations in package com.google.common which is not available. It may be completely missing from the current classpath, or the version on the classpath might be incompatible with the version used when compiling JobProgressListener.class.', required by /opt/apache/spark/lib/spark-assembly-1.5.0-SNAPSHOT-hadoop2.7.0.jar(org/apache/spark/ui/jobs/JobProgressListener.class)
java.lang.NullPointerException
	at org.apache.spark.sql.SQLContext.<init>(SQLContext.scala:193)
	at org.apache.spark.sql.hive.HiveContext.<init>(HiveContext.scala:68)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
	at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
	at org.apache.spark.repl.SparkILoop.createSQLContext(SparkILoop.scala:1028)
	at $iwC$$iwC.<init>(<console>:9)
	at $iwC.<init>(<console>:18)
	at <init>(<console>:20)
	at .<init>(<console>:24)
	at .<clinit>(<console>)
	at .<init>(<console>:7)
	at .<clinit>(<console>)
	at $print(<console>)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:497)
	at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
	at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1338)
	at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
	at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
	at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
	at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:857)
	at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902)
	at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:814)
	at org.apache.spark.repl.SparkILoopInit$$anonfun$initializeSpark$1.apply(SparkILoopInit.scala:130)
	at org.apache.spark.repl.SparkILoopInit$$anonfun$initializeSpark$1.apply(SparkILoopInit.scala:122)
	at org.apache.spark.repl.SparkIMain.beQuietDuring(SparkIMain.scala:324)
	at org.apache.spark.repl.SparkILoopInit$class.initializeSpark(SparkILoopInit.scala:122)
	at org.apache.spark.repl.SparkILoop.initializeSpark(SparkILoop.scala:64)
	at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1$$anonfun$apply$mcZ$sp$5.apply$mcV$sp(SparkILoop.scala:974)
	at org.apache.spark.repl.SparkILoopInit$class.runThunks(SparkILoopInit.scala:157)
	at org.apache.spark.repl.SparkILoop.runThunks(SparkILoop.scala:64)
	at org.apache.spark.repl.SparkILoopInit$class.postInitialization(SparkILoopInit.scala:106)
	at org.apache.spark.repl.SparkILoop.postInitialization(SparkILoop.scala:64)
	at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply$mcZ$sp(SparkILoop.scala:991)
	at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
	at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
	at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
	at org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$process(SparkILoop.scala:945)
	at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1059)
	at
[jira] [Updated] (SPARK-8162) Run spark-shell cause NullPointerException
[ https://issues.apache.org/jira/browse/SPARK-8162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-8162: - Priority: Blocker (was: Major) Run spark-shell cause NullPointerException -- Key: SPARK-8162 URL: https://issues.apache.org/jira/browse/SPARK-8162 Project: Spark Issue Type: Bug Components: Build, Spark Shell Affects Versions: 1.5.0 Reporter: Weizhong Priority: Blocker run spark-shell on latest master branch, then failed, details are: {noformat} Welcome to __ / __/__ ___ _/ /__ _\ \/ _ \/ _ `/ __/ '_/ /___/ .__/\_,_/_/ /_/\_\ version 1.5.0-SNAPSHOT /_/ Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_40) Type in expressions to have them evaluated. Type :help for more information. error: error while loading JobProgressListener, Missing dependency 'bad symbolic reference. A signature in JobProgressListener.class refers to term annotations in package com.google.common which is not available. It may be completely missing from the current classpath, or the version on the classpath might be incompatible with the version used when compiling JobProgressListener.class.', required by /opt/apache/spark/lib/spark-assembly-1.5.0-SNAPSHOT-hadoop2.7.0.jar(org/apache/spark/ui/jobs/JobProgressListener.class) java.lang.NullPointerException at org.apache.spark.sql.SQLContext.init(SQLContext.scala:193) at org.apache.spark.sql.hive.HiveContext.init(HiveContext.scala:68) at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) at java.lang.reflect.Constructor.newInstance(Constructor.java:422) at org.apache.spark.repl.SparkILoop.createSQLContext(SparkILoop.scala:1028) at $iwC$$iwC.init(console:9) at $iwC.init(console:18) at init(console:20) at .init(console:24) at .clinit(console) at .init(console:7) at .clinit(console) at $print(console) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:497) at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065) at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1338) at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840) at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871) at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819) at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:857) at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902) at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:814) at org.apache.spark.repl.SparkILoopInit$$anonfun$initializeSpark$1.apply(SparkILoopInit.scala:130) at org.apache.spark.repl.SparkILoopInit$$anonfun$initializeSpark$1.apply(SparkILoopInit.scala:122) at org.apache.spark.repl.SparkIMain.beQuietDuring(SparkIMain.scala:324) at org.apache.spark.repl.SparkILoopInit$class.initializeSpark(SparkILoopInit.scala:122) at org.apache.spark.repl.SparkILoop.initializeSpark(SparkILoop.scala:64) at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1$$anonfun$apply$mcZ$sp$5.apply$mcV$sp(SparkILoop.scala:974) at 
org.apache.spark.repl.SparkILoopInit$class.runThunks(SparkILoopInit.scala:157) at org.apache.spark.repl.SparkILoop.runThunks(SparkILoop.scala:64) at org.apache.spark.repl.SparkILoopInit$class.postInitialization(SparkILoopInit.scala:106) at org.apache.spark.repl.SparkILoop.postInitialization(SparkILoop.scala:64) at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply$mcZ$sp(SparkILoop.scala:991) at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945) at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945) at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135) at org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$process(SparkILoop.scala:945) at
[jira] [Created] (SPARK-8173) A class which holds all the constants should be present
sahitya pavurala created SPARK-8173: --- Summary: A class which holds all the constants should be present Key: SPARK-8173 URL: https://issues.apache.org/jira/browse/SPARK-8173 Project: Spark Issue Type: Improvement Components: Spark Core Environment: software Reporter: sahitya pavurala Priority: Minor A class which holds all the constants should be present, instead of hardcoding them everywhere (similar to MRConstants.java in MapReduce). All parameter names should be referenced from that class wherever they are used. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
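A minimal sketch of what such a holder could look like, with a hypothetical object name and an illustrative subset of keys (this is not an existing Spark class):
{code}
// Hypothetical constants holder, analogous to MRConstants.java in MapReduce.
// The object name and the selection of keys are illustrative only.
object SparkConstants {
  val ExecutorMemory = "spark.executor.memory"
  val DriverMemory   = "spark.driver.memory"
  val Serializer     = "spark.serializer"
}

// Call sites would then reference the constant instead of a string literal,
// e.g. conf.set(SparkConstants.ExecutorMemory, "4g")
{code}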
[jira] [Created] (SPARK-8170) Ctrl-C in pyspark shell doesn't kill running job
Ashwin Shankar created SPARK-8170: - Summary: Ctrl-C in pyspark shell doesn't kill running job Key: SPARK-8170 URL: https://issues.apache.org/jira/browse/SPARK-8170 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 1.3.1 Reporter: Ashwin Shankar Hitting Ctrl-C in spark-sql (and other tools like Presto) cancels any running job and starts a new input line at the prompt. It would be nice if the pyspark shell could do that too. Otherwise, if a user submits a job by mistake and wants to cancel it, he needs to exit the shell and log back in to continue his work. Re-login can be a pain, especially for Spark on YARN, since it takes a while to allocate the AM container and the initial executors. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
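For what it's worth, on HotSpot JVMs a REPL can intercept Ctrl-C via the (non-portable, internal) sun.misc.Signal API and cancel jobs instead of exiting; a hedged sketch, not how PySpark actually behaves today:
{code}
import sun.misc.{Signal, SignalHandler}
import org.apache.spark.SparkContext

// Sketch: replace the default SIGINT behavior with job cancellation so the
// shell survives Ctrl-C. Assumes a live SparkContext is in scope.
def installInterruptHandler(sc: SparkContext): Unit = {
  Signal.handle(new Signal("INT"), new SignalHandler {
    override def handle(sig: Signal): Unit = {
      sc.cancelAllJobs()  // cancel running jobs; the prompt stays alive
    }
  })
}
{code}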
[jira] [Commented] (SPARK-8171) Support Javascript-based infinite scrolling in Spark log viewers
[ https://issues.apache.org/jira/browse/SPARK-8171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14578114#comment-14578114 ] Josh Rosen commented on SPARK-8171: --- Note: I don't plan to work on this myself; someone else should feel free to take it. Support Javascript-based infinite scrolling in Spark log viewers Key: SPARK-8171 URL: https://issues.apache.org/jira/browse/SPARK-8171 Project: Spark Issue Type: New Feature Components: Web UI Reporter: Josh Rosen It would be cool if the Spark Web UI's log viewers supported infinite scrolling so that I can just click / scroll up to go back in the log. Maybe there's an off-the-shelf Javascript component for this. See SPARK-608, where the log viewer pagination was first introduced. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-8171) Support Javascript-based infinite scrolling in Spark log viewers
Josh Rosen created SPARK-8171: - Summary: Support Javascript-based infinite scrolling in Spark log viewers Key: SPARK-8171 URL: https://issues.apache.org/jira/browse/SPARK-8171 Project: Spark Issue Type: New Feature Components: Web UI Reporter: Josh Rosen It would be cool if the Spark Web UI's log viewers supported infinite scrolling so that I can just click / scroll up to go back in the log. Maybe there's an off-the-shelf Javascript component for this. See SPARK-608, where the log viewer pagination was first introduced. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-8162) Run spark-shell cause NullPointerException
[ https://issues.apache.org/jira/browse/SPARK-8162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or closed SPARK-8162. Resolution: Fixed Fix Version/s: 1.5.0 1.4.1 Target Version/s: 1.4.1, 1.5.0 Run spark-shell cause NullPointerException -- Key: SPARK-8162 URL: https://issues.apache.org/jira/browse/SPARK-8162 Project: Spark Issue Type: Bug Components: Build, Spark Shell Affects Versions: 1.4.1, 1.5.0 Reporter: Weizhong Priority: Blocker Fix For: 1.4.1, 1.5.0 run spark-shell on latest master branch, then failed, details are: {noformat} Welcome to __ / __/__ ___ _/ /__ _\ \/ _ \/ _ `/ __/ '_/ /___/ .__/\_,_/_/ /_/\_\ version 1.5.0-SNAPSHOT /_/ Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_40) Type in expressions to have them evaluated. Type :help for more information. error: error while loading JobProgressListener, Missing dependency 'bad symbolic reference. A signature in JobProgressListener.class refers to term annotations in package com.google.common which is not available. It may be completely missing from the current classpath, or the version on the classpath might be incompatible with the version used when compiling JobProgressListener.class.', required by /opt/apache/spark/lib/spark-assembly-1.5.0-SNAPSHOT-hadoop2.7.0.jar(org/apache/spark/ui/jobs/JobProgressListener.class) java.lang.NullPointerException at org.apache.spark.sql.SQLContext.init(SQLContext.scala:193) at org.apache.spark.sql.hive.HiveContext.init(HiveContext.scala:68) at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) at java.lang.reflect.Constructor.newInstance(Constructor.java:422) at org.apache.spark.repl.SparkILoop.createSQLContext(SparkILoop.scala:1028) at $iwC$$iwC.init(console:9) at $iwC.init(console:18) at init(console:20) at .init(console:24) at .clinit(console) at .init(console:7) at .clinit(console) at $print(console) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:497) at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065) at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1338) at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840) at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871) at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819) at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:857) at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902) at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:814) at org.apache.spark.repl.SparkILoopInit$$anonfun$initializeSpark$1.apply(SparkILoopInit.scala:130) at org.apache.spark.repl.SparkILoopInit$$anonfun$initializeSpark$1.apply(SparkILoopInit.scala:122) at org.apache.spark.repl.SparkIMain.beQuietDuring(SparkIMain.scala:324) at org.apache.spark.repl.SparkILoopInit$class.initializeSpark(SparkILoopInit.scala:122) at org.apache.spark.repl.SparkILoop.initializeSpark(SparkILoop.scala:64) at 
org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1$$anonfun$apply$mcZ$sp$5.apply$mcV$sp(SparkILoop.scala:974) at org.apache.spark.repl.SparkILoopInit$class.runThunks(SparkILoopInit.scala:157) at org.apache.spark.repl.SparkILoop.runThunks(SparkILoop.scala:64) at org.apache.spark.repl.SparkILoopInit$class.postInitialization(SparkILoopInit.scala:106) at org.apache.spark.repl.SparkILoop.postInitialization(SparkILoop.scala:64) at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply$mcZ$sp(SparkILoop.scala:991) at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945) at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945) at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135) at
[jira] [Commented] (SPARK-7886) Add built-in expressions to FunctionRegistry
[ https://issues.apache.org/jira/browse/SPARK-7886?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14578049#comment-14578049 ] Apache Spark commented on SPARK-7886: - User 'rxin' has created a pull request for this issue: https://github.com/apache/spark/pull/6710 Add built-in expressions to FunctionRegistry Key: SPARK-7886 URL: https://issues.apache.org/jira/browse/SPARK-7886 Project: Spark Issue Type: Improvement Components: SQL Reporter: Reynold Xin Priority: Blocker Once we do this, we no longer need to hardcode expressions into the parser (both for internal SQL and Hive QL). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
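As a rough illustration of the idea (not the actual Spark code; names are made up), a registry maps function names to builders so the parser can resolve functions by lookup instead of pattern-matching on hardcoded names:
{code}
import scala.collection.mutable

// Sketch only: "Any" stands in for Catalyst's Expression type.
object FunctionRegistrySketch {
  type ExpressionBuilder = Seq[Any] => Any
  private val builders = mutable.Map.empty[String, ExpressionBuilder]

  def registerFunction(name: String, builder: ExpressionBuilder): Unit =
    builders(name.toLowerCase) = builder

  def lookupFunction(name: String, args: Seq[Any]): Option[Any] =
    builders.get(name.toLowerCase).map(_(args))
}
{code}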
[jira] [Updated] (SPARK-8162) Run spark-shell cause NullPointerException
[ https://issues.apache.org/jira/browse/SPARK-8162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-8162: - Affects Version/s: 1.4.1 Run spark-shell cause NullPointerException -- Key: SPARK-8162 URL: https://issues.apache.org/jira/browse/SPARK-8162 Project: Spark Issue Type: Bug Components: Build, Spark Shell Affects Versions: 1.4.1, 1.5.0 Reporter: Weizhong Priority: Blocker Fix For: 1.4.1, 1.5.0 run spark-shell on latest master branch, then failed, details are: {noformat} Welcome to __ / __/__ ___ _/ /__ _\ \/ _ \/ _ `/ __/ '_/ /___/ .__/\_,_/_/ /_/\_\ version 1.5.0-SNAPSHOT /_/ Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_40) Type in expressions to have them evaluated. Type :help for more information. error: error while loading JobProgressListener, Missing dependency 'bad symbolic reference. A signature in JobProgressListener.class refers to term annotations in package com.google.common which is not available. It may be completely missing from the current classpath, or the version on the classpath might be incompatible with the version used when compiling JobProgressListener.class.', required by /opt/apache/spark/lib/spark-assembly-1.5.0-SNAPSHOT-hadoop2.7.0.jar(org/apache/spark/ui/jobs/JobProgressListener.class) java.lang.NullPointerException at org.apache.spark.sql.SQLContext.init(SQLContext.scala:193) at org.apache.spark.sql.hive.HiveContext.init(HiveContext.scala:68) at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) at java.lang.reflect.Constructor.newInstance(Constructor.java:422) at org.apache.spark.repl.SparkILoop.createSQLContext(SparkILoop.scala:1028) at $iwC$$iwC.init(console:9) at $iwC.init(console:18) at init(console:20) at .init(console:24) at .clinit(console) at .init(console:7) at .clinit(console) at $print(console) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:497) at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065) at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1338) at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840) at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871) at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819) at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:857) at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902) at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:814) at org.apache.spark.repl.SparkILoopInit$$anonfun$initializeSpark$1.apply(SparkILoopInit.scala:130) at org.apache.spark.repl.SparkILoopInit$$anonfun$initializeSpark$1.apply(SparkILoopInit.scala:122) at org.apache.spark.repl.SparkIMain.beQuietDuring(SparkIMain.scala:324) at org.apache.spark.repl.SparkILoopInit$class.initializeSpark(SparkILoopInit.scala:122) at org.apache.spark.repl.SparkILoop.initializeSpark(SparkILoop.scala:64) at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1$$anonfun$apply$mcZ$sp$5.apply$mcV$sp(SparkILoop.scala:974) at 
org.apache.spark.repl.SparkILoopInit$class.runThunks(SparkILoopInit.scala:157) at org.apache.spark.repl.SparkILoop.runThunks(SparkILoop.scala:64) at org.apache.spark.repl.SparkILoopInit$class.postInitialization(SparkILoopInit.scala:106) at org.apache.spark.repl.SparkILoop.postInitialization(SparkILoop.scala:64) at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply$mcZ$sp(SparkILoop.scala:991) at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945) at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945) at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135) at org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$process(SparkILoop.scala:945) at
[jira] [Reopened] (SPARK-8163) CheckPoint mechanism did not work well when error happened in big streaming
[ https://issues.apache.org/jira/browse/SPARK-8163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] SaintBacchus reopened SPARK-8163: - CheckPoint mechanism did not work well when error happened in big streaming --- Key: SPARK-8163 URL: https://issues.apache.org/jira/browse/SPARK-8163 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.4.0 Reporter: SaintBacchus I tested this with a Kafka DStream. Sometimes the Kafka producer had pushed a lot of data to the Kafka brokers, and the Streaming receiver then pulled this data without any rate limit. For this first batch, Streaming may take 10 or more seconds to consume the data (the batch interval was 2 seconds). To describe in more detail what Streaming is doing at this moment: the SparkContext was doing its job; the JobGenerator kept sending new batches to the StreamingContext, and the StreamingContext wrote them to the checkpoint files; and the receiver was still busy receiving data from Kafka and also tracked these events in the checkpoint. Then an unexpected error occurred, shutting down the streaming application. We then wanted to recover the application from the checkpoint files. But since the StreamingContext had already recorded the next few batches, it was recovered from the last recorded batch. So Streaming had already missed the first batch and did not know what data had actually been consumed by the receiver. Setting spark.streaming.concurrentJobs=2 can avoid this problem, but some applications cannot do this. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8163) CheckPoint mechanism did not work well when error happened in big streaming
[ https://issues.apache.org/jira/browse/SPARK-8163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14578161#comment-14578161 ] SaintBacchus commented on SPARK-8163: - Hi [~sowen], the description above is exactly how I hit the problem. Since my English is poor, I think you may not have understood what I said: First, the producer pushed a lot of data to the Kafka brokers. Second, after a while (about 10s) the streaming application was shut down. Third, it was recovered from the checkpoint file. The result is that Streaming skipped many batches. I really think this is a big problem, so I am reopening this issue. CheckPoint mechanism did not work well when error happened in big streaming --- Key: SPARK-8163 URL: https://issues.apache.org/jira/browse/SPARK-8163 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.4.0 Reporter: SaintBacchus I tested this with a Kafka DStream. Sometimes the Kafka producer had pushed a lot of data to the Kafka brokers, and the Streaming receiver then pulled this data without any rate limit. For this first batch, Streaming may take 10 or more seconds to consume the data (the batch interval was 2 seconds). To describe in more detail what Streaming is doing at this moment: the SparkContext was doing its job; the JobGenerator kept sending new batches to the StreamingContext, and the StreamingContext wrote them to the checkpoint files; and the receiver was still busy receiving data from Kafka and also tracked these events in the checkpoint. Then an unexpected error occurred, shutting down the streaming application. We then wanted to recover the application from the checkpoint files. But since the StreamingContext had already recorded the next few batches, it was recovered from the last recorded batch. So Streaming had already missed the first batch and did not know what data had actually been consumed by the receiver. Setting spark.streaming.concurrentJobs=2 can avoid this problem, but some applications cannot do this. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8170) Ctrl-C in pyspark shell doesn't kill running job
[ https://issues.apache.org/jira/browse/SPARK-8170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ashwin Shankar updated SPARK-8170: -- Issue Type: Sub-task (was: Improvement) Parent: SPARK-7006 Ctrl-C in pyspark shell doesn't kill running job Key: SPARK-8170 URL: https://issues.apache.org/jira/browse/SPARK-8170 Project: Spark Issue Type: Sub-task Components: PySpark Affects Versions: 1.3.1 Reporter: Ashwin Shankar Hitting Ctrl-C in spark-sql (and other tools like Presto) cancels any running job and starts a new input line at the prompt. It would be nice if the pyspark shell could do that too. Otherwise, if a user submits a job by mistake and wants to cancel it, he needs to exit the shell and log back in to continue his work. Re-login can be a pain, especially for Spark on YARN, since it takes a while to allocate the AM container and the initial executors. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4809) Improve Guava shading in Spark
[ https://issues.apache.org/jira/browse/SPARK-4809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14578146#comment-14578146 ] Ronald Chen commented on SPARK-4809: This change makes no sense. It breaks SPARK-2848's attempt to move the Guava dependency off of the user's classpath. Now I cannot use my own version of Guava without these classes conflicting. Improve Guava shading in Spark -- Key: SPARK-4809 URL: https://issues.apache.org/jira/browse/SPARK-4809 Project: Spark Issue Type: Improvement Affects Versions: 1.2.0 Reporter: Marcelo Vanzin Assignee: Marcelo Vanzin Fix For: 1.3.0 As part of SPARK-2848, we started shading Guava to help with projects that want to use Spark but use an incompatible version of Guava. The approach used there is a little sub-optimal, though. In particular, it makes it tricky to run unit tests in your project when those need to use spark-core APIs. We should make the shading more transparent so that it's easier to use spark-core, with or without an explicit Guava dependency. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-3577) Add task metric to report spill time
[ https://issues.apache.org/jira/browse/SPARK-3577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14578240#comment-14578240 ] Ming Chen edited comment on SPARK-3577 at 6/9/15 5:26 AM: -- Why has the metric not been added? I think this is rather important, since it may affect the results of the research work here: https://kayousterhout.github.io/trace-analysis/ was (Author: mammothcm): Why have not the metric been added? I think this is rather important, it may affect the results of the research work on this : https://kayousterhout.github.io/trace-analysis/ Add task metric to report spill time Key: SPARK-3577 URL: https://issues.apache.org/jira/browse/SPARK-3577 Project: Spark Issue Type: Bug Components: Shuffle, Spark Core Affects Versions: 1.1.0 Reporter: Kay Ousterhout Assignee: Sandy Ryza Priority: Minor The {{ExternalSorter}} passes its own {{ShuffleWriteMetrics}} into {{ExternalSorter}}. The write time recorded in those metrics is never used. We should probably add task metrics to report this spill time, since for shuffles, this would have previously been reported as part of shuffle write time (with the original hash-based sorter). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-8174) unix_timestamp
Reynold Xin created SPARK-8174: -- Summary: unix_timestamp Key: SPARK-8174 URL: https://issues.apache.org/jira/browse/SPARK-8174 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin 3 variants: unix_timestamp(): long Gets current Unix timestamp in seconds. unix_timestamp(string date): long Converts time string in format yyyy-MM-dd HH:mm:ss to Unix timestamp (in seconds), using the default timezone and the default locale, return 0 if fail: unix_timestamp('2009-03-20 11:30:01') = 1237573801 unix_timestamp(string date, string pattern): long Convert time string with given pattern (see [http://docs.oracle.com/javase/tutorial/i18n/format/simpleDateFormat.html]) to Unix time stamp (in seconds), return 0 if fail: unix_timestamp('2009-03-20', 'yyyy-MM-dd') = 1237532400. See: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8174) unix_timestamp
[ https://issues.apache.org/jira/browse/SPARK-8174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-8174: --- Description: 3 variants: {code} unix_timestamp(): long Gets current Unix timestamp in seconds. unix_timestamp(string date): long Converts time string in format yyyy-MM-dd HH:mm:ss to Unix timestamp (in seconds), using the default timezone and the default locale, return 0 if fail: unix_timestamp('2009-03-20 11:30:01') = 1237573801 unix_timestamp(string date, string pattern): long Convert time string with given pattern (see [http://docs.oracle.com/javase/tutorial/i18n/format/simpleDateFormat.html]) to Unix time stamp (in seconds), return 0 if fail: unix_timestamp('2009-03-20', 'yyyy-MM-dd') = 1237532400. {code} See: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF was: 3 variants: unix_timestamp(): long Gets current Unix timestamp in seconds. unix_timestamp(string date): long Converts time string in format yyyy-MM-dd HH:mm:ss to Unix timestamp (in seconds), using the default timezone and the default locale, return 0 if fail: unix_timestamp('2009-03-20 11:30:01') = 1237573801 unix_timestamp(string date, string pattern): long Convert time string with given pattern (see [http://docs.oracle.com/javase/tutorial/i18n/format/simpleDateFormat.html]) to Unix time stamp (in seconds), return 0 if fail: unix_timestamp('2009-03-20', 'yyyy-MM-dd') = 1237532400. See: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF unix_timestamp -- Key: SPARK-8174 URL: https://issues.apache.org/jira/browse/SPARK-8174 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin 3 variants: {code} unix_timestamp(): long Gets current Unix timestamp in seconds. unix_timestamp(string date): long Converts time string in format yyyy-MM-dd HH:mm:ss to Unix timestamp (in seconds), using the default timezone and the default locale, return 0 if fail: unix_timestamp('2009-03-20 11:30:01') = 1237573801 unix_timestamp(string date, string pattern): long Convert time string with given pattern (see [http://docs.oracle.com/javase/tutorial/i18n/format/simpleDateFormat.html]) to Unix time stamp (in seconds), return 0 if fail: unix_timestamp('2009-03-20', 'yyyy-MM-dd') = 1237532400. {code} See: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
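The described semantics can be approximated with java.text.SimpleDateFormat; a hedged sketch of the variants above (illustrative only, not Spark's implementation; the example results depend on the default time zone):
{code}
import java.text.{ParseException, SimpleDateFormat}

// Sketch of the documented behavior: parse with the default time zone and
// locale, return seconds since the epoch, or 0 on failure.
def unixTimestampNow(): Long = System.currentTimeMillis / 1000L

def unixTimestamp(date: String, pattern: String = "yyyy-MM-dd HH:mm:ss"): Long =
  try new SimpleDateFormat(pattern).parse(date).getTime / 1000L
  catch { case _: ParseException => 0L }

// unixTimestamp("2009-03-20 11:30:01")       == 1237573801 under US/Pacific
// unixTimestamp("2009-03-20", "yyyy-MM-dd")  == 1237532400 under US/Pacific
{code}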
[jira] [Updated] (SPARK-8175) from_unixtime function
[ https://issues.apache.org/jira/browse/SPARK-8175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-8175: --- Summary: from_unixtime function (was: from_unixtime expression) from_unixtime function -- Key: SPARK-8175 URL: https://issues.apache.org/jira/browse/SPARK-8175 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin from_unixtime(bigint unixtime[, string format]): string Converts the number of seconds from unix epoch (1970-01-01 00:00:00 UTC) to a string representing the timestamp of that moment in the current system time zone in the format of 1970-01-01 00:00:00. See https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
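Similarly, a hedged SimpleDateFormat sketch of the described conversion (illustrative only; the output depends on the system time zone):
{code}
import java.text.SimpleDateFormat
import java.util.Date

// Sketch: format epoch seconds in the current system time zone; the format
// argument defaults to yyyy-MM-dd HH:mm:ss as in the description above.
def fromUnixtime(unixtime: Long, format: String = "yyyy-MM-dd HH:mm:ss"): String =
  new SimpleDateFormat(format).format(new Date(unixtime * 1000L))

// fromUnixtime(0L) == "1970-01-01 00:00:00" when the system time zone is UTC
{code}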
[jira] [Created] (SPARK-8180) day / dayofmonth function
Reynold Xin created SPARK-8180: -- Summary: day / dayofmonth function Key: SPARK-8180 URL: https://issues.apache.org/jira/browse/SPARK-8180 Project: Spark Issue Type: Sub-task Reporter: Reynold Xin day(string date): int dayofmonth(date): int Returns the day part of a date or a timestamp string: day(1970-11-01 00:00:00) = 1, day(1970-11-01) = 1. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8180) day / dayofmonth function
[ https://issues.apache.org/jira/browse/SPARK-8180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-8180: --- Description: day(string date): int dayofmonth(date): int Returns the day part of a date or a timestamp string: day(1970-11-01 00:00:00) = 1, day(1970-11-01) = 1. See https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF was: day(string date): int dayofmonth(date): int Returns the day part of a date or a timestamp string: day(1970-11-01 00:00:00) = 1, day(1970-11-01) = 1. day / dayofmonth function - Key: SPARK-8180 URL: https://issues.apache.org/jira/browse/SPARK-8180 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin day(string date): int dayofmonth(date): int Returns the day part of a date or a timestamp string: day(1970-11-01 00:00:00) = 1, day(1970-11-01) = 1. See https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8178) quarter function
[ https://issues.apache.org/jira/browse/SPARK-8178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-8178: --- Description: quarter(string|date|timestamp): int Returns the quarter of the year for a date, timestamp, or string in the range 1 to 4. Example: quarter('2015-04-08') = 2. See https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF was: quarter(date/timestamp/string): int Returns the quarter of the year for a date, timestamp, or string in the range 1 to 4. Example: quarter('2015-04-08') = 2. See https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF quarter function Key: SPARK-8178 URL: https://issues.apache.org/jira/browse/SPARK-8178 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin quarter(string|date|timestamp): int Returns the quarter of the year for a date, timestamp, or string in the range 1 to 4. Example: quarter('2015-04-08') = 2. See https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
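The month-to-quarter mapping is simple integer arithmetic; a minimal sketch assuming yyyy-MM-dd input (illustrative only):
{code}
import java.text.SimpleDateFormat
import java.util.Calendar

// Sketch: Calendar.MONTH is zero-based, so quarter = month / 3 + 1, in 1..4.
def quarter(date: String): Int = {
  val cal = Calendar.getInstance()
  cal.setTime(new SimpleDateFormat("yyyy-MM-dd").parse(date))
  cal.get(Calendar.MONTH) / 3 + 1
}

// quarter("2015-04-08") == 2 (April is zero-based month 3; 3 / 3 + 1 == 2)
{code}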
[jira] [Updated] (SPARK-8177) year function
[ https://issues.apache.org/jira/browse/SPARK-8177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-8177: --- Description: year(string|date|timestamp): int Returns the year part of a date or a timestamp string: year(1970-01-01 00:00:00) = 1970, year(1970-01-01) = 1970. See https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF was: year(string date): int Returns the year part of a date or a timestamp string: year(1970-01-01 00:00:00) = 1970, year(1970-01-01) = 1970. See https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF year function - Key: SPARK-8177 URL: https://issues.apache.org/jira/browse/SPARK-8177 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin year(string|date|timestamp): int Returns the year part of a date or a timestamp string: year(1970-01-01 00:00:00) = 1970, year(1970-01-01) = 1970. See https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8180) day / dayofmonth function
[ https://issues.apache.org/jira/browse/SPARK-8180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-8180: --- Description: day(string|date|timestamp): int dayofmonth(string|date|timestamp): int Returns the day part of a date or a timestamp string: day(1970-11-01 00:00:00) = 1, day(1970-11-01) = 1. See https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF was: day(string date): int dayofmonth(date): int Returns the day part of a date or a timestamp string: day(1970-11-01 00:00:00) = 1, day(1970-11-01) = 1. See https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF day / dayofmonth function - Key: SPARK-8180 URL: https://issues.apache.org/jira/browse/SPARK-8180 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin day(string|date|timestamp): int dayofmonth(string|date|timestamp): int Returns the day part of a date or a timestamp string: day(1970-11-01 00:00:00) = 1, day(1970-11-01) = 1. See https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8181) hour function
[ https://issues.apache.org/jira/browse/SPARK-8181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-8181: --- Description: hour(string|date|timestamp): int Returns the hour of the timestamp: hour('2009-07-30 12:58:59') = 12, hour('12:58:59') = 12. was: hour(string date): int Returns the hour of the timestamp: hour('2009-07-30 12:58:59') = 12, hour('12:58:59') = 12. hour function - Key: SPARK-8181 URL: https://issues.apache.org/jira/browse/SPARK-8181 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin hour(string|date|timestamp): int Returns the hour of the timestamp: hour('2009-07-30 12:58:59') = 12, hour('12:58:59') = 12. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8179) month function
[ https://issues.apache.org/jira/browse/SPARK-8179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-8179: --- Description: month(string|date|timestamp): int Returns the month part of a date or a timestamp string: month(1970-11-01 00:00:00) = 11, month(1970-11-01) = 11. See https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF was: month(string date): int Returns the month part of a date or a timestamp string: month(1970-11-01 00:00:00) = 11, month(1970-11-01) = 11. See https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF month function -- Key: SPARK-8179 URL: https://issues.apache.org/jira/browse/SPARK-8179 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin month(string|date|timestamp): int Returns the month part of a date or a timestamp string: month(1970-11-01 00:00:00) = 11, month(1970-11-01) = 11. See https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8182) minute function
[ https://issues.apache.org/jira/browse/SPARK-8182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-8182: --- Description: minute(string|date|timestamp): int Returns the minute of the timestamp. See https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF was: minute(string date): int Returns the minute of the timestamp. See https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF minute function --- Key: SPARK-8182 URL: https://issues.apache.org/jira/browse/SPARK-8182 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin minute(string|date|timestamp): int Returns the minute of the timestamp. See https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
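The field-extraction functions in this batch (year, month, day/dayofmonth, hour, minute, second) all share one shape; a single hedged java.util.Calendar sketch covers them (illustrative only, not the actual implementation):
{code}
import java.text.SimpleDateFormat
import java.util.Calendar

// Sketch: parse a timestamp string, then read one Calendar field from it.
// Calendar.MONTH is zero-based, hence the +1 for month; others map directly.
def datePart(ts: String, field: Int): Int = {
  val cal = Calendar.getInstance()
  cal.setTime(new SimpleDateFormat("yyyy-MM-dd HH:mm:ss").parse(ts))
  cal.get(field)
}

// year:   datePart("1970-01-01 00:00:00", Calendar.YEAR)         == 1970
// month:  datePart("1970-11-01 00:00:00", Calendar.MONTH) + 1    == 11
// day:    datePart("1970-11-01 00:00:00", Calendar.DAY_OF_MONTH) == 1
// hour:   datePart("2009-07-30 12:58:59", Calendar.HOUR_OF_DAY)  == 12
// minute: datePart("2009-07-30 12:58:59", Calendar.MINUTE)       == 58
{code}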
[jira] [Updated] (SPARK-8176) to_date function
[ https://issues.apache.org/jira/browse/SPARK-8176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-8176: --- Description: to_date(date|timestamp): date to_date(string): string Returns the date part of a timestamp string: to_date(1970-01-01 00:00:00) = 1970-01-01. See https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF was: to_date(string|date|timestamp): string Returns the date part of a timestamp string: to_date(1970-01-01 00:00:00) = 1970-01-01. See https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF to_date function Key: SPARK-8176 URL: https://issues.apache.org/jira/browse/SPARK-8176 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin to_date(date|timestamp): date to_date(string): string Returns the date part of a timestamp string: to_date(1970-01-01 00:00:00) = 1970-01-01. See https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-8183) second function
Reynold Xin created SPARK-8183: -- Summary: second function Key: SPARK-8183 URL: https://issues.apache.org/jira/browse/SPARK-8183 Project: Spark Issue Type: Sub-task Reporter: Reynold Xin second(string date): int Returns the second of the timestamp. See https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8183) second function
[ https://issues.apache.org/jira/browse/SPARK-8183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-8183: --- Target Version/s: 1.5.0 second function --- Key: SPARK-8183 URL: https://issues.apache.org/jira/browse/SPARK-8183 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin second(string date): int Returns the second of the timestamp. See https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-8181) hour function
Reynold Xin created SPARK-8181: -- Summary: hour function Key: SPARK-8181 URL: https://issues.apache.org/jira/browse/SPARK-8181 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin hour(string date): int Returns the hour of the timestamp: hour('2009-07-30 12:58:59') = 12, hour('12:58:59') = 12. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-8182) minute function
Reynold Xin created SPARK-8182: -- Summary: minute function Key: SPARK-8182 URL: https://issues.apache.org/jira/browse/SPARK-8182 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin minute(string date): int Returns the minute of the timestamp. See https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8186) date_add function
[ https://issues.apache.org/jira/browse/SPARK-8186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-8186: --- Target Version/s: 1.5.0 date_add function - Key: SPARK-8186 URL: https://issues.apache.org/jira/browse/SPARK-8186 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin date_add(string startdate, int days): string date_add(date startdate, int days): date Adds a number of days to startdate: date_add('2008-12-31', 1) = '2009-01-01'. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
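A hedged sketch of the described behavior using Calendar day arithmetic (illustrative only):
{code}
import java.text.SimpleDateFormat
import java.util.Calendar

// Sketch: parse the start date, add the requested number of calendar days,
// and format the result back to yyyy-MM-dd.
def dateAdd(startdate: String, days: Int): String = {
  val fmt = new SimpleDateFormat("yyyy-MM-dd")
  val cal = Calendar.getInstance()
  cal.setTime(fmt.parse(startdate))
  cal.add(Calendar.DATE, days)
  fmt.format(cal.getTime)
}

// dateAdd("2008-12-31", 1) == "2009-01-01"
{code}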
[jira] [Commented] (SPARK-3577) Add task metric to report spill time
[ https://issues.apache.org/jira/browse/SPARK-3577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14578240#comment-14578240 ] Ming Chen commented on SPARK-3577: -- Why has the metric not been added? I think this is rather important, since it may affect the results of the research work here: https://kayousterhout.github.io/trace-analysis/ Add task metric to report spill time Key: SPARK-3577 URL: https://issues.apache.org/jira/browse/SPARK-3577 Project: Spark Issue Type: Bug Components: Shuffle, Spark Core Affects Versions: 1.1.0 Reporter: Kay Ousterhout Assignee: Sandy Ryza Priority: Minor The {{ExternalSorter}} passes its own {{ShuffleWriteMetrics}} into {{ExternalSorter}}. The write time recorded in those metrics is never used. We should probably add task metrics to report this spill time, since for shuffles, this would have previously been reported as part of shuffle write time (with the original hash-based sorter). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7886) Add built-in expressions to FunctionRegistry
[ https://issues.apache.org/jira/browse/SPARK-7886?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14578335#comment-14578335 ] Apache Spark commented on SPARK-7886: - User 'rxin' has created a pull request for this issue: https://github.com/apache/spark/pull/6712 Add built-in expressions to FunctionRegistry Key: SPARK-7886 URL: https://issues.apache.org/jira/browse/SPARK-7886 Project: Spark Issue Type: Improvement Components: SQL Reporter: Reynold Xin Priority: Blocker Once we do this, we no longer need to hardcode expressions into the parser (both for internal SQL and Hive QL). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8174) unix_timestamp expression
[ https://issues.apache.org/jira/browse/SPARK-8174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-8174: --- Summary: unix_timestamp expression (was: unix_timestamp) unix_timestamp expression - Key: SPARK-8174 URL: https://issues.apache.org/jira/browse/SPARK-8174 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin 3 variants: {code} unix_timestamp(): long Gets current Unix timestamp in seconds. unix_timestamp(string date): long Converts time string in format yyyy-MM-dd HH:mm:ss to Unix timestamp (in seconds), using the default timezone and the default locale, return 0 if fail: unix_timestamp('2009-03-20 11:30:01') = 1237573801 unix_timestamp(string date, string pattern): long Convert time string with given pattern (see [http://docs.oracle.com/javase/tutorial/i18n/format/simpleDateFormat.html]) to Unix time stamp (in seconds), return 0 if fail: unix_timestamp('2009-03-20', 'yyyy-MM-dd') = 1237532400. {code} See: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8175) from_unixtime expression
[ https://issues.apache.org/jira/browse/SPARK-8175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-8175: --- Summary: from_unixtime expression (was: from_unixtime) from_unixtime expression Key: SPARK-8175 URL: https://issues.apache.org/jira/browse/SPARK-8175 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin from_unixtime(bigint unixtime[, string format]): string Converts the number of seconds from unix epoch (1970-01-01 00:00:00 UTC) to a string representing the timestamp of that moment in the current system time zone in the format of 1970-01-01 00:00:00. See https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-8175) from_unixtime
Reynold Xin created SPARK-8175: -- Summary: from_unixtime Key: SPARK-8175 URL: https://issues.apache.org/jira/browse/SPARK-8175 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin from_unixtime(bigint unixtime[, string format]): string Converts the number of seconds from unix epoch (1970-01-01 00:00:00 UTC) to a string representing the timestamp of that moment in the current system time zone in the format of 1970-01-01 00:00:00. See https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8174) unix_timestamp function
[ https://issues.apache.org/jira/browse/SPARK-8174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-8174: --- Description: 3 variants: {code} unix_timestamp(): long Gets current Unix timestamp in seconds. unix_timestamp(string|date): long Converts time string in format yyyy-MM-dd HH:mm:ss to Unix timestamp (in seconds), using the default timezone and the default locale, return 0 if fail: unix_timestamp('2009-03-20 11:30:01') = 1237573801 unix_timestamp(string date, string pattern): long Convert time string with given pattern (see [http://docs.oracle.com/javase/tutorial/i18n/format/simpleDateFormat.html]) to Unix time stamp (in seconds), return 0 if fail: unix_timestamp('2009-03-20', 'yyyy-MM-dd') = 1237532400. {code} See: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF was: 3 variants: {code} unix_timestamp(): long Gets current Unix timestamp in seconds. unix_timestamp(string date): long Converts time string in format yyyy-MM-dd HH:mm:ss to Unix timestamp (in seconds), using the default timezone and the default locale, return 0 if fail: unix_timestamp('2009-03-20 11:30:01') = 1237573801 unix_timestamp(string date, string pattern): long Convert time string with given pattern (see [http://docs.oracle.com/javase/tutorial/i18n/format/simpleDateFormat.html]) to Unix time stamp (in seconds), return 0 if fail: unix_timestamp('2009-03-20', 'yyyy-MM-dd') = 1237532400. {code} See: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF unix_timestamp function --- Key: SPARK-8174 URL: https://issues.apache.org/jira/browse/SPARK-8174 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin 3 variants: {code} unix_timestamp(): long Gets current Unix timestamp in seconds. unix_timestamp(string|date): long Converts time string in format yyyy-MM-dd HH:mm:ss to Unix timestamp (in seconds), using the default timezone and the default locale, return 0 if fail: unix_timestamp('2009-03-20 11:30:01') = 1237573801 unix_timestamp(string date, string pattern): long Convert time string with given pattern (see [http://docs.oracle.com/javase/tutorial/i18n/format/simpleDateFormat.html]) to Unix time stamp (in seconds), return 0 if fail: unix_timestamp('2009-03-20', 'yyyy-MM-dd') = 1237532400. {code} See: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8174) unix_timestamp function
[ https://issues.apache.org/jira/browse/SPARK-8174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-8174: --- Summary: unix_timestamp function (was: unix_timestamp expression) unix_timestamp function --- Key: SPARK-8174 URL: https://issues.apache.org/jira/browse/SPARK-8174 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin 3 variants: {code} unix_timestamp(): long Gets current Unix timestamp in seconds. unix_timestamp(string date): long Converts time string in format yyyy-MM-dd HH:mm:ss to Unix timestamp (in seconds), using the default timezone and the default locale, return 0 if fail: unix_timestamp('2009-03-20 11:30:01') = 1237573801 unix_timestamp(string date, string pattern): long Convert time string with given pattern (see [http://docs.oracle.com/javase/tutorial/i18n/format/simpleDateFormat.html]) to Unix time stamp (in seconds), return 0 if fail: unix_timestamp('2009-03-20', 'yyyy-MM-dd') = 1237532400. {code} See: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-8176) to_date function
Reynold Xin created SPARK-8176: -- Summary: to_date function Key: SPARK-8176 URL: https://issues.apache.org/jira/browse/SPARK-8176 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin to_date(string timestamp): string Returns the date part of a timestamp string: to_date(1970-01-01 00:00:00) = 1970-01-01. See https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-8177) year function
Reynold Xin created SPARK-8177: -- Summary: year function Key: SPARK-8177 URL: https://issues.apache.org/jira/browse/SPARK-8177 Project: Spark Issue Type: Sub-task Reporter: Reynold Xin year(string date): int Returns the year part of a date or a timestamp string: year(1970-01-01 00:00:00) = 1970, year(1970-01-01) = 1970. See https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-8168) Add Python friendly constructor to PipelineModel
[ https://issues.apache.org/jira/browse/SPARK-8168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] DB Tsai closed SPARK-8168. -- Resolution: Fixed Issue resolved by pull request https://github.com/apache/spark/pull/6709 Add Python friendly constructor to PipelineModel Key: SPARK-8168 URL: https://issues.apache.org/jira/browse/SPARK-8168 Project: Spark Issue Type: Improvement Components: ML Affects Versions: 1.4.0 Reporter: Xiangrui Meng Assignee: Xiangrui Meng We are trying to migrate all Python implementations of Pipeline components to Scala. As part of this effort, PipelineModel should have a Python-friendly constructor. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
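For context, "Python-friendly" here usually means a constructor Py4J can call directly, i.e. one taking a java.util.List rather than a Scala Array or Seq. A rough sketch of the shape such an overload could take (the class name is illustrative; this is not necessarily the merged change):
{code}
import java.util.{List => JList}
import scala.collection.JavaConverters._
import org.apache.spark.ml.Transformer

// Sketch: the auxiliary constructor accepts a java.util.List, which the
// Python side can build easily through Py4J, and converts it to an Array.
class PyFriendlyPipelineModel(val uid: String, val stages: Array[Transformer]) {
  def this(uid: String, stages: JList[Transformer]) =
    this(uid, stages.asScala.toArray)
}
{code}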
[jira] [Resolved] (SPARK-6820) Convert NAs to null type in SparkR DataFrames
[ https://issues.apache.org/jira/browse/SPARK-6820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shivaram Venkataraman resolved SPARK-6820. -- Resolution: Fixed Fix Version/s: 1.4.1 1.5.0 Issue resolved by pull request 6190 [https://github.com/apache/spark/pull/6190] Convert NAs to null type in SparkR DataFrames - Key: SPARK-6820 URL: https://issues.apache.org/jira/browse/SPARK-6820 Project: Spark Issue Type: New Feature Components: SparkR Reporter: Shivaram Venkataraman Assignee: Qian Huang Priority: Critical Fix For: 1.5.0, 1.4.1 While converting RDD or local R DataFrame to a SparkR DataFrame we need to handle missing values or NAs. We should convert NAs to SparkSQL's null type to handle the conversion correctly -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8177) year function
[ https://issues.apache.org/jira/browse/SPARK-8177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-8177: --- Target Version/s: 1.5.0 year function - Key: SPARK-8177 URL: https://issues.apache.org/jira/browse/SPARK-8177 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin year(string date): int Returns the year part of a date or a timestamp string: year(1970-01-01 00:00:00) = 1970, year(1970-01-01) = 1970. See https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-8179) month function
Reynold Xin created SPARK-8179: -- Summary: month function Key: SPARK-8179 URL: https://issues.apache.org/jira/browse/SPARK-8179 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin month(string date): int Returns the month part of a date or a timestamp string: month("1970-11-01 00:00:00") = 11, month("1970-11-01") = 11. See https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
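Same pattern as year: a quick check against HiveContext (assumed in scope) confirms the semantics the native expression must match:
{code:scala}
// Sketch only: month part of a timestamp string and of a date string
val row = hiveContext.sql("SELECT month('1970-11-01 00:00:00'), month('1970-11-01')").first()
assert(row.getInt(0) == 11 && row.getInt(1) == 11)
{code}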
[jira] [Created] (SPARK-8178) quarter function
Reynold Xin created SPARK-8178: -- Summary: quarter function Key: SPARK-8178 URL: https://issues.apache.org/jira/browse/SPARK-8178 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin quarter(date/timestamp/string): int Returns the quarter of the year, in the range 1 to 4, for a date, timestamp, or string. Example: quarter('2015-04-08') = 2. See https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
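quarter() is a newer Hive UDF and may not be reachable through the Hive version Spark bundles, but the underlying arithmetic is simple; a minimal sketch of the mapping a native expression would implement, given a 1-based month number:
{code:scala}
// Sketch only: quarter of the year from a 1-based month number
def quarter(month: Int): Int = (month + 2) / 3

assert(quarter(4) == 2)  // quarter('2015-04-08') = 2
assert(quarter(1) == 1 && quarter(12) == 4)
{code}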
[jira] [Updated] (SPARK-8176) to_date function
[ https://issues.apache.org/jira/browse/SPARK-8176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-8176: --- Description: to_date(string|date|timestamp): string Returns the date part of a timestamp string: to_date("1970-01-01 00:00:00") = "1970-01-01". See https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF was: to_date(string timestamp): string Returns the date part of a timestamp string: to_date("1970-01-01 00:00:00") = "1970-01-01". See https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF to_date function Key: SPARK-8176 URL: https://issues.apache.org/jira/browse/SPARK-8176 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin to_date(string|date|timestamp): string Returns the date part of a timestamp string: to_date("1970-01-01 00:00:00") = "1970-01-01". See https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-8184) weekofyear function
Reynold Xin created SPARK-8184: -- Summary: weekofyear function Key: SPARK-8184 URL: https://issues.apache.org/jira/browse/SPARK-8184 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin weekofyear(string|date|timestamp): int Returns the week number of a timestamp string: weekofyear("1970-11-01 00:00:00") = 44, weekofyear("1970-11-01") = 44. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
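Hive's weekofyear follows ISO-8601 week numbering, which java.util.Calendar can reproduce once its ISO settings are made explicit; a minimal sketch of the semantics:
{code:scala}
// Sketch only: ISO-8601 week number (weeks start Monday, week 1 contains Jan 4)
import java.util.Calendar

def weekOfYear(date: java.sql.Date): Int = {
  val cal = Calendar.getInstance()
  cal.setFirstDayOfWeek(Calendar.MONDAY)
  cal.setMinimalDaysInFirstWeek(4)
  cal.setTime(date)
  cal.get(Calendar.WEEK_OF_YEAR)
}

assert(weekOfYear(java.sql.Date.valueOf("1970-11-01")) == 44)
{code}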
[jira] [Updated] (SPARK-8183) second function
[ https://issues.apache.org/jira/browse/SPARK-8183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-8183: --- Description: second(string|date|timestamp): int Returns the second of the timestamp. See https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF was: second(string date): int Returns the second of the timestamp. See https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF second function --- Key: SPARK-8183 URL: https://issues.apache.org/jira/browse/SPARK-8183 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin second(string|date|timestamp): int Returns the second of the timestamp. See https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
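The semantics here are the straightforward time-part extraction Hive already provides; a minimal sketch via HiveContext (assumed in scope), with an illustrative timestamp:
{code:scala}
// Sketch only: second component of a timestamp string
val row = hiveContext.sql("SELECT second('2015-04-08 13:08:15')").first()
assert(row.getInt(0) == 15)
{code}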
[jira] [Created] (SPARK-8185) datediff function
Reynold Xin created SPARK-8185: -- Summary: datediff function Key: SPARK-8185 URL: https://issues.apache.org/jira/browse/SPARK-8185 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin datediff(date enddate, date startdate): int Returns the number of days from startdate to enddate: datediff('2009-03-01', '2009-02-27') = 2. See https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
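The day difference is computed on the date parts only, ignoring any time component; a minimal sketch of that arithmetic using plain java.sql.Date (the example range crosses no DST boundary):
{code:scala}
// Sketch only: whole days between two dates
import java.util.concurrent.TimeUnit

val end = java.sql.Date.valueOf("2009-03-01")
val start = java.sql.Date.valueOf("2009-02-27")
assert(TimeUnit.MILLISECONDS.toDays(end.getTime - start.getTime) == 2)
{code}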
[jira] [Created] (SPARK-8186) date_add function
Reynold Xin created SPARK-8186: -- Summary: date_add function Key: SPARK-8186 URL: https://issues.apache.org/jira/browse/SPARK-8186 Project: Spark Issue Type: Sub-task Reporter: Reynold Xin date_add(string startdate, int days): string date_add(date startdate, int days): date Adds a number of days to startdate: date_add('2008-12-31', 1) = '2009-01-01'. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
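Calendar arithmetic covers the described behavior, including month and year rollover; a minimal sketch:
{code:scala}
// Sketch only: add days to a date, rolling over month/year boundaries
import java.util.Calendar

def dateAdd(start: java.sql.Date, days: Int): java.sql.Date = {
  val cal = Calendar.getInstance()
  cal.setTime(start)
  cal.add(Calendar.DAY_OF_MONTH, days)
  new java.sql.Date(cal.getTimeInMillis)
}

assert(dateAdd(java.sql.Date.valueOf("2008-12-31"), 1).toString == "2009-01-01")
{code}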
[jira] [Updated] (SPARK-8155) support of proxy user not working in spaced username on windows
[ https://issues.apache.org/jira/browse/SPARK-8155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kaveen Raajan updated SPARK-8155: - Fix Version/s: (was: 1.3.0) support of proxy user not working in spaced username on windows --- Key: SPARK-8155 URL: https://issues.apache.org/jira/browse/SPARK-8155 Project: Spark Issue Type: Sub-task Components: Spark Core Affects Versions: 1.3.1 Environment: windows-8/7/server 2008 Hadoop-2.5.2 Java-1.7.51 username - kaveen raajan Reporter: Kaveen Raajan I'm using Spark 1.3.1 on a Windows machine whose username contains a space (current username: kaveen raajan). I tried to run the following command: {code}spark-shell --master yarn-client --proxy-user SYSTEM {code} It runs successfully for usernames without spaces, with the application running as the SYSTEM user, but when I run it as a user whose name contains a space (kaveen raajan) it throws the following error:
{code}
15/06/05 16:52:48 INFO spark.SecurityManager: Changing view acls to: SYSTEM
15/06/05 16:52:48 INFO spark.SecurityManager: Changing modify acls to: SYSTEM
15/06/05 16:52:48 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(SYSTEM); users with modify permissions: Set(SYSTEM)
15/06/05 16:52:49 INFO slf4j.Slf4jLogger: Slf4jLogger started
15/06/05 16:52:49 INFO Remoting: Starting remoting
15/06/05 16:52:49 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriver@Master:52137]
15/06/05 16:52:49 INFO util.Utils: Successfully started service 'sparkDriver' on port 52137.
15/06/05 16:52:49 INFO spark.SparkEnv: Registering MapOutputTracker
15/06/05 16:52:49 INFO spark.SparkEnv: Registering BlockManagerMaster
15/06/05 16:52:49 INFO storage.DiskBlockManager: Created local directory at C:\Users\KAVEEN~1\AppData\Local\Temp\spark-d5b43891-274c-457d-aa3a-d79a536fd536\blockmgr-e980101b-4f93-455a-8a05-9185dcab9f8e
15/06/05 16:52:49 INFO storage.MemoryStore: MemoryStore started with capacity 265.4 MB
15/06/05 16:52:49 INFO spark.HttpFileServer: HTTP File server directory is C:\Users\KAVEEN~1\AppData\Local\Temp\spark-a35e3f17-641c-4ae3-90f2-51eac901b799\httpd-ecea93ad-c285-4c62-9222-01a9d6ff24e4
15/06/05 16:52:49 INFO spark.HttpServer: Starting HTTP Server
15/06/05 16:52:49 INFO server.Server: jetty-8.y.z-SNAPSHOT
15/06/05 16:52:49 INFO server.AbstractConnector: Started SocketConnector@0.0.0.0:52138
15/06/05 16:52:49 INFO util.Utils: Successfully started service 'HTTP file server' on port 52138.
15/06/05 16:52:49 INFO spark.SparkEnv: Registering OutputCommitCoordinator
15/06/05 16:52:49 INFO server.Server: jetty-8.y.z-SNAPSHOT
15/06/05 16:52:49 INFO server.AbstractConnector: Started SelectChannelConnector@0.0.0.0:4040
15/06/05 16:52:49 INFO util.Utils: Successfully started service 'SparkUI' on port 4040.
15/06/05 16:52:49 INFO ui.SparkUI: Started SparkUI at http://Master:4040
15/06/05 16:52:49 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
java.lang.NullPointerException
        at org.apache.spark.sql.SQLContext.<init>(SQLContext.scala:145)
        at org.apache.spark.sql.hive.HiveContext.<init>(HiveContext.scala:49)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
        at org.apache.spark.repl.SparkILoop.createSQLContext(SparkILoop.scala:1027)
        at $iwC$$iwC.<init>(<console>:9)
        at $iwC.<init>(<console>:18)
        at <init>(<console>:20)
        at .<init>(<console>:24)
        at .<clinit>(<console>)
        at .<init>(<console>:7)
        at .<clinit>(<console>)
        at $print(<console>)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
        at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1338)
        at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
        at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
        at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
        at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:856)
[jira] [Commented] (SPARK-8072) Better AnalysisException for writing DataFrame with identically named columns
[ https://issues.apache.org/jira/browse/SPARK-8072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14576639#comment-14576639 ] Reynold Xin commented on SPARK-8072: Yes - adding a rule in CheckAnalysis.scala to check the column names would work. Better AnalysisException for writing DataFrame with identically named columns - Key: SPARK-8072 URL: https://issues.apache.org/jira/browse/SPARK-8072 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Priority: Blocker We should check if there are duplicate columns, and if yes, throw an explicit error message saying there are duplicate columns. See current error message below. {code}
In [3]: df.withColumn('age', df.age)
Out[3]: DataFrame[age: bigint, name: string, age: bigint]

In [4]: df.withColumn('age', df.age).write.parquet('test-parquet.out')
---------------------------------------------------------------------------
Py4JJavaError                             Traceback (most recent call last)
<ipython-input-4-eecb85256898> in <module>()
----> 1 df.withColumn('age', df.age).write.parquet('test-parquet.out')

/scratch/rxin/spark/python/pyspark/sql/readwriter.py in parquet(self, path, mode)
    350         >>> df.write.parquet(os.path.join(tempfile.mkdtemp(), 'data'))
    351         """
--> 352         self._jwrite.mode(mode).parquet(path)
    353
    354     @since(1.4)

/Users/rxin/anaconda/lib/python2.7/site-packages/py4j-0.8.1-py2.7.egg/py4j/java_gateway.pyc in __call__(self, *args)
    535         answer = self.gateway_client.send_command(command)
    536         return_value = get_return_value(answer, self.gateway_client,
--> 537             self.target_id, self.name)
    538
    539         for temp_arg in temp_args:

/Users/rxin/anaconda/lib/python2.7/site-packages/py4j-0.8.1-py2.7.egg/py4j/protocol.pyc in get_return_value(answer, gateway_client, target_id, name)
    298             raise Py4JJavaError(
    299                 'An error occurred while calling {0}{1}{2}.\n'.
--> 300                 format(target_id, '.', name), value)
    301         else:
    302             raise Py4JError(

Py4JJavaError: An error occurred while calling o35.parquet.
: org.apache.spark.sql.AnalysisException: Reference 'age' is ambiguous, could be: age#0L, age#3L.;
	at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:279)
	at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveChildren(LogicalPlan.scala:116)
	at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$8$$anonfun$applyOrElse$4$$anonfun$16.apply(Analyzer.scala:350)
	at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$8$$anonfun$applyOrElse$4$$anonfun$16.apply(Analyzer.scala:350)
	at org.apache.spark.sql.catalyst.analysis.package$.withPosition(package.scala:48)
	at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$8$$anonfun$applyOrElse$4.applyOrElse(Analyzer.scala:350)
	at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$8$$anonfun$applyOrElse$4.applyOrElse(Analyzer.scala:341)
	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:286)
	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:286)
	at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51)
	at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:285)
	at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$transformExpressionUp$1(QueryPlan.scala:108)
	at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2$$anonfun$apply$2.apply(QueryPlan.scala:123)
	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
	at scala.collection.immutable.List.foreach(List.scala:318)
	at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
	at scala.collection.AbstractTraversable.map(Traversable.scala:105)
	at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2.apply(QueryPlan.scala:122)
	at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
	at scala.collection.Iterator$class.foreach(Iterator.scala:727)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
	at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
	at
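To make the suggestion above concrete, here is a hedged sketch of such a check, not the actual patch: a standalone function over a LogicalPlan's output, with failAnalysis stubbed in (the real rule would live in CheckAnalysis.scala and use its existing failAnalysis helper to raise AnalysisException):
{code:scala}
// Illustrative sketch only, not the actual CheckAnalysis rule
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan

// Stand-in for CheckAnalysis.failAnalysis, which raises AnalysisException
def failAnalysis(msg: String): Nothing = throw new RuntimeException(msg)

def checkDuplicateColumns(plan: LogicalPlan): Unit = {
  // Collect every output name that occurs more than once (case-sensitive)
  val dups = plan.output.map(_.name)
    .groupBy(identity)
    .collect { case (name, occurrences) if occurrences.size > 1 => name }
  if (dups.nonEmpty) {
    failAnalysis(s"Duplicate column(s) ${dups.mkString(", ")} found; cannot write DataFrame.")
  }
}
{code}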
[jira] [Resolved] (SPARK-8154) Remove Term/Code type aliases in code generation
[ https://issues.apache.org/jira/browse/SPARK-8154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-8154. Resolution: Fixed Fix Version/s: 1.5.0 Remove Term/Code type aliases in code generation Key: SPARK-8154 URL: https://issues.apache.org/jira/browse/SPARK-8154 Project: Spark Issue Type: Improvement Components: SQL Reporter: Reynold Xin Assignee: Reynold Xin Fix For: 1.5.0 From my perspective as a code reviewer, I find them more confusing than using String directly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8153) Add configuration for disabling partial aggregation in runtime
[ https://issues.apache.org/jira/browse/SPARK-8153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8153: --- Assignee: Apache Spark Add configuration for disabling partial aggregation in runtime -- Key: SPARK-8153 URL: https://issues.apache.org/jira/browse/SPARK-8153 Project: Spark Issue Type: Improvement Components: SQL Reporter: Navis Assignee: Apache Spark Priority: Trivial Same as hive.map.aggr.hash.min.reduction in Hive, which disables hash aggregation if it does not sufficiently decrease the output size. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-8096) how to convert dataframe field to LabelPoint
[ https://issues.apache.org/jira/browse/SPARK-8096?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14573922#comment-14573922 ] bofei.xiao edited comment on SPARK-8096 at 6/8/15 6:28 AM: --- I'm sorry i haven't expressed my question clearly! was (Author: bofei.xiao): I'm sorry i haven't express my question clearly! how to convert dataframe field to LabelPoint Key: SPARK-8096 URL: https://issues.apache.org/jira/browse/SPARK-8096 Project: Spark Issue Type: Bug Reporter: bofei.xiao How to convert a DataFrame to RDD[LabeledPoint]? I have a DataFrame with fields target, age, sex, height, and I want to use target as the label and age, sex, height as the features vector. I faced this problem in the following circumstance: given a csv file data.csv target,age,sex,height 1,18,1,170 0,25,1,165 ... Now I want to build a decision tree model. Step 1: load the csv data as a DataFrame val data = sqlContext.load("com.databricks.spark.csv", Map("path" -> "data.csv", "header" -> "true")) Step 2: build a DecisionTree model, but DecisionTree needs an RDD[LabeledPoint] as input. Thanks! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
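For completeness, a hedged sketch of the conversion being asked for, assuming spark-csv loaded every column as a string (its default when no schema is inferred) and that data is the DataFrame from step 1:
{code:scala}
// Sketch only: DataFrame -> RDD[LabeledPoint] for DecisionTree
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

val labeled: RDD[LabeledPoint] =
  data.select("target", "age", "sex", "height").map { row =>
    LabeledPoint(
      row.getString(0).toDouble,  // target column becomes the label
      Vectors.dense(
        row.getString(1).toDouble,  // age
        row.getString(2).toDouble,  // sex
        row.getString(3).toDouble)) // height
  }
{code}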
[jira] [Created] (SPARK-8159) Improve expression coverage
Reynold Xin created SPARK-8159: -- Summary: Improve expression coverage Key: SPARK-8159 URL: https://issues.apache.org/jira/browse/SPARK-8159 Project: Spark Issue Type: Improvement Components: SQL Reporter: Reynold Xin This is an umbrella ticket to track new expressions we are adding to SQL/DataFrame. For each new expression, we should implement the code-generated version as well as comprehensive unit tests (for all the data types the expression supports). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4352) Incorporate locality preferences in dynamic allocation requests
[ https://issues.apache.org/jira/browse/SPARK-4352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14576721#comment-14576721 ] Saisai Shao commented on SPARK-4352: Hi [~sandyr], I re-implemented the allocation strategy according to the discussion; the new strategy will: 1. Allocate new containers with a locality distribution ratio for locality-required tasks, and for locality-free tasks let YARN decide the location, as in the previous code. 2. Combine the existing executor/host distribution with the newly requested containers to better suit dynamic allocation. 3. Keep the code logic the same as before if dynamic allocation is not enabled or the preferred locality is empty. 4. Give locality-required executors higher priority for new containers when the current containers are not enough to match all the locality-required tasks. Please help to review; any comment is greatly appreciated. Incorporate locality preferences in dynamic allocation requests --- Key: SPARK-4352 URL: https://issues.apache.org/jira/browse/SPARK-4352 Project: Spark Issue Type: Improvement Components: Spark Core, YARN Affects Versions: 1.2.0 Reporter: Sandy Ryza Assignee: Saisai Shao Priority: Critical Attachments: Supportpreferrednodelocationindynamicallocation.pdf Currently, achieving data locality in Spark is difficult unless an application takes resources on every node in the cluster. preferredNodeLocalityData provides a sort of hacky workaround that has been broken since 1.0. With dynamic executor allocation, Spark requests executors in response to demand from the application. When this occurs, it would be useful to look at the pending tasks and communicate their location preferences to the cluster resource manager. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-8157) should use expression.unapply to match in HiveTypeCoercion
[ https://issues.apache.org/jira/browse/SPARK-8157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin closed SPARK-8157. -- Resolution: Not A Problem see https://github.com/apache/spark/pull/6698 should use expression.unapply to match in HiveTypeCoercion -- Key: SPARK-8157 URL: https://issues.apache.org/jira/browse/SPARK-8157 Project: Spark Issue Type: Bug Components: SQL Reporter: Adrian Wang This is a bug introduced by #6405 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8157) should use expression.unapply to match in HiveTypeCoercion
[ https://issues.apache.org/jira/browse/SPARK-8157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8157: --- Assignee: Apache Spark should use expression.unapply to match in HiveTypeCoercion -- Key: SPARK-8157 URL: https://issues.apache.org/jira/browse/SPARK-8157 Project: Spark Issue Type: Bug Components: SQL Reporter: Adrian Wang Assignee: Apache Spark This is a bug introduced by #6405 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-8157) should use expression.unapply to match in HiveTypeCoercion
Adrian Wang created SPARK-8157: -- Summary: should use expression.unapply to match in HiveTypeCoercion Key: SPARK-8157 URL: https://issues.apache.org/jira/browse/SPARK-8157 Project: Spark Issue Type: Bug Components: SQL Reporter: Adrian Wang This is a bug introduced by #6405 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8158) HiveShim improvement
[ https://issues.apache.org/jira/browse/SPARK-8158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrian Wang updated SPARK-8158: --- Component/s: SQL HiveShim improvement Key: SPARK-8158 URL: https://issues.apache.org/jira/browse/SPARK-8158 Project: Spark Issue Type: Improvement Components: SQL Reporter: Adrian Wang 1. explicitly import implicit conversion support. 2. use .nonEmpty instead of .size > 0 3. use val instead of var 4. fix comment indentation -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7082) Binary processing external sort-merge join
[ https://issues.apache.org/jira/browse/SPARK-7082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-7082: --- Summary: Binary processing external sort-merge join (was: Binary processing sort-merge join) Binary processing external sort-merge join -- Key: SPARK-7082 URL: https://issues.apache.org/jira/browse/SPARK-7082 Project: Spark Issue Type: New Feature Components: SQL Reporter: Reynold Xin -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-8160) Tungsten style external aggregation
Reynold Xin created SPARK-8160: -- Summary: Tungsten style external aggregation Key: SPARK-8160 URL: https://issues.apache.org/jira/browse/SPARK-8160 Project: Spark Issue Type: Improvement Components: SQL Reporter: Reynold Xin Support using external sorting to run aggregation so we can easily process aggregates where each partition is much larger than the memory size. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7792) HiveContext registerTempTable not thread safe
[ https://issues.apache.org/jira/browse/SPARK-7792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14576681#comment-14576681 ] Apache Spark commented on SPARK-7792: - User 'navis' has created a pull request for this issue: https://github.com/apache/spark/pull/6699 HiveContext registerTempTable not thread safe - Key: SPARK-7792 URL: https://issues.apache.org/jira/browse/SPARK-7792 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.1 Reporter: Yana Kadiyska {code:java}
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;
import com.google.common.base.Joiner;
import org.apache.spark.SparkConf;
import org.apache.spark.SparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.Row;

public class ThreadRepro {
  public static void main(String[] args) throws Exception {
    new ThreadRepro().sparkPerfTest();
  }

  public void sparkPerfTest() {
    final AtomicLong counter = new AtomicLong();
    SparkConf conf = new SparkConf();
    conf.setAppName("My Application");
    conf.setMaster("local[7]");
    SparkContext sc = new SparkContext(conf);
    org.apache.spark.sql.hive.HiveContext hc = new org.apache.spark.sql.hive.HiveContext(sc);
    int poolSize = 10;
    ExecutorService pool = Executors.newFixedThreadPool(poolSize);
    for (int i = 0; i < poolSize; i++)
      pool.execute(new QueryJob(hc, i, counter));
    pool.shutdown();
    try {
      pool.awaitTermination(60, TimeUnit.MINUTES);
    } catch (Exception e) {
      System.out.println("Thread interrupted");
    }
    System.out.println("All jobs complete");
    System.out.println(" Counter is " + counter.get());
  }
}

class QueryJob implements Runnable {
  String threadId;
  org.apache.spark.sql.hive.HiveContext sqlContext;
  String key;
  AtomicLong counter;
  final AtomicLong local_counter = new AtomicLong();

  public QueryJob(org.apache.spark.sql.hive.HiveContext _sqlContext, int id, AtomicLong ctr) {
    threadId = "thread_" + id;
    this.sqlContext = _sqlContext;
    this.counter = ctr;
  }

  public void run() {
    for (int i = 0; i < 100; i++) {
      String tblName = threadId + "_" + i;
      DataFrame df = sqlContext.emptyDataFrame();
      df.registerTempTable(tblName);
      String _query = String.format("select count(*) from %s", tblName);
      System.out.println(String.format(" registered table %s; catalog (%s) ", tblName, debugTables()));
      List<Row> res;
      try {
        res = sqlContext.sql(_query).collectAsList();
      } catch (Exception e) {
        System.out.println("*Exception " + debugTables() + "**");
        throw e;
      }
      sqlContext.dropTempTable(tblName);
      System.out.println(" dropped table " + tblName);
      try {
        Thread.sleep(3000); // lets make this a not-so-tight loop
      } catch (Exception e) {
        System.out.println("Thread interrupted");
      }
    }
  }

  private String debugTables() {
    String v = Joiner.on(',').join(sqlContext.tableNames());
    if (v == null) return "";
    else return v;
  }
}
{code} this will periodically produce the following: {quote}
registered table thread_0_50; catalog (thread_1_50)
registered table thread_4_50; catalog (thread_4_50,thread_1_50)
registered table thread_1_50; catalog (thread_1_50)
dropped table thread_1_50
dropped table thread_4_50
*Exception **
Exception in thread "pool-6-thread-1" java.lang.Error: org.apache.spark.sql.AnalysisException: no such table thread_0_50; line 1 pos 21
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1151)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
	at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.spark.sql.AnalysisException: no such table thread_0_50; line 1 pos 21
	at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
	at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.getTable(Analyzer.scala:177)
	at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$6.applyOrElse(Analyzer.scala:186)
	at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$6.applyOrElse(Analyzer.scala:181)
	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:188)
	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:188)
	at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51)
	at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:187)
	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:208)
	at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
	at
[jira] [Assigned] (SPARK-7792) HiveContext registerTempTable not thread safe
[ https://issues.apache.org/jira/browse/SPARK-7792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7792: --- Assignee: Apache Spark HiveContext registerTempTable not thread safe - Key: SPARK-7792 URL: https://issues.apache.org/jira/browse/SPARK-7792 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.1 Reporter: Yana Kadiyska Assignee: Apache Spark {code:java}
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;
import com.google.common.base.Joiner;
import org.apache.spark.SparkConf;
import org.apache.spark.SparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.Row;

public class ThreadRepro {
  public static void main(String[] args) throws Exception {
    new ThreadRepro().sparkPerfTest();
  }

  public void sparkPerfTest() {
    final AtomicLong counter = new AtomicLong();
    SparkConf conf = new SparkConf();
    conf.setAppName("My Application");
    conf.setMaster("local[7]");
    SparkContext sc = new SparkContext(conf);
    org.apache.spark.sql.hive.HiveContext hc = new org.apache.spark.sql.hive.HiveContext(sc);
    int poolSize = 10;
    ExecutorService pool = Executors.newFixedThreadPool(poolSize);
    for (int i = 0; i < poolSize; i++)
      pool.execute(new QueryJob(hc, i, counter));
    pool.shutdown();
    try {
      pool.awaitTermination(60, TimeUnit.MINUTES);
    } catch (Exception e) {
      System.out.println("Thread interrupted");
    }
    System.out.println("All jobs complete");
    System.out.println(" Counter is " + counter.get());
  }
}

class QueryJob implements Runnable {
  String threadId;
  org.apache.spark.sql.hive.HiveContext sqlContext;
  String key;
  AtomicLong counter;
  final AtomicLong local_counter = new AtomicLong();

  public QueryJob(org.apache.spark.sql.hive.HiveContext _sqlContext, int id, AtomicLong ctr) {
    threadId = "thread_" + id;
    this.sqlContext = _sqlContext;
    this.counter = ctr;
  }

  public void run() {
    for (int i = 0; i < 100; i++) {
      String tblName = threadId + "_" + i;
      DataFrame df = sqlContext.emptyDataFrame();
      df.registerTempTable(tblName);
      String _query = String.format("select count(*) from %s", tblName);
      System.out.println(String.format(" registered table %s; catalog (%s) ", tblName, debugTables()));
      List<Row> res;
      try {
        res = sqlContext.sql(_query).collectAsList();
      } catch (Exception e) {
        System.out.println("*Exception " + debugTables() + "**");
        throw e;
      }
      sqlContext.dropTempTable(tblName);
      System.out.println(" dropped table " + tblName);
      try {
        Thread.sleep(3000); // lets make this a not-so-tight loop
      } catch (Exception e) {
        System.out.println("Thread interrupted");
      }
    }
  }

  private String debugTables() {
    String v = Joiner.on(',').join(sqlContext.tableNames());
    if (v == null) return "";
    else return v;
  }
}
{code} this will periodically produce the following: {quote}
registered table thread_0_50; catalog (thread_1_50)
registered table thread_4_50; catalog (thread_4_50,thread_1_50)
registered table thread_1_50; catalog (thread_1_50)
dropped table thread_1_50
dropped table thread_4_50
*Exception **
Exception in thread "pool-6-thread-1" java.lang.Error: org.apache.spark.sql.AnalysisException: no such table thread_0_50; line 1 pos 21
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1151)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
	at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.spark.sql.AnalysisException: no such table thread_0_50; line 1 pos 21
	at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
	at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.getTable(Analyzer.scala:177)
	at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$6.applyOrElse(Analyzer.scala:186)
	at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$6.applyOrElse(Analyzer.scala:181)
	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:188)
	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:188)
	at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51)
	at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:187)
	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:208)
	at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
	at scala.collection.Iterator$class.foreach(Iterator.scala:727)
	at
[jira] [Created] (SPARK-8158) HiveShim improvement
Adrian Wang created SPARK-8158: -- Summary: HiveShim improvement Key: SPARK-8158 URL: https://issues.apache.org/jira/browse/SPARK-8158 Project: Spark Issue Type: Improvement Reporter: Adrian Wang 1. explicitly import implicit conversion support. 2. use .nonEmpty instead of .size > 0 3. use val instead of var 4. fix comment indentation -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-8011) DecimalType is not a datatype
[ https://issues.apache.org/jira/browse/SPARK-8011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bipin Roshan Nag resolved SPARK-8011. - Resolution: Not A Problem DecimalType is not a datatype - Key: SPARK-8011 URL: https://issues.apache.org/jira/browse/SPARK-8011 Project: Spark Issue Type: Bug Components: Java API, Spark Core Affects Versions: 1.3.1 Reporter: Bipin Roshan Nag When I run the following in spark-shell : StructType(StructField("ID", IntegerType, true), StructField("Value", DecimalType, true)) I get <console>:50: error: type mismatch; found : org.apache.spark.sql.types.DecimalType.type required: org.apache.spark.sql.types.DataType StructType(StructField("ID", IntegerType, true), StructField("Value", DecimalType, true)) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8011) DecimalType is not a datatype
[ https://issues.apache.org/jira/browse/SPARK-8011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14576709#comment-14576709 ] Bipin Roshan Nag commented on SPARK-8011: - It works. Thanks. DecimalType is not a datatype - Key: SPARK-8011 URL: https://issues.apache.org/jira/browse/SPARK-8011 Project: Spark Issue Type: Bug Components: Java API, Spark Core Affects Versions: 1.3.1 Reporter: Bipin Roshan Nag When I run the following in spark-shell : StructType(StructField("ID", IntegerType, true), StructField("Value", DecimalType, true)) I get <console>:50: error: type mismatch; found : org.apache.spark.sql.types.DecimalType.type required: org.apache.spark.sql.types.DataType StructType(StructField("ID", IntegerType, true), StructField("Value", DecimalType, true)) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
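The actual suggestion isn't quoted in the thread, but the error message points at the fix: DecimalType alone refers to the companion object, while StructField needs a constructed instance. A hedged sketch against the 1.3/1.4 API:
{code:scala}
// Sketch only: construct a DecimalType instance instead of passing the companion object
import org.apache.spark.sql.types._

val schema = StructType(Seq(
  StructField("ID", IntegerType, true),
  StructField("Value", DecimalType(10, 2), true)))  // or DecimalType.Unlimited in 1.3/1.4
{code}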