[jira] [Updated] (SPARK-8166) Consistently failing test: o.a.s.sql.sources.FilterScanSuite

2015-06-08 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-8166:
-
Description: 
See screenshot. It's been OOM'ing on Master SBT tests specifically with the 
Hadoop 2.3 profile.

https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-SBT/2563/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.3,label=centos/
https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-SBT/2582/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.3,label=centos/
https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-SBT/2573/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.3,label=centos/
... 10+ others

  was:
It's been OOM'ing on Master SBT tests specifically with the Hadoop 2.3 profile.

https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-SBT/2563/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.3,label=centos/
https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-SBT/2582/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.3,label=centos/
https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-SBT/2573/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.3,label=centos/
... 10+ others


 Consistently failing test: o.a.s.sql.sources.FilterScanSuite
 

 Key: SPARK-8166
 URL: https://issues.apache.org/jira/browse/SPARK-8166
 Project: Spark
  Issue Type: Bug
  Components: SQL, Tests
Affects Versions: 1.4.1
Reporter: Andrew Or
Assignee: Yin Huai
Priority: Blocker
 Attachments: master SBT.png


 See screenshot. It's been OOM'ing on Master SBT tests specifically with the 
 Hadoop 2.3 profile.
 https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-SBT/2563/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.3,label=centos/
 https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-SBT/2582/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.3,label=centos/
 https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-SBT/2573/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.3,label=centos/
 ... 10+ others



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8166) Consistently failing test: o.a.s.sql.sources.FilterScanSuite

2015-06-08 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-8166:
-
Attachment: master SBT.png

 Consistently failing test: o.a.s.sql.sources.FilterScanSuite
 

 Key: SPARK-8166
 URL: https://issues.apache.org/jira/browse/SPARK-8166
 Project: Spark
  Issue Type: Bug
  Components: SQL, Tests
Affects Versions: 1.4.1
Reporter: Andrew Or
Assignee: Yin Huai
Priority: Blocker
 Attachments: master SBT.png


 It's been OOM'ing on Master SBT tests specifically with the Hadoop 2.3 
 profile.
 https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-SBT/2563/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.3,label=centos/
 https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-SBT/2582/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.3,label=centos/
 https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-SBT/2573/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.3,label=centos/
 ... 10+ others



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8168) Add Python friendly constructor to PipelineModel

2015-06-08 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14577748#comment-14577748
 ] 

Apache Spark commented on SPARK-8168:
-

User 'mengxr' has created a pull request for this issue:
https://github.com/apache/spark/pull/6709

 Add Python friendly constructor to PipelineModel
 

 Key: SPARK-8168
 URL: https://issues.apache.org/jira/browse/SPARK-8168
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 1.4.0
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng

 We are trying to migrate all Python implementations of Pipeline components to 
 Scala. As part of this effort, PipelineModel should have a Python-friendly 
 constructor.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-8168) Add Python friendly constructor to PipelineModel

2015-06-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8168:
---

Assignee: Xiangrui Meng  (was: Apache Spark)

 Add Python friendly constructor to PipelineModel
 

 Key: SPARK-8168
 URL: https://issues.apache.org/jira/browse/SPARK-8168
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 1.4.0
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng

 We are trying to migrate all Python implementations of Pipeline components to 
 Scala. As part of this effort, PipelineModel should have a Python-friendly 
 constructor.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8166) Consistently failing test: o.a.s.sql.sources.FilteredScanSuite

2015-06-08 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-8166:
-
Affects Version/s: (was: 1.4.1)

 Consistently failing test: o.a.s.sql.sources.FilteredScanSuite
 --

 Key: SPARK-8166
 URL: https://issues.apache.org/jira/browse/SPARK-8166
 Project: Spark
  Issue Type: Bug
  Components: SQL, Tests
Affects Versions: 1.5.0
Reporter: Andrew Or
Assignee: Yin Huai
Priority: Blocker
 Attachments: master SBT.png


 See screenshot. It's been OOM'ing on Master SBT tests specifically with the 
 Hadoop 2.3 profile.
 https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-SBT/2563/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.3,label=centos/
 https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-SBT/2582/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.3,label=centos/
 https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-SBT/2573/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.3,label=centos/
 ... 10+ others



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8166) Consistently failing test: o.a.s.sql.sources.FilteredScanSuite

2015-06-08 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-8166:
-
Target Version/s: 1.5.0  (was: 1.4.1, 1.5.0)

 Consistently failing test: o.a.s.sql.sources.FilteredScanSuite
 --

 Key: SPARK-8166
 URL: https://issues.apache.org/jira/browse/SPARK-8166
 Project: Spark
  Issue Type: Bug
  Components: SQL, Tests
Affects Versions: 1.5.0
Reporter: Andrew Or
Assignee: Yin Huai
Priority: Blocker
 Attachments: master SBT.png


 See screenshot. It's been OOM'ing on Master SBT tests specifically with the 
 Hadoop 2.3 profile.
 https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-SBT/2563/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.3,label=centos/
 https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-SBT/2582/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.3,label=centos/
 https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-SBT/2573/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.3,label=centos/
 ... 10+ others



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-8168) Add Python friendly constructor to PipelineModel

2015-06-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8168:
---

Assignee: Apache Spark  (was: Xiangrui Meng)

 Add Python friendly constructor to PipelineModel
 

 Key: SPARK-8168
 URL: https://issues.apache.org/jira/browse/SPARK-8168
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 1.4.0
Reporter: Xiangrui Meng
Assignee: Apache Spark

 We are trying to migrate all Python implementations of Pipeline components to 
 Scala. As part of this effort, PipelineModel should have a Python-friendly 
 constructor.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-8167) Tasks that fail due to YARN preemption can cause job failure

2015-06-08 Thread Patrick Woody (JIRA)
Patrick Woody created SPARK-8167:


 Summary: Tasks that fail due to YARN preemption can cause job 
failure
 Key: SPARK-8167
 URL: https://issues.apache.org/jira/browse/SPARK-8167
 Project: Spark
  Issue Type: Bug
  Components: Scheduler, YARN
Affects Versions: 1.3.1
Reporter: Patrick Woody


Tasks that are running on preempted executors will count as FAILED with an 
ExecutorLostFailure. Unfortunately, this can quickly spiral out of control if a 
large resource shift is occurring, and the tasks get scheduled to executors 
that immediately get preempted as well.

The current workaround is to set spark.task.maxFailures to a very high value, but 
that delays surfacing genuine failures. We should ideally differentiate these 
task statuses so that they don't count towards the failure limit.
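
For reference, a minimal sketch of the workaround described above (the value is 
purely illustrative, not a recommendation):

{code}
import org.apache.spark.{SparkConf, SparkContext}

// Raise the per-task failure limit so that a burst of preemption-induced
// ExecutorLostFailures is less likely to abort the whole job. The trade-off is
// that genuine task failures take longer to surface.
val conf = new SparkConf()
  .setAppName("preemption-tolerant-job")  // hypothetical app name
  .set("spark.task.maxFailures", "100")   // illustrative value only
val sc = new SparkContext(conf)
{code}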



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-8062) NullPointerException in SparkHadoopUtil.getFileSystemThreadStatistics

2015-06-08 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-8062:
--
Comment: was deleted

(was: While working to try to reproduce this bug, I noticed something rather 
curious:

In {{InputOutputMetricsSuite}}, the output metrics tests are guarded by {{if}} 
statements that check whether the bytesWrittenOnThreadCallback is defined:

{code}
test("output metrics when writing text file") {
  val fs = FileSystem.getLocal(new Configuration())
  val outPath = new Path(fs.getWorkingDirectory, "outdir")

  if (SparkHadoopUtil.get.getFSBytesWrittenOnThreadCallback(outPath, fs.getConf).isDefined) {
    // ... Body of test case ...
  }
}
{code}

AFAIK this test was introduced in order to prevent this test's assertions from 
failing under pre-Hadoop-2.5 versions of Hadoop.

Now, take a look at the regression test that I added to try to reproduce this 
bug:

{code}

  test("exceptions while getting IO thread statistics should not fail tasks / jobs (SPARK-8062)") {
    FileSystem.getStatistics(null, classOf[FileSystem])

    val fs = FileSystem.getLocal(new Configuration())
    val outPath = new Path(fs.getWorkingDirectory, "outdir")
    // This test passes unless the following line is commented out.  The following
    // line therefore has some side-effects that are impacting the system under test:
    SparkHadoopUtil.get.getFSBytesWrittenOnThreadCallback(outPath, fs.getConf).isDefined
    val rdd = sc.parallelize(Array("a", "b", "c", "d"), 2)

    try {
      rdd.saveAsTextFile(outPath.toString)
    } finally {
      fs.delete(outPath, true)
    }
  }
{code}

In this test, I try to pollute the global FileSystem statistics registry by 
storing a statistics entry for a filesystem with a null URI.  For this test, 
all I care about is Spark not crashing, so I didn't add the {{if}} check (I 
don't need to worry about the assertions failing on pre-Hadoop-2.5 versions 
here since there aren't any assertions that check the metrics for this test).

Surprisingly, though, my test was unable to fail until I added a 

{code}
SparkHadoopUtil.get.getFSBytesWrittenOnThreadCallback(outPath, fs.getConf).isDefined
{code}

check outside of an {{if}} statement.  This implies that this method has side 
effects which influence whether other metrics retrieval code is called.  I 
worry that this may imply that our other InputOutputMetrics code could be 
broken for real production jobs.  I'd like to investigate this and fix this 
issue, while also hardening this code: I think that we should be performing 
significantly more null checks for the input and output of Hadoop methods and 
should be using a pure function to determine whether our Hadoop version 
supports these metrics rather than calling a method that might have 
side-effects (I think we can do this purely via reflection without actually 
creating any objects / calling any methods).

Since this JIRA is somewhat time sensitive, though, I'm going to work on a 
patch just for the null checks here, then will open a followup to investigate 
further hardening of the input output metrics code.)
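
For illustration only, a minimal sketch of the kind of null check discussed 
above (an assumption about the shape of the fix, not the actual patch), guarding 
against statistics entries whose scheme is null:

{code}
// Hypothetical null-safe variant of the filter inside
// getFileSystemThreadStatistics, relying on the same implicit Java-to-Scala
// collection conversion that the original snippet uses:
val stats = FileSystem.getAllStatistics().filter { s =>
  s.getScheme != null && s.getScheme == scheme
}
{code}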

 NullPointerException in SparkHadoopUtil.getFileSystemThreadStatistics
 -

 Key: SPARK-8062
 URL: https://issues.apache.org/jira/browse/SPARK-8062
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.2.1
 Environment: MapR 4.0.1, Hadoop 2.4.1, Yarn
Reporter: Josh Rosen
Assignee: Josh Rosen
 Fix For: 1.2.3


 I received the following error report from a user:
 While running a Spark Streaming job that reads from MapRfs and writes to 
 HBase using Spark 1.2.1, the job intermittently experiences a total job 
 failure due to the following errors:
 {code}
 15/05/28 10:35:50 ERROR executor.Executor: Exception in task 1.1 in stage 6.0 
 (TID 24) 
 java.lang.NullPointerException 
 at 
 org.apache.spark.deploy.SparkHadoopUtil$$anonfun$4.apply(SparkHadoopUtil.scala:178)
  
 at 
 org.apache.spark.deploy.SparkHadoopUtil$$anonfun$4.apply(SparkHadoopUtil.scala:178)
  
 at 
 scala.collection.TraversableLike$$anonfun$filter$1.apply(TraversableLike.scala:264)
  
 at scala.collection.Iterator$class.foreach(Iterator.scala:727) 
 at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) 
 at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) 
 at scala.collection.AbstractIterable.foreach(Iterable.scala:54) 
 at scala.collection.TraversableLike$class.filter(TraversableLike.scala:263) 
 at scala.collection.AbstractTraversable.filter(Traversable.scala:105) 
 at 
 org.apache.spark.deploy.SparkHadoopUtil.getFileSystemThreadStatistics(SparkHadoopUtil.scala:178)
  
 at 
 org.apache.spark.deploy.SparkHadoopUtil.getFSBytesReadOnThreadCallback(SparkHadoopUtil.scala:139)
  
 at 

[jira] [Resolved] (SPARK-8062) NullPointerException in SparkHadoopUtil.getFileSystemThreadStatistics

2015-06-08 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-8062.
---
   Resolution: Fixed
Fix Version/s: 1.2.3

Issue resolved by pull request 6618
[https://github.com/apache/spark/pull/6618]

 NullPointerException in SparkHadoopUtil.getFileSystemThreadStatistics
 -

 Key: SPARK-8062
 URL: https://issues.apache.org/jira/browse/SPARK-8062
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.2.1
 Environment: MapR 4.0.1, Hadoop 2.4.1, Yarn
Reporter: Josh Rosen
Assignee: Josh Rosen
 Fix For: 1.2.3


 I received the following error report from a user:
 While running a Spark Streaming job that reads from MapRfs and writes to 
 HBase using Spark 1.2.1, the job intermittently experiences a total job 
 failure due to the following errors:
 {code}
 15/05/28 10:35:50 ERROR executor.Executor: Exception in task 1.1 in stage 6.0 
 (TID 24) 
 java.lang.NullPointerException 
 at 
 org.apache.spark.deploy.SparkHadoopUtil$$anonfun$4.apply(SparkHadoopUtil.scala:178)
  
 at 
 org.apache.spark.deploy.SparkHadoopUtil$$anonfun$4.apply(SparkHadoopUtil.scala:178)
  
 at 
 scala.collection.TraversableLike$$anonfun$filter$1.apply(TraversableLike.scala:264)
  
 at scala.collection.Iterator$class.foreach(Iterator.scala:727) 
 at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) 
 at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) 
 at scala.collection.AbstractIterable.foreach(Iterable.scala:54) 
 at scala.collection.TraversableLike$class.filter(TraversableLike.scala:263) 
 at scala.collection.AbstractTraversable.filter(Traversable.scala:105) 
 at 
 org.apache.spark.deploy.SparkHadoopUtil.getFileSystemThreadStatistics(SparkHadoopUtil.scala:178)
  
 at 
 org.apache.spark.deploy.SparkHadoopUtil.getFSBytesReadOnThreadCallback(SparkHadoopUtil.scala:139)
  
 at org.apache.spark.rdd.NewHadoopRDD$$anon$1.init(NewHadoopRDD.scala:116) 
 at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:107) 
 at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:69) 
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280) 
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:247) 
 at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31) 
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280) 
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:247) 
 at org.apache.spark.rdd.FlatMappedRDD.compute(FlatMappedRDD.scala:33) 
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280) 
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:247) 
 at org.apache.spark.rdd.FilteredRDD.compute(FilteredRDD.scala:34) 
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280) 
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:247) 
 at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68) 
 at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) 
 at org.apache.spark.scheduler.Task.run(Task.scala:56) 
 at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:200) 
 at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
  
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
  
 at java.lang.Thread.run(Thread.java:744) 
 15/05/28 10:35:50 INFO executor.CoarseGrainedExecutorBackend: Got assigned 
 task 25 
 15/05/28 10:35:50 INFO executor.Executor: Running task 2.1 in stage 6.0 (TID 
 25) 
 15/05/28 10:35:50 INFO rdd.NewHadoopRDD: Input split: hdfs:/[REDACTED] 
 15/05/28 10:35:50 ERROR executor.Executor: Exception in task 2.1 in stage 6.0 
 (TID 25) 
 java.lang.NullPointerException 
 at 
 org.apache.spark.deploy.SparkHadoopUtil$$anonfun$4.apply(SparkHadoopUtil.scala:178)
  
 at 
 org.apache.spark.deploy.SparkHadoopUtil$$anonfun$4.apply(SparkHadoopUtil.scala:178)
  
 at 
 scala.collection.TraversableLike$$anonfun$filter$1.apply(TraversableLike.scala:264)
  
 at scala.collection.Iterator$class.foreach(Iterator.scala:727) 
 at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) 
 at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) 
 {code}
 Diving into the code here:
 The NPE is occurring on this line of SparkHadoopUtil (in 1.2.1.): 
 https://github.com/apache/spark/blob/v1.2.1/core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala#L178
 Here's that block of code from 1.2.1 (it's the same in 1.2.2):
 {code}
   private def getFileSystemThreadStatistics(path: Path, conf: Configuration): Seq[AnyRef] = {
     val qualifiedPath = path.getFileSystem(conf).makeQualified(path)
     val scheme = qualifiedPath.toUri().getScheme()
     val stats = FileSystem.getAllStatistics().filter(_.getScheme().equals(scheme))   //

[jira] [Issue Comment Deleted] (SPARK-8062) NullPointerException in SparkHadoopUtil.getFileSystemThreadStatistics

2015-06-08 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-8062:
--
Comment: was deleted

(was: Alright, I've filed https://issues.apache.org/jira/browse/SPARK-8086 to 
follow up on hardening InputOutputMetricsSuite.)

 NullPointerException in SparkHadoopUtil.getFileSystemThreadStatistics
 -

 Key: SPARK-8062
 URL: https://issues.apache.org/jira/browse/SPARK-8062
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.2.1
 Environment: MapR 4.0.1, Hadoop 2.4.1, Yarn
Reporter: Josh Rosen
Assignee: Josh Rosen
 Fix For: 1.2.3


 I received the following error report from a user:
 While running a Spark Streaming job that reads from MapRfs and writes to 
 HBase using Spark 1.2.1, the job intermittently experiences a total job 
 failure due to the following errors:
 {code}
 15/05/28 10:35:50 ERROR executor.Executor: Exception in task 1.1 in stage 6.0 
 (TID 24) 
 java.lang.NullPointerException 
 at 
 org.apache.spark.deploy.SparkHadoopUtil$$anonfun$4.apply(SparkHadoopUtil.scala:178)
  
 at 
 org.apache.spark.deploy.SparkHadoopUtil$$anonfun$4.apply(SparkHadoopUtil.scala:178)
  
 at 
 scala.collection.TraversableLike$$anonfun$filter$1.apply(TraversableLike.scala:264)
  
 at scala.collection.Iterator$class.foreach(Iterator.scala:727) 
 at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) 
 at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) 
 at scala.collection.AbstractIterable.foreach(Iterable.scala:54) 
 at scala.collection.TraversableLike$class.filter(TraversableLike.scala:263) 
 at scala.collection.AbstractTraversable.filter(Traversable.scala:105) 
 at 
 org.apache.spark.deploy.SparkHadoopUtil.getFileSystemThreadStatistics(SparkHadoopUtil.scala:178)
  
 at 
 org.apache.spark.deploy.SparkHadoopUtil.getFSBytesReadOnThreadCallback(SparkHadoopUtil.scala:139)
  
 at org.apache.spark.rdd.NewHadoopRDD$$anon$1.init(NewHadoopRDD.scala:116) 
 at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:107) 
 at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:69) 
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280) 
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:247) 
 at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31) 
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280) 
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:247) 
 at org.apache.spark.rdd.FlatMappedRDD.compute(FlatMappedRDD.scala:33) 
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280) 
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:247) 
 at org.apache.spark.rdd.FilteredRDD.compute(FilteredRDD.scala:34) 
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280) 
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:247) 
 at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68) 
 at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) 
 at org.apache.spark.scheduler.Task.run(Task.scala:56) 
 at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:200) 
 at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
  
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
  
 at java.lang.Thread.run(Thread.java:744) 
 15/05/28 10:35:50 INFO executor.CoarseGrainedExecutorBackend: Got assigned 
 task 25 
 15/05/28 10:35:50 INFO executor.Executor: Running task 2.1 in stage 6.0 (TID 
 25) 
 15/05/28 10:35:50 INFO rdd.NewHadoopRDD: Input split: hdfs:/[REDACTED] 
 15/05/28 10:35:50 ERROR executor.Executor: Exception in task 2.1 in stage 6.0 
 (TID 25) 
 java.lang.NullPointerException 
 at 
 org.apache.spark.deploy.SparkHadoopUtil$$anonfun$4.apply(SparkHadoopUtil.scala:178)
  
 at 
 org.apache.spark.deploy.SparkHadoopUtil$$anonfun$4.apply(SparkHadoopUtil.scala:178)
  
 at 
 scala.collection.TraversableLike$$anonfun$filter$1.apply(TraversableLike.scala:264)
  
 at scala.collection.Iterator$class.foreach(Iterator.scala:727) 
 at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) 
 at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) 
 {code}
 Diving into the code here:
 The NPE is occurring on this line of SparkHadoopUtil (in 1.2.1.): 
 https://github.com/apache/spark/blob/v1.2.1/core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala#L178
 Here's that block of code from 1.2.1 (it's the same in 1.2.2):
 {code}
   private def getFileSystemThreadStatistics(path: Path, conf: Configuration): Seq[AnyRef] = {
     val qualifiedPath = path.getFileSystem(conf).makeQualified(path)
     val scheme = qualifiedPath.toUri().getScheme()
     val stats =
 

[jira] [Commented] (SPARK-2344) Add Fuzzy C-Means algorithm to MLlib

2015-06-08 Thread Alex (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14577599#comment-14577599
 ] 

Alex commented on SPARK-2344:
-

Hi guys,
What is the status of this issue? 

Beniamino - are you planning to submit your version of the algorithm?


 Add Fuzzy C-Means algorithm to MLlib
 

 Key: SPARK-2344
 URL: https://issues.apache.org/jira/browse/SPARK-2344
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Alex
Priority: Minor
  Labels: clustering
   Original Estimate: 1m
  Remaining Estimate: 1m

 I would like to add an FCM (Fuzzy C-Means) algorithm to MLlib.
 FCM is very similar to K-Means, which is already implemented; they 
 differ only in the degree of relationship each point has with each cluster
 (in FCM the relationship is a value in the range [0..1], whereas in K-Means it is 0/1).
 As part of the implementation I would like to:
 - create a base class for K-Means and FCM
 - implement the relationship for each algorithm differently (in its own class)
 I'd like this to be assigned to me.
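
 As a purely illustrative sketch of the [0..1] membership degree mentioned above 
 (not part of the proposal itself), the standard FCM membership update for one 
 point against a set of cluster centers, with fuzziness parameter m > 1, looks 
 roughly like this:
 {code}
 // Illustrative only: fuzzy membership degrees of point x in each cluster,
 // u(j) = 1 / sum_k (d_j / d_k)^(2 / (m - 1)); the memberships sum to 1.
 def fcmMemberships(x: Array[Double], centers: Array[Array[Double]], m: Double): Array[Double] = {
   def dist(a: Array[Double], b: Array[Double]): Double =
     math.sqrt(a.zip(b).map { case (ai, bi) => (ai - bi) * (ai - bi) }.sum)
   val d = centers.map(c => math.max(dist(x, c), 1e-12)) // guard against zero distance
   d.map(dj => 1.0 / d.map(dk => math.pow(dj / dk, 2.0 / (m - 1.0))).sum)
 }
 {code}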



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-8116) sc.range() doesn't match python range()

2015-06-08 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-8116.

   Resolution: Fixed
Fix Version/s: 1.5.0
   1.4.1
 Assignee: Ted Blackman

 sc.range() doesn't match python range()
 ---

 Key: SPARK-8116
 URL: https://issues.apache.org/jira/browse/SPARK-8116
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.4.0, 1.4.1
Reporter: Ted Blackman
Assignee: Ted Blackman
Priority: Minor
  Labels: easyfix
 Fix For: 1.4.1, 1.5.0


 Python's built-in range() and xrange() functions can take 1, 2, or 3 
 arguments. Ranges with just 1 argument are probably used the most frequently, 
 e.g.:
 for i in range(len(myList)): ...
 However, in PySpark, the SparkContext range() method throws an error when 
 called with a single argument, due to the way its arguments get passed into 
 Python's range function.
 There's no good reason that I can think of not to support the same syntax as 
 the built-in function. To fix this, we can set the default of the sc.range() 
 method's `stop` argument to None, and then inside the method, if it is None, 
 replace `stop` with `start` and set `start` to 0, which is what the C 
 implementation of range() does:
 https://github.com/python/cpython/blob/master/Objects/rangeobject.c#L87
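
 A rough sketch of that normalization (shown in Scala for brevity; the real 
 change belongs in PySpark's SparkContext.range() method):
 {code}
 // Illustrative only: with a single argument, treat it as `stop` and default
 // `start` to 0, matching Python's built-in range(n) == range(0, n).
 def normalizeRange(start: Long, stop: Option[Long]): (Long, Long) = stop match {
   case None    => (0L, start)
   case Some(s) => (start, s)
 }
 {code}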



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7426) spark.ml AttributeFactory.fromStructField should allow other NumericTypes

2015-06-08 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14577648#comment-14577648
 ] 

Joseph K. Bradley commented on SPARK-7426:
--

Hi, I'm sorry for the slow review!  I'll try to look soon...

 spark.ml AttributeFactory.fromStructField should allow other NumericTypes
 -

 Key: SPARK-7426
 URL: https://issues.apache.org/jira/browse/SPARK-7426
 Project: Spark
  Issue Type: Improvement
  Components: ML
Reporter: Joseph K. Bradley
Priority: Minor

 It currently only supports DoubleType, but it should support others, at least 
 for fromStructField (importing into ML attribute format, rather than 
 exporting).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-8166) Failing test: o.a.s.sql.sources.FilterScanSuite

2015-06-08 Thread Andrew Or (JIRA)
Andrew Or created SPARK-8166:


 Summary: Failing test: o.a.s.sql.sources.FilterScanSuite
 Key: SPARK-8166
 URL: https://issues.apache.org/jira/browse/SPARK-8166
 Project: Spark
  Issue Type: Bug
  Components: SQL, Tests
Affects Versions: 1.4.1
Reporter: Andrew Or
Assignee: Yin Huai
Priority: Blocker


It's been OOM'ing on Master SBT tests specifically with the Hadoop 2.3 profile.

https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-SBT/2563/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.3,label=centos/
https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-SBT/2582/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.3,label=centos/
https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-SBT/2573/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.3,label=centos/
... 10+ others



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8166) Consistently failing test: o.a.s.sql.sources.FilterScanSuite

2015-06-08 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-8166:
-
Summary: Consistently failing test: o.a.s.sql.sources.FilterScanSuite  
(was: Failing test: o.a.s.sql.sources.FilterScanSuite)

 Consistently failing test: o.a.s.sql.sources.FilterScanSuite
 

 Key: SPARK-8166
 URL: https://issues.apache.org/jira/browse/SPARK-8166
 Project: Spark
  Issue Type: Bug
  Components: SQL, Tests
Affects Versions: 1.4.1
Reporter: Andrew Or
Assignee: Yin Huai
Priority: Blocker

 It's been OOM'ing on Master SBT tests specifically with the Hadoop 2.3 
 profile.
 https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-SBT/2563/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.3,label=centos/
 https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-SBT/2582/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.3,label=centos/
 https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-SBT/2573/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.3,label=centos/
 ... 10+ others



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8166) Consistently failing test: o.a.s.sql.sources.FilteredScanSuite

2015-06-08 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-8166:
-
Summary: Consistently failing test: o.a.s.sql.sources.FilteredScanSuite  
(was: Consistently failing test: o.a.s.sql.sources.FilterScanSuite)

 Consistently failing test: o.a.s.sql.sources.FilteredScanSuite
 --

 Key: SPARK-8166
 URL: https://issues.apache.org/jira/browse/SPARK-8166
 Project: Spark
  Issue Type: Bug
  Components: SQL, Tests
Affects Versions: 1.4.1, 1.5.0
Reporter: Andrew Or
Assignee: Yin Huai
Priority: Blocker
 Attachments: master SBT.png


 See screenshot. It's been OOM'ing on Master SBT tests specifically with the 
 Hadoop 2.3 profile.
 https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-SBT/2563/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.3,label=centos/
 https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-SBT/2582/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.3,label=centos/
 https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-SBT/2573/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.3,label=centos/
 ... 10+ others



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8102) Big performance difference when joining 3 tables in different order

2015-06-08 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14577602#comment-14577602
 ] 

Yin Huai commented on SPARK-8102:
-

For the first query,
{code}
-- snippet 1
SELECT g.period, c.categoryName, z.regionName, action, list_id, cnt
FROM t_category c, t_zipcode z, click_meter_site_grouped g
WHERE c.refCategoryID = g.category AND z.regionCode = g.region
{code}
I guess the analyzer got confused because the first two tables are c and z but 
there is no join condition between them. So, a {{CartesianProduct}} was used. 
Although your tables may be small, the result of a {{CartesianProduct}} can 
be huge. 

I guess we can do something in the analyzer to get rid of the 
{{CartesianProduct}}.
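
Purely as an illustration of that point (not from the original report), giving 
every join an explicit key avoids the {{CartesianProduct}} regardless of table 
order:

{code}
// Hypothetical DataFrame-API sketch, assuming g, c and z are DataFrames for
// click_meter_site_grouped, t_category and t_zipcode respectively.
val joined = g
  .join(c, c("refCategoryID") === g("category"))
  .join(z, z("regionCode") === g("region"))
{code}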

 Big performance difference when joining 3 tables in different order
 ---

 Key: SPARK-8102
 URL: https://issues.apache.org/jira/browse/SPARK-8102
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.3.1
 Environment: spark in local mode
Reporter: Hao Ren
 Attachments: query2job.png, query3job.png


 Given 3 tables loaded from CSV files 
 (table name = size):
 *click_meter_site_grouped* = 10 687 455 bytes
 *t_zipcode* = 2 738 954 bytes
 *t_category* = 2 182 bytes
 When joining the 3 tables, I notice a large performance difference depending on 
 the order in which they are joined.
 Here are the SQL queries to compare:
 {code}
 -- snippet 1
 SELECT g.period, c.categoryName, z.regionName, action, list_id, cnt
 FROM t_category c, t_zipcode z, click_meter_site_grouped g
 WHERE c.refCategoryID = g.category AND z.regionCode = g.region
 {code}
 {code}
 -- snippet 2
 SELECT g.period, c.categoryName, z.regionName, action, list_id, cnt
 FROM t_category c, click_meter_site_grouped g, t_zipcode z
 WHERE c.refCategoryID = g.category AND z.regionCode = g.region
 {code}
 As you can see, the largest table *click_meter_site_grouped* is the last table in 
 the FROM clause in the first snippet, and it is in the middle of the table list in 
 the second one.
 Snippet 2 runs three times faster than Snippet 1
 (8 seconds vs 24 seconds).
 As the data is just a sample of a larger data set, testing on the 
 original data set would normally turn this into a real performance issue.
 After checking the logs, we found something strange in snippet 1's log:
 15/06/04 15:32:03 INFO HadoopRDD: Input split: 
 file:/home/invkrh/workspace/java/data_spark_etl/data-sample/bconf/zipcodes_4.txt:0+2738954
 15/06/04 15:32:04 INFO HadoopRDD: Input split: 
 file:/home/invkrh/workspace/java/data_spark_etl/data-sample/bconf/zipcodes_4.txt:0+2738954
 15/06/04 15:32:04 INFO HadoopRDD: Input split: 
 file:/home/invkrh/workspace/java/data_spark_etl/data-sample/bconf/zipcodes_4.txt:0+2738954
 15/06/04 15:32:05 INFO HadoopRDD: Input split: 
 file:/home/invkrh/workspace/java/data_spark_etl/data-sample/bconf/zipcodes_4.txt:0+2738954
 15/06/04 15:32:05 INFO HadoopRDD: Input split: 
 file:/home/invkrh/workspace/java/data_spark_etl/data-sample/bconf/zipcodes_4.txt:0+2738954
 15/06/04 15:32:05 INFO HadoopRDD: Input split: 
 file:/home/invkrh/workspace/java/data_spark_etl/data-sample/bconf/zipcodes_4.txt:0+2738954
 15/06/04 15:32:05 INFO HadoopRDD: Input split: 
 file:/home/invkrh/workspace/java/data_spark_etl/data-sample/bconf/zipcodes_4.txt:0+2738954
 15/06/04 15:32:06 INFO HadoopRDD: Input split: 
 file:/home/invkrh/workspace/java/data_spark_etl/data-sample/bconf/zipcodes_4.txt:0+2738954
 15/06/04 15:32:06 INFO HadoopRDD: Input split: 
 file:/home/invkrh/workspace/java/data_spark_etl/data-sample/bconf/zipcodes_4.txt:0+2738954
 15/06/04 15:32:06 INFO HadoopRDD: Input split: 
 file:/home/invkrh/workspace/java/data_spark_etl/data-sample/bconf/zipcodes_4.txt:0+2738954
 15/06/04 15:32:07 INFO HadoopRDD: Input split: 
 file:/home/invkrh/workspace/java/data_spark_etl/data-sample/bconf/zipcodes_4.txt:0+2738954
 15/06/04 15:32:07 INFO HadoopRDD: Input split: 
 file:/home/invkrh/workspace/java/data_spark_etl/data-sample/bconf/zipcodes_4.txt:0+2738954
 15/06/04 15:32:07 INFO HadoopRDD: Input split: 
 file:/home/invkrh/workspace/java/data_spark_etl/data-sample/bconf/zipcodes_4.txt:0+2738954
 15/06/04 15:32:07 INFO HadoopRDD: Input split: 
 file:/home/invkrh/workspace/java/data_spark_etl/data-sample/bconf/zipcodes_4.txt:0+2738954
 15/06/04 15:32:08 INFO HadoopRDD: Input split: 
 file:/home/invkrh/workspace/java/data_spark_etl/data-sample/bconf/zipcodes_4.txt:0+2738954
 15/06/04 15:32:08 INFO HadoopRDD: Input split: 
 file:/home/invkrh/workspace/java/data_spark_etl/data-sample/bconf/zipcodes_4.txt:0+2738954
 15/06/04 15:32:08 INFO HadoopRDD: Input split: 
 file:/home/invkrh/workspace/java/data_spark_etl/data-sample/bconf/zipcodes_4.txt:0+2738954
 15/06/04 15:32:09 INFO HadoopRDD: Input split: 
 

[jira] [Updated] (SPARK-8166) Consistently failing test: o.a.s.sql.sources.FilteredScanSuite

2015-06-08 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-8166:
-
Affects Version/s: 1.5.0

 Consistently failing test: o.a.s.sql.sources.FilteredScanSuite
 --

 Key: SPARK-8166
 URL: https://issues.apache.org/jira/browse/SPARK-8166
 Project: Spark
  Issue Type: Bug
  Components: SQL, Tests
Affects Versions: 1.4.1, 1.5.0
Reporter: Andrew Or
Assignee: Yin Huai
Priority: Blocker
 Attachments: master SBT.png


 See screenshot. It's been OOM'ing on Master SBT tests specifically with the 
 Hadoop 2.3 profile.
 https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-SBT/2563/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.3,label=centos/
 https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-SBT/2582/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.3,label=centos/
 https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-SBT/2573/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.3,label=centos/
 ... 10+ others



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7426) spark.ml AttributeFactory.fromStructField should allow other NumericTypes

2015-06-08 Thread Mike Dusenberry (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14577629#comment-14577629
 ] 

Mike Dusenberry commented on SPARK-7426:


[~josephkb] Do you have any thoughts on this PR?

 spark.ml AttributeFactory.fromStructField should allow other NumericTypes
 -

 Key: SPARK-7426
 URL: https://issues.apache.org/jira/browse/SPARK-7426
 Project: Spark
  Issue Type: Improvement
  Components: ML
Reporter: Joseph K. Bradley
Priority: Minor

 It currently only supports DoubleType, but it should support others, at least 
 for fromStructField (importing into ML attribute format, rather than 
 exporting).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8167) Tasks that fail due to YARN preemption can cause job failure

2015-06-08 Thread Matt Cheah (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14577727#comment-14577727
 ] 

Matt Cheah commented on SPARK-8167:
---

To be clear, this is independent of SPARK-7451. SPARK-7451 helps for the case 
where executors die too many times from preemption, but it doesn't help if 
the exact same task gets preempted many times.

 Tasks that fail due to YARN preemption can cause job failure
 

 Key: SPARK-8167
 URL: https://issues.apache.org/jira/browse/SPARK-8167
 Project: Spark
  Issue Type: Bug
  Components: Scheduler, YARN
Affects Versions: 1.3.1
Reporter: Patrick Woody

 Tasks that are running on preempted executors will count as FAILED with an 
 ExecutorLostFailure. Unfortunately, this can quickly spiral out of control if 
 a large resource shift is occurring, and the tasks get scheduled to executors 
 that immediately get preempted as well.
 The current workaround is to set spark.task.maxFailures to a very high value, but 
 that delays surfacing genuine failures. We should ideally differentiate these 
 task statuses so that they don't count towards the failure limit.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-1403) Spark on Mesos does not set Thread's context class loader

2015-06-08 Thread Kannan Rajah (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14577552#comment-14577552
 ] 

Kannan Rajah commented on SPARK-1403:
-

A similar problem has been reported while running Spark on Mesos using the MapR 
distribution. The current thread's context class loader is NULL inside the 
executor process, causing an NPE in MapR code.

Refer to this discussion.
http://answers.mapr.com/questions/163353/spark-from-apache-downloads-site-for-mapr.html#answer-163484

 Spark on Mesos does not set Thread's context class loader
 -

 Key: SPARK-1403
 URL: https://issues.apache.org/jira/browse/SPARK-1403
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.0
 Environment: ubuntu 12.04 on vagrant
Reporter: Bharath Bhushan
Priority: Blocker
 Fix For: 1.0.0


 I can run Spark 0.9.0 on Mesos but not Spark 1.0.0. This is because the Spark 
 executor on the Mesos slave throws a java.lang.ClassNotFoundException for 
 org.apache.spark.serializer.JavaSerializer.
 The lengthy discussion is here: 
 http://apache-spark-user-list.1001560.n3.nabble.com/java-lang-ClassNotFoundException-spark-on-mesos-td3510.html#a3513



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-8121) When using with Hadoop 1.x, spark.sql.parquet.output.committer.class is overriden by spark.sql.sources.outputCommitterClass

2015-06-08 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai resolved SPARK-8121.
-
   Resolution: Fixed
Fix Version/s: 1.4.1

Issue resolved by pull request 6705
[https://github.com/apache/spark/pull/6705]

 When using with Hadoop 1.x, spark.sql.parquet.output.committer.class is 
 overriden by spark.sql.sources.outputCommitterClass
 ---

 Key: SPARK-8121
 URL: https://issues.apache.org/jira/browse/SPARK-8121
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.4.0
Reporter: Cheng Lian
Assignee: Cheng Lian
 Fix For: 1.4.1


 When using Spark with Hadoop 1.x (the version I tested is 1.2.0) and 
 {{spark.sql.sources.outputCommitterClass}} is configured, 
 {{spark.sql.parquet.output.committer.class}} will be overridden. 
 For example, if {{spark.sql.parquet.output.committer.class}} is set to 
 {{DirectParquetOutputCommitter}} while {{spark.sql.sources.outputCommitterClass}} is 
 set to {{FileOutputCommitter}}, neither {{_metadata}} nor 
 {{_common_metadata}} will be written because {{FileOutputCommitter}} 
 overrides {{DirectParquetOutputCommitter}}.
 The reason is that {{InsertIntoHadoopFsRelation}} initializes the 
 {{TaskAttemptContext}} before calling 
 {{ParquetRelation2.prepareForWriteJob()}}, which sets up the Parquet output 
 committer class. Meanwhile, in Hadoop 1.x, the {{TaskAttemptContext}} 
 constructor clones the job configuration, and thus doesn't share the job 
 configuration passed to {{ParquetRelation2.prepareForWriteJob()}}.
 This issue can be fixed by simply [switching these two 
 lines|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/sources/commands.scala#L285-L286].
 Here is a Spark shell snippet for reproducing this issue:
 {code}
 import sqlContext._
 sc.hadoopConfiguration.set(
   "spark.sql.sources.outputCommitterClass",
   "org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter")
 sc.hadoopConfiguration.set(
   "spark.sql.parquet.output.committer.class",
   "org.apache.spark.sql.parquet.DirectParquetOutputCommitter")
 range(0, 1).write.mode("overwrite").parquet("file:///tmp/foo")
 {code}
 Then check {{/tmp/foo}}, Parquet summary files are missing:
 {noformat}
 /tmp/foo
 ├── _SUCCESS
 ├── part-r-1.gz.parquet
 ├── part-r-2.gz.parquet
 ├── part-r-3.gz.parquet
 ├── part-r-4.gz.parquet
 ├── part-r-5.gz.parquet
 ├── part-r-6.gz.parquet
 ├── part-r-7.gz.parquet
 └── part-r-8.gz.parquet
 {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-8168) Add Python friendly constructor to PipelineModel

2015-06-08 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-8168:


 Summary: Add Python friendly constructor to PipelineModel
 Key: SPARK-8168
 URL: https://issues.apache.org/jira/browse/SPARK-8168
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 1.4.0
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng


We are trying to migrate all Python implementations of Pipeline components to 
Scala. As part of this effort, PipelineModel should have a Python-friendly 
constructor.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-8158) HiveShim improvement

2015-06-08 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-8158.

   Resolution: Fixed
Fix Version/s: 1.5.0
 Assignee: Adrian Wang

 HiveShim improvement
 

 Key: SPARK-8158
 URL: https://issues.apache.org/jira/browse/SPARK-8158
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Adrian Wang
Assignee: Adrian Wang
 Fix For: 1.5.0


 1. explicitly import implicit conversion support
 2. use .nonEmpty instead of .size > 0
 3. use val instead of var
 4. fix comment indentation
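
 For readers unfamiliar with these idioms, a tiny before/after sketch of items 2 
 and 3 (illustrative only, with made-up variable names, not the actual diff):
 {code}
 // before
 var matched = partitions.filter(p => predicate(p))
 if (matched.size > 0) { /* ... */ }

 // after
 val matched = partitions.filter(p => predicate(p))
 if (matched.nonEmpty) { /* ... */ }
 {code}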



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7917) Spark doesn't clean up Application Directories (local dirs)

2015-06-08 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7917?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-7917:
-
Priority: Minor  (was: Major)

 Spark doesn't clean up Application Directories (local dirs) 
 

 Key: SPARK-7917
 URL: https://issues.apache.org/jira/browse/SPARK-7917
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.3.0
Reporter: Zach Fry
Priority: Minor

 Similar to SPARK-4834. 
 Spark does clean up the cache and lock files in the local dirs; however, it 
 doesn't clean up the actual directories. 
 We have to write custom scripts to go back through the local dirs, find 
 directories that don't contain any files, and clear those out. 
 It's a pretty simple repro: 
 run a job that does some shuffling, wait for the shuffle files to get cleaned 
 up, then look on disk at spark.local.dir and notice that the directories 
 are still there, but there are no files in them.
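
 A minimal sketch of that repro (a hypothetical job; any job that shuffles will do):
 {code}
 // Run a small shuffle job, let the shuffle files get cleaned up, then inspect
 // the directories under spark.local.dir: they remain on disk, just empty.
 val data = sc.parallelize(1 to 1000000).map(i => (i % 100, i))
 data.reduceByKey(_ + _).count()
 {code}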



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-8148) Do not use FloatType in partition column inference

2015-06-08 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-8148.

   Resolution: Fixed
Fix Version/s: (was: 1.4.0)
   1.5.0

 Do not use FloatType in partition column inference
 --

 Key: SPARK-8148
 URL: https://issues.apache.org/jira/browse/SPARK-8148
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Assignee: Reynold Xin
 Fix For: 1.5.0


 Always use DoubleType to be more stable and less error prone.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8167) Tasks that fail due to YARN preemption can cause job failure

2015-06-08 Thread Patrick Woody (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Woody updated SPARK-8167:
-
Priority: Critical  (was: Major)

 Tasks that fail due to YARN preemption can cause job failure
 

 Key: SPARK-8167
 URL: https://issues.apache.org/jira/browse/SPARK-8167
 Project: Spark
  Issue Type: Bug
  Components: Scheduler, YARN
Affects Versions: 1.3.1
Reporter: Patrick Woody
Priority: Critical

 Tasks that are running on preempted executors will count as FAILED with an 
 ExecutorLostFailure. Unfortunately, this can quickly spiral out of control if 
 a large resource shift is occurring, and the tasks get scheduled to executors 
 that immediately get preempted as well.
 The current workaround is to set spark.task.maxFailures to a very high value, but 
 that delays surfacing genuine failures. We should ideally differentiate these 
 task statuses so that they don't count towards the failure limit.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8126) Use temp directory under build dir for unit tests

2015-06-08 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-8126:
-
Fix Version/s: 1.4.1

 Use temp directory under build dir for unit tests
 -

 Key: SPARK-8126
 URL: https://issues.apache.org/jira/browse/SPARK-8126
 Project: Spark
  Issue Type: Improvement
  Components: Build
Reporter: Marcelo Vanzin
Assignee: Marcelo Vanzin
Priority: Minor
  Labels: backport-needed
 Fix For: 1.4.1, 1.5.0


 Spark's unit tests leave a lot of garbage in /tmp after a run, making it hard 
 to clean things up. Let's place those files under the build dir so that 
 mvn|sbt|git clean can do their job.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-8169) Add StopWordsRemover as a transformer

2015-06-08 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-8169:


 Summary: Add StopWordsRemover as a transformer
 Key: SPARK-8169
 URL: https://issues.apache.org/jira/browse/SPARK-8169
 Project: Spark
  Issue Type: New Feature
  Components: ML
Affects Versions: 1.5.0
Reporter: Xiangrui Meng


StopWordsRemover takes a string array column and outputs a string array column 
with all defined stop words removed. The transformer should also come with a 
standard set of stop words by default.

{code}
val stopWords = new StopWordsRemover()
  .setInputCol("words")
  .setOutputCol("cleanWords")
  .setStopWords(Array(...)) // optional
val output = stopWords.transform(df)
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-8172) Driver UI should enable viewing of dead executors' logs

2015-06-08 Thread Josh Rosen (JIRA)
Josh Rosen created SPARK-8172:
-

 Summary: Driver UI should enable viewing of dead executors' logs
 Key: SPARK-8172
 URL: https://issues.apache.org/jira/browse/SPARK-8172
 Project: Spark
  Issue Type: New Feature
  Components: Web UI
Reporter: Josh Rosen


If possible, the Spark driver UI's executor page should include a list of dead 
executors (perhaps of bounded size) and should have log viewer links for 
viewing those dead executors' logs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8162) Run spark-shell cause NullPointerException

2015-06-08 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14578141#comment-14578141
 ] 

Apache Spark commented on SPARK-8162:
-

User 'andrewor14' has created a pull request for this issue:
https://github.com/apache/spark/pull/6711

 Run spark-shell cause NullPointerException
 --

 Key: SPARK-8162
 URL: https://issues.apache.org/jira/browse/SPARK-8162
 Project: Spark
  Issue Type: Bug
  Components: Build, Spark Shell
Affects Versions: 1.5.0
Reporter: Weizhong

 Running spark-shell on the latest master branch fails; details are:
 {noformat}
 Welcome to
     __
  / __/__  ___ _/ /__
 _\ \/ _ \/ _ `/ __/  '_/
/___/ .__/\_,_/_/ /_/\_\   version 1.5.0-SNAPSHOT
   /_/
 Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_40)
 Type in expressions to have them evaluated.
 Type :help for more information.
 error: error while loading JobProgressListener, Missing dependency 'bad 
 symbolic reference. A signature in JobProgressListener.class refers to term 
 annotations
 in package com.google.common which is not available.
 It may be completely missing from the current classpath, or the version on
 the classpath might be incompatible with the version used when compiling 
 JobProgressListener.class.', required by 
 /opt/apache/spark/lib/spark-assembly-1.5.0-SNAPSHOT-hadoop2.7.0.jar(org/apache/spark/ui/jobs/JobProgressListener.class)
 java.lang.NullPointerException
   at org.apache.spark.sql.SQLContext.init(SQLContext.scala:193)
   at org.apache.spark.sql.hive.HiveContext.init(HiveContext.scala:68)
   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
   at 
 sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
   at 
 sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
   at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
   at 
 org.apache.spark.repl.SparkILoop.createSQLContext(SparkILoop.scala:1028)
   at $iwC$$iwC.init(console:9)
   at $iwC.init(console:18)
   at init(console:20)
   at .init(console:24)
   at .clinit(console)
   at .init(console:7)
   at .clinit(console)
   at $print(console)
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
   at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   at java.lang.reflect.Method.invoke(Method.java:497)
   at 
 org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
   at 
 org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1338)
   at 
 org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
   at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
   at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
   at 
 org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:857)
   at 
 org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902)
   at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:814)
   at 
 org.apache.spark.repl.SparkILoopInit$$anonfun$initializeSpark$1.apply(SparkILoopInit.scala:130)
   at 
 org.apache.spark.repl.SparkILoopInit$$anonfun$initializeSpark$1.apply(SparkILoopInit.scala:122)
   at org.apache.spark.repl.SparkIMain.beQuietDuring(SparkIMain.scala:324)
   at 
 org.apache.spark.repl.SparkILoopInit$class.initializeSpark(SparkILoopInit.scala:122)
   at org.apache.spark.repl.SparkILoop.initializeSpark(SparkILoop.scala:64)
   at 
 org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1$$anonfun$apply$mcZ$sp$5.apply$mcV$sp(SparkILoop.scala:974)
   at 
 org.apache.spark.repl.SparkILoopInit$class.runThunks(SparkILoopInit.scala:157)
   at org.apache.spark.repl.SparkILoop.runThunks(SparkILoop.scala:64)
   at 
 org.apache.spark.repl.SparkILoopInit$class.postInitialization(SparkILoopInit.scala:106)
   at 
 org.apache.spark.repl.SparkILoop.postInitialization(SparkILoop.scala:64)
   at 
 org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply$mcZ$sp(SparkILoop.scala:991)
   at 
 org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
   at 
 org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
   at 
 scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
   at 
 

[jira] [Updated] (SPARK-8162) Run spark-shell cause NullPointerException

2015-06-08 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-8162:
-
Affects Version/s: 1.5.0

 Run spark-shell cause NullPointerException
 --

 Key: SPARK-8162
 URL: https://issues.apache.org/jira/browse/SPARK-8162
 Project: Spark
  Issue Type: Bug
  Components: Build, Spark Shell
Affects Versions: 1.5.0
Reporter: Weizhong

 Running spark-shell on the latest master branch fails; details:
 {noformat}
 Welcome to
     __
  / __/__  ___ _/ /__
 _\ \/ _ \/ _ `/ __/  '_/
/___/ .__/\_,_/_/ /_/\_\   version 1.5.0-SNAPSHOT
   /_/
 Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_40)
 Type in expressions to have them evaluated.
 Type :help for more information.
 error: error while loading JobProgressListener, Missing dependency 'bad 
 symbolic reference. A signature in JobProgressListener.class refers to term 
 annotations
 in package com.google.common which is not available.
 It may be completely missing from the current classpath, or the version on
 the classpath might be incompatible with the version used when compiling 
 JobProgressListener.class.', required by 
 /opt/apache/spark/lib/spark-assembly-1.5.0-SNAPSHOT-hadoop2.7.0.jar(org/apache/spark/ui/jobs/JobProgressListener.class)
 java.lang.NullPointerException
   at org.apache.spark.sql.SQLContext.init(SQLContext.scala:193)
   at org.apache.spark.sql.hive.HiveContext.init(HiveContext.scala:68)
   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
   at 
 sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
   at 
 sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
   at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
   at 
 org.apache.spark.repl.SparkILoop.createSQLContext(SparkILoop.scala:1028)
   at $iwC$$iwC.init(console:9)
   at $iwC.init(console:18)
   at init(console:20)
   at .init(console:24)
   at .clinit(console)
   at .init(console:7)
   at .clinit(console)
   at $print(console)
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
   at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   at java.lang.reflect.Method.invoke(Method.java:497)
   at 
 org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
   at 
 org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1338)
   at 
 org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
   at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
   at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
   at 
 org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:857)
   at 
 org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902)
   at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:814)
   at 
 org.apache.spark.repl.SparkILoopInit$$anonfun$initializeSpark$1.apply(SparkILoopInit.scala:130)
   at 
 org.apache.spark.repl.SparkILoopInit$$anonfun$initializeSpark$1.apply(SparkILoopInit.scala:122)
   at org.apache.spark.repl.SparkIMain.beQuietDuring(SparkIMain.scala:324)
   at 
 org.apache.spark.repl.SparkILoopInit$class.initializeSpark(SparkILoopInit.scala:122)
   at org.apache.spark.repl.SparkILoop.initializeSpark(SparkILoop.scala:64)
   at 
 org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1$$anonfun$apply$mcZ$sp$5.apply$mcV$sp(SparkILoop.scala:974)
   at 
 org.apache.spark.repl.SparkILoopInit$class.runThunks(SparkILoopInit.scala:157)
   at org.apache.spark.repl.SparkILoop.runThunks(SparkILoop.scala:64)
   at 
 org.apache.spark.repl.SparkILoopInit$class.postInitialization(SparkILoopInit.scala:106)
   at 
 org.apache.spark.repl.SparkILoop.postInitialization(SparkILoop.scala:64)
   at 
 org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply$mcZ$sp(SparkILoop.scala:991)
   at 
 org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
   at 
 org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
   at 
 scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
   at 
 org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$process(SparkILoop.scala:945)
   at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1059)
   at 

[jira] [Updated] (SPARK-8162) Run spark-shell cause NullPointerException

2015-06-08 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-8162:
-
Priority: Blocker  (was: Major)

 Run spark-shell cause NullPointerException
 --

 Key: SPARK-8162
 URL: https://issues.apache.org/jira/browse/SPARK-8162
 Project: Spark
  Issue Type: Bug
  Components: Build, Spark Shell
Affects Versions: 1.5.0
Reporter: Weizhong
Priority: Blocker

 Running spark-shell on the latest master branch fails; details:
 {noformat}
 Welcome to
     __
  / __/__  ___ _/ /__
 _\ \/ _ \/ _ `/ __/  '_/
/___/ .__/\_,_/_/ /_/\_\   version 1.5.0-SNAPSHOT
   /_/
 Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_40)
 Type in expressions to have them evaluated.
 Type :help for more information.
 error: error while loading JobProgressListener, Missing dependency 'bad 
 symbolic reference. A signature in JobProgressListener.class refers to term 
 annotations
 in package com.google.common which is not available.
 It may be completely missing from the current classpath, or the version on
 the classpath might be incompatible with the version used when compiling 
 JobProgressListener.class.', required by 
 /opt/apache/spark/lib/spark-assembly-1.5.0-SNAPSHOT-hadoop2.7.0.jar(org/apache/spark/ui/jobs/JobProgressListener.class)
 java.lang.NullPointerException
   at org.apache.spark.sql.SQLContext.init(SQLContext.scala:193)
   at org.apache.spark.sql.hive.HiveContext.init(HiveContext.scala:68)
   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
   at 
 sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
   at 
 sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
   at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
   at 
 org.apache.spark.repl.SparkILoop.createSQLContext(SparkILoop.scala:1028)
   at $iwC$$iwC.init(console:9)
   at $iwC.init(console:18)
   at init(console:20)
   at .init(console:24)
   at .clinit(console)
   at .init(console:7)
   at .clinit(console)
   at $print(console)
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
   at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   at java.lang.reflect.Method.invoke(Method.java:497)
   at 
 org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
   at 
 org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1338)
   at 
 org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
   at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
   at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
   at 
 org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:857)
   at 
 org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902)
   at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:814)
   at 
 org.apache.spark.repl.SparkILoopInit$$anonfun$initializeSpark$1.apply(SparkILoopInit.scala:130)
   at 
 org.apache.spark.repl.SparkILoopInit$$anonfun$initializeSpark$1.apply(SparkILoopInit.scala:122)
   at org.apache.spark.repl.SparkIMain.beQuietDuring(SparkIMain.scala:324)
   at 
 org.apache.spark.repl.SparkILoopInit$class.initializeSpark(SparkILoopInit.scala:122)
   at org.apache.spark.repl.SparkILoop.initializeSpark(SparkILoop.scala:64)
   at 
 org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1$$anonfun$apply$mcZ$sp$5.apply$mcV$sp(SparkILoop.scala:974)
   at 
 org.apache.spark.repl.SparkILoopInit$class.runThunks(SparkILoopInit.scala:157)
   at org.apache.spark.repl.SparkILoop.runThunks(SparkILoop.scala:64)
   at 
 org.apache.spark.repl.SparkILoopInit$class.postInitialization(SparkILoopInit.scala:106)
   at 
 org.apache.spark.repl.SparkILoop.postInitialization(SparkILoop.scala:64)
   at 
 org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply$mcZ$sp(SparkILoop.scala:991)
   at 
 org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
   at 
 org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
   at 
 scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
   at 
 org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$process(SparkILoop.scala:945)
   at 

[jira] [Created] (SPARK-8173) A class which holds all the constants should be present

2015-06-08 Thread sahitya pavurala (JIRA)
sahitya pavurala created SPARK-8173:
---

 Summary: A class which holds all the constants should be present
 Key: SPARK-8173
 URL: https://issues.apache.org/jira/browse/SPARK-8173
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
 Environment: software
Reporter: sahitya pavurala
Priority: Minor


A class which holds all the constants should be present, instead of hardcoding 
them everywhere (similar to MRConstants.java in MapReduce).

All parameter names should be referenced from that class wherever they are used.
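
A minimal sketch of what such a class could look like. This is hypothetical: neither the object name nor its location exist in Spark today, and the keys shown are just examples of strings that are currently repeated as literals.

{code}
// Hypothetical sketch only; the object and its placement are illustrative.
object SparkConstants {
  // Centralised parameter names, referenced instead of repeating string literals.
  val ExecutorMemory = "spark.executor.memory"
  val DriverMemory   = "spark.driver.memory"
  val AppName        = "spark.app.name"
}

// Usage: conf.set(SparkConstants.ExecutorMemory, "4g")
{code}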




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-8170) Ctrl-C in pyspark shell doesn't kill running job

2015-06-08 Thread Ashwin Shankar (JIRA)
Ashwin Shankar created SPARK-8170:
-

 Summary: Ctrl-C in pyspark shell doesn't kill running job
 Key: SPARK-8170
 URL: https://issues.apache.org/jira/browse/SPARK-8170
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 1.3.1
Reporter: Ashwin Shankar


Hitting Ctrl-C in spark-sql (and other tools like Presto) cancels any running 
job and starts a new input line at the prompt. It would be nice if the pyspark 
shell could do that as well. Otherwise, if a user submits a job by mistake and 
wants to cancel it, they need to exit the shell and log back in to continue 
their work. Re-login can be a pain, especially with Spark on YARN, since it 
takes a while to allocate the AM container and the initial executors.
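
For what it's worth, the cancellation half of this can be sketched on the JVM side with a signal handler that calls the existing SparkContext.cancelAllJobs API. The actual PySpark fix would also need to forward Ctrl-C from the Python REPL, so the snippet below is only an illustration, and it assumes sun.misc.Signal is available on the JVM in use:

{code}
import sun.misc.{Signal, SignalHandler}
import org.apache.spark.SparkContext

// Rough sketch: on Ctrl-C, cancel running jobs instead of killing the shell.
def installInterruptHandler(sc: SparkContext): Unit = {
  Signal.handle(new Signal("INT"), new SignalHandler {
    override def handle(sig: Signal): Unit = {
      // Cancels all active jobs; the REPL itself keeps running.
      sc.cancelAllJobs()
    }
  })
}
{code}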



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8171) Support Javascript-based infinite scrolling in Spark log viewers

2015-06-08 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14578114#comment-14578114
 ] 

Josh Rosen commented on SPARK-8171:
---

Note: I don't plan to work on this myself; someone else should feel free to 
take it.

 Support Javascript-based infinite scrolling in Spark log viewers
 

 Key: SPARK-8171
 URL: https://issues.apache.org/jira/browse/SPARK-8171
 Project: Spark
  Issue Type: New Feature
  Components: Web UI
Reporter: Josh Rosen

 It would be cool if the Spark Web UI's log viewers supported infinite 
 scrolling so that I can just click / scroll up to go back in the log.  Maybe 
 there's an off-the-shelf Javascript component for this.
 See SPARK-608, where the log viewer pagination was first introduced.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-8171) Support Javascript-based infinite scrolling in Spark log viewers

2015-06-08 Thread Josh Rosen (JIRA)
Josh Rosen created SPARK-8171:
-

 Summary: Support Javascript-based infinite scrolling in Spark log 
viewers
 Key: SPARK-8171
 URL: https://issues.apache.org/jira/browse/SPARK-8171
 Project: Spark
  Issue Type: New Feature
  Components: Web UI
Reporter: Josh Rosen


It would be cool if the Spark Web UI's log viewers supported infinite scrolling 
so that I can just click / scroll up to go back in the log.  Maybe there's an 
off-the-shelf Javascript component for this.

See SPARK-608, where the log viewer pagination was first introduced.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-8162) Run spark-shell cause NullPointerException

2015-06-08 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or closed SPARK-8162.

  Resolution: Fixed
   Fix Version/s: 1.5.0
  1.4.1
Target Version/s: 1.4.1, 1.5.0

 Run spark-shell cause NullPointerException
 --

 Key: SPARK-8162
 URL: https://issues.apache.org/jira/browse/SPARK-8162
 Project: Spark
  Issue Type: Bug
  Components: Build, Spark Shell
Affects Versions: 1.4.1, 1.5.0
Reporter: Weizhong
Priority: Blocker
 Fix For: 1.4.1, 1.5.0


 Running spark-shell on the latest master branch fails; details:
 {noformat}
 Welcome to
     __
  / __/__  ___ _/ /__
 _\ \/ _ \/ _ `/ __/  '_/
/___/ .__/\_,_/_/ /_/\_\   version 1.5.0-SNAPSHOT
   /_/
 Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_40)
 Type in expressions to have them evaluated.
 Type :help for more information.
 error: error while loading JobProgressListener, Missing dependency 'bad 
 symbolic reference. A signature in JobProgressListener.class refers to term 
 annotations
 in package com.google.common which is not available.
 It may be completely missing from the current classpath, or the version on
 the classpath might be incompatible with the version used when compiling 
 JobProgressListener.class.', required by 
 /opt/apache/spark/lib/spark-assembly-1.5.0-SNAPSHOT-hadoop2.7.0.jar(org/apache/spark/ui/jobs/JobProgressListener.class)
 java.lang.NullPointerException
   at org.apache.spark.sql.SQLContext.init(SQLContext.scala:193)
   at org.apache.spark.sql.hive.HiveContext.init(HiveContext.scala:68)
   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
   at 
 sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
   at 
 sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
   at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
   at 
 org.apache.spark.repl.SparkILoop.createSQLContext(SparkILoop.scala:1028)
   at $iwC$$iwC.init(console:9)
   at $iwC.init(console:18)
   at init(console:20)
   at .init(console:24)
   at .clinit(console)
   at .init(console:7)
   at .clinit(console)
   at $print(console)
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
   at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   at java.lang.reflect.Method.invoke(Method.java:497)
   at 
 org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
   at 
 org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1338)
   at 
 org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
   at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
   at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
   at 
 org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:857)
   at 
 org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902)
   at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:814)
   at 
 org.apache.spark.repl.SparkILoopInit$$anonfun$initializeSpark$1.apply(SparkILoopInit.scala:130)
   at 
 org.apache.spark.repl.SparkILoopInit$$anonfun$initializeSpark$1.apply(SparkILoopInit.scala:122)
   at org.apache.spark.repl.SparkIMain.beQuietDuring(SparkIMain.scala:324)
   at 
 org.apache.spark.repl.SparkILoopInit$class.initializeSpark(SparkILoopInit.scala:122)
   at org.apache.spark.repl.SparkILoop.initializeSpark(SparkILoop.scala:64)
   at 
 org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1$$anonfun$apply$mcZ$sp$5.apply$mcV$sp(SparkILoop.scala:974)
   at 
 org.apache.spark.repl.SparkILoopInit$class.runThunks(SparkILoopInit.scala:157)
   at org.apache.spark.repl.SparkILoop.runThunks(SparkILoop.scala:64)
   at 
 org.apache.spark.repl.SparkILoopInit$class.postInitialization(SparkILoopInit.scala:106)
   at 
 org.apache.spark.repl.SparkILoop.postInitialization(SparkILoop.scala:64)
   at 
 org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply$mcZ$sp(SparkILoop.scala:991)
   at 
 org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
   at 
 org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
   at 
 scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
   at 
 

[jira] [Commented] (SPARK-7886) Add built-in expressions to FunctionRegistry

2015-06-08 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7886?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14578049#comment-14578049
 ] 

Apache Spark commented on SPARK-7886:
-

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/6710

 Add built-in expressions to FunctionRegistry
 

 Key: SPARK-7886
 URL: https://issues.apache.org/jira/browse/SPARK-7886
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Reynold Xin
Priority: Blocker

 Once we do this, we no longer need to hardcode expressions into the parser 
 (both for internal SQL and Hive QL).
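
To illustrate the idea only (the names below are not Spark's actual FunctionRegistry API), a registry that maps function names to expression builders could look roughly like this; the parser would then resolve function calls through a single lookup instead of hardcoding each expression:

{code}
import scala.collection.mutable

// Illustrative sketch; these names do not match Spark's internal FunctionRegistry.
trait Expr                                  // stand-in for an expression tree node
case class Upper(child: Expr) extends Expr  // example "built-in" expression

object SimpleFunctionRegistry {
  private val builders = mutable.Map.empty[String, Seq[Expr] => Expr]

  def register(name: String)(builder: Seq[Expr] => Expr): Unit =
    builders(name.toLowerCase) = builder

  // The parser looks functions up here instead of hardcoding each one.
  def lookup(name: String, args: Seq[Expr]): Expr =
    builders.getOrElse(name.toLowerCase, sys.error(s"undefined function: $name"))(args)
}

// Registration is done once, e.g.:
// SimpleFunctionRegistry.register("upper")(args => Upper(args.head))
{code}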



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8162) Run spark-shell cause NullPointerException

2015-06-08 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-8162:
-
Affects Version/s: 1.4.1

 Run spark-shell cause NullPointerException
 --

 Key: SPARK-8162
 URL: https://issues.apache.org/jira/browse/SPARK-8162
 Project: Spark
  Issue Type: Bug
  Components: Build, Spark Shell
Affects Versions: 1.4.1, 1.5.0
Reporter: Weizhong
Priority: Blocker
 Fix For: 1.4.1, 1.5.0


 Running spark-shell on the latest master branch fails; details:
 {noformat}
 Welcome to
     __
  / __/__  ___ _/ /__
 _\ \/ _ \/ _ `/ __/  '_/
/___/ .__/\_,_/_/ /_/\_\   version 1.5.0-SNAPSHOT
   /_/
 Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_40)
 Type in expressions to have them evaluated.
 Type :help for more information.
 error: error while loading JobProgressListener, Missing dependency 'bad 
 symbolic reference. A signature in JobProgressListener.class refers to term 
 annotations
 in package com.google.common which is not available.
 It may be completely missing from the current classpath, or the version on
 the classpath might be incompatible with the version used when compiling 
 JobProgressListener.class.', required by 
 /opt/apache/spark/lib/spark-assembly-1.5.0-SNAPSHOT-hadoop2.7.0.jar(org/apache/spark/ui/jobs/JobProgressListener.class)
 java.lang.NullPointerException
   at org.apache.spark.sql.SQLContext.init(SQLContext.scala:193)
   at org.apache.spark.sql.hive.HiveContext.init(HiveContext.scala:68)
   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
   at 
 sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
   at 
 sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
   at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
   at 
 org.apache.spark.repl.SparkILoop.createSQLContext(SparkILoop.scala:1028)
   at $iwC$$iwC.init(console:9)
   at $iwC.init(console:18)
   at init(console:20)
   at .init(console:24)
   at .clinit(console)
   at .init(console:7)
   at .clinit(console)
   at $print(console)
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
   at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   at java.lang.reflect.Method.invoke(Method.java:497)
   at 
 org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
   at 
 org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1338)
   at 
 org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
   at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
   at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
   at 
 org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:857)
   at 
 org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902)
   at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:814)
   at 
 org.apache.spark.repl.SparkILoopInit$$anonfun$initializeSpark$1.apply(SparkILoopInit.scala:130)
   at 
 org.apache.spark.repl.SparkILoopInit$$anonfun$initializeSpark$1.apply(SparkILoopInit.scala:122)
   at org.apache.spark.repl.SparkIMain.beQuietDuring(SparkIMain.scala:324)
   at 
 org.apache.spark.repl.SparkILoopInit$class.initializeSpark(SparkILoopInit.scala:122)
   at org.apache.spark.repl.SparkILoop.initializeSpark(SparkILoop.scala:64)
   at 
 org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1$$anonfun$apply$mcZ$sp$5.apply$mcV$sp(SparkILoop.scala:974)
   at 
 org.apache.spark.repl.SparkILoopInit$class.runThunks(SparkILoopInit.scala:157)
   at org.apache.spark.repl.SparkILoop.runThunks(SparkILoop.scala:64)
   at 
 org.apache.spark.repl.SparkILoopInit$class.postInitialization(SparkILoopInit.scala:106)
   at 
 org.apache.spark.repl.SparkILoop.postInitialization(SparkILoop.scala:64)
   at 
 org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply$mcZ$sp(SparkILoop.scala:991)
   at 
 org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
   at 
 org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
   at 
 scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
   at 
 org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$process(SparkILoop.scala:945)
   at 

[jira] [Reopened] (SPARK-8163) CheckPoint mechanism did not work well when error happened in big streaming

2015-06-08 Thread SaintBacchus (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

SaintBacchus reopened SPARK-8163:
-

 CheckPoint mechanism did not work well when error happened in big streaming
 ---

 Key: SPARK-8163
 URL: https://issues.apache.org/jira/browse/SPARK-8163
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Affects Versions: 1.4.0
Reporter: SaintBacchus

 I tested this with a Kafka DStream.
 Sometimes the Kafka producer has pushed a lot of data to the Kafka brokers, and 
 the Streaming receiver then pulls this data without any rate limit.
 For this first batch, Streaming may take 10 or more seconds to consume the 
 data (the batch interval was 2 seconds).
 To describe in more detail what Streaming is doing at that moment:
 the SparkContext was running the job; the JobGenerator was still sending new batches to 
 the StreamingContext, which wrote them to the checkpoint files; and 
 the receiver was still busy receiving data from Kafka and also tracking 
 these events in the checkpoint.
 Then an unexpected error occurred, which shut down the streaming 
 application.
 We then wanted to recover the application from the checkpoint files. But since 
 the StreamingContext had already recorded the next few batches, it recovered 
 from the last batch. So Streaming had already missed the first batch and 
 did not know what data had actually been consumed by the receiver.
 Setting spark.streaming.concurrentJobs=2 could avoid this problem, but some 
 applications cannot do this.
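
For reference, the recovery path under discussion is the usual checkpoint-based restart. A minimal sketch, under the assumption that the Kafka DStream and the output operations are defined inside the factory function (the checkpoint path and app name below are illustrative):

{code}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val checkpointDir = "hdfs:///tmp/streaming-checkpoint"   // illustrative path

def createContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("checkpoint-demo")
  val ssc = new StreamingContext(conf, Seconds(2))
  ssc.checkpoint(checkpointDir)
  // ... set up the Kafka DStream and output operations here ...
  ssc
}

// On restart, state is rebuilt from the checkpoint if one exists.
val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
ssc.start()
ssc.awaitTermination()
{code}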



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8163) CheckPoint mechanism did not work well when error happened in big streaming

2015-06-08 Thread SaintBacchus (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14578161#comment-14578161
 ] 

SaintBacchus commented on SPARK-8163:
-

Hi [~sowen], the description above is exactly how I ran into the problem.
My English may not have been clear, so to restate the steps:
First, the producer pushes a lot of data to the Kafka brokers.
Second, after a while (about 10s), the streaming application is shut down.
Third, it is recovered from the checkpoint files.

The result is that Streaming skips many batches.

I really think this is a big problem, so I am reopening this issue.

 CheckPoint mechanism did not work well when error happened in big streaming
 ---

 Key: SPARK-8163
 URL: https://issues.apache.org/jira/browse/SPARK-8163
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Affects Versions: 1.4.0
Reporter: SaintBacchus

 I tested this with a Kafka DStream.
 Sometimes the Kafka producer has pushed a lot of data to the Kafka brokers, and 
 the Streaming receiver then pulls this data without any rate limit.
 For this first batch, Streaming may take 10 or more seconds to consume the 
 data (the batch interval was 2 seconds).
 To describe in more detail what Streaming is doing at that moment:
 the SparkContext was running the job; the JobGenerator was still sending new batches to 
 the StreamingContext, which wrote them to the checkpoint files; and 
 the receiver was still busy receiving data from Kafka and also tracking 
 these events in the checkpoint.
 Then an unexpected error occurred, which shut down the streaming 
 application.
 We then wanted to recover the application from the checkpoint files. But since 
 the StreamingContext had already recorded the next few batches, it recovered 
 from the last batch. So Streaming had already missed the first batch and 
 did not know what data had actually been consumed by the receiver.
 Setting spark.streaming.concurrentJobs=2 could avoid this problem, but some 
 applications cannot do this.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8170) Ctrl-C in pyspark shell doesn't kill running job

2015-06-08 Thread Ashwin Shankar (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashwin Shankar updated SPARK-8170:
--
Issue Type: Sub-task  (was: Improvement)
Parent: SPARK-7006

 Ctrl-C in pyspark shell doesn't kill running job
 

 Key: SPARK-8170
 URL: https://issues.apache.org/jira/browse/SPARK-8170
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 1.3.1
Reporter: Ashwin Shankar

 Hitting Ctrl-C in spark-sql (and other tools like Presto) cancels any running 
 job and starts a new input line at the prompt. It would be nice if the pyspark 
 shell could do that as well. Otherwise, if a user submits a job by mistake and 
 wants to cancel it, they need to exit the shell and log back in to continue 
 their work. Re-login can be a pain, especially with Spark on YARN, since it 
 takes a while to allocate the AM container and the initial executors.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4809) Improve Guava shading in Spark

2015-06-08 Thread Ronald Chen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14578146#comment-14578146
 ] 

Ronald Chen commented on SPARK-4809:


This change makes no sense.  This breaks SPARK-2848's attempt to move Guava's 
dependency off of the user's class path.

Now I cannot use my own version of guava without these classes conflicting.

 Improve Guava shading in Spark
 --

 Key: SPARK-4809
 URL: https://issues.apache.org/jira/browse/SPARK-4809
 Project: Spark
  Issue Type: Improvement
Affects Versions: 1.2.0
Reporter: Marcelo Vanzin
Assignee: Marcelo Vanzin
 Fix For: 1.3.0


 As part of SPARK-2848, we started shading Guava to help with projects that 
 want to use Spark but use an incompatible version of Guava.
 The approach used there is a little sub-optimal, though. It makes it tricky, 
 especially, to run unit tests in your project when those need to use 
 spark-core APIs.
 We should make the shading more transparent so that it's easier to use 
 spark-core, with or without an explicit Guava dependency.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-3577) Add task metric to report spill time

2015-06-08 Thread Ming Chen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14578240#comment-14578240
 ] 

Ming Chen edited comment on SPARK-3577 at 6/9/15 5:26 AM:
--

Why hasn't this metric been added yet? I think it is rather important, since it 
may affect the results of the research work at 
https://kayousterhout.github.io/trace-analysis/


was (Author: mammothcm):
Why hasn't the metric been added? I think this is rather important; it may 
affect the results of the research work at 
https://kayousterhout.github.io/trace-analysis/

 Add task metric to report spill time
 

 Key: SPARK-3577
 URL: https://issues.apache.org/jira/browse/SPARK-3577
 Project: Spark
  Issue Type: Bug
  Components: Shuffle, Spark Core
Affects Versions: 1.1.0
Reporter: Kay Ousterhout
Assignee: Sandy Ryza
Priority: Minor

 The {{ExternalSorter}} passes its own {{ShuffleWriteMetrics}} into 
 {{ExternalSorter}}.  The write time recorded in those metrics is never used.  
 We should probably add task metrics to report this spill time, since for 
 shuffles, this would have previously been reported as part of shuffle write 
 time (with the original hash-based sorter).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-8174) unix_timestamp

2015-06-08 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-8174:
--

 Summary: unix_timestamp
 Key: SPARK-8174
 URL: https://issues.apache.org/jira/browse/SPARK-8174
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin


3 variants:

unix_timestamp(): long
Gets current Unix timestamp in seconds.

unix_timestamp(string date): long
Converts time string in format yyyy-MM-dd HH:mm:ss to Unix timestamp (in 
seconds), using the default timezone and the default locale, return 0 if fail: 
unix_timestamp('2009-03-20 11:30:01') = 1237573801


unix_timestamp(string date, string pattern): long
Convert time string with given pattern (see 
[http://docs.oracle.com/javase/tutorial/i18n/format/simpleDateFormat.html]) to 
Unix time stamp (in seconds), return 0 if fail: unix_timestamp('2009-03-20', 
'yyyy-MM-dd') = 1237532400.


See: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF
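
For reference, the intended semantics of the three variants can be mirrored with plain java.text.SimpleDateFormat. This is a behavioural sketch only, not the Spark implementation:

{code}
import java.text.SimpleDateFormat

// unix_timestamp(): current Unix timestamp in seconds.
def unixTimestamp(): Long = System.currentTimeMillis() / 1000L

// unix_timestamp(date): parse "yyyy-MM-dd HH:mm:ss" in the default timezone/locale.
def unixTimestamp(date: String): Long = unixTimestamp(date, "yyyy-MM-dd HH:mm:ss")

// unix_timestamp(date, pattern): parse with an explicit pattern; return 0 on failure.
def unixTimestamp(date: String, pattern: String): Long =
  try new SimpleDateFormat(pattern).parse(date).getTime / 1000L
  catch { case _: java.text.ParseException => 0L }

// unixTimestamp("2009-03-20 11:30:01")        parses in the local time zone
// unixTimestamp("2009-03-20", "yyyy-MM-dd")   likewise
{code}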



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8174) unix_timestamp

2015-06-08 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-8174:
---
Description: 
3 variants:

{code}
unix_timestamp(): long
Gets current Unix timestamp in seconds.

unix_timestamp(string date): long
Converts time string in format yyyy-MM-dd HH:mm:ss to Unix timestamp (in 
seconds), using the default timezone and the default locale, return 0 if fail: 
unix_timestamp('2009-03-20 11:30:01') = 1237573801


unix_timestamp(string date, string pattern): long
Convert time string with given pattern (see 
[http://docs.oracle.com/javase/tutorial/i18n/format/simpleDateFormat.html]) to 
Unix time stamp (in seconds), return 0 if fail: unix_timestamp('2009-03-20', 
'yyyy-MM-dd') = 1237532400.
{code}

See: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF

  was:
3 variants:

unix_timestamp(): long
Gets current Unix timestamp in seconds.

unix_timestamp(string date): long
Converts time string in format yyyy-MM-dd HH:mm:ss to Unix timestamp (in 
seconds), using the default timezone and the default locale, return 0 if fail: 
unix_timestamp('2009-03-20 11:30:01') = 1237573801


unix_timestamp(string date, string pattern): long
Convert time string with given pattern (see 
[http://docs.oracle.com/javase/tutorial/i18n/format/simpleDateFormat.html]) to 
Unix time stamp (in seconds), return 0 if fail: unix_timestamp('2009-03-20', 
'yyyy-MM-dd') = 1237532400.


See: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF


 unix_timestamp
 --

 Key: SPARK-8174
 URL: https://issues.apache.org/jira/browse/SPARK-8174
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin

 3 variants:
 {code}
 unix_timestamp(): long
 Gets current Unix timestamp in seconds.
 unix_timestamp(string date): long
 Converts time string in format yyyy-MM-dd HH:mm:ss to Unix timestamp (in 
 seconds), using the default timezone and the default locale, return 0 if 
 fail: unix_timestamp('2009-03-20 11:30:01') = 1237573801
 unix_timestamp(string date, string pattern): long
 Convert time string with given pattern (see 
 [http://docs.oracle.com/javase/tutorial/i18n/format/simpleDateFormat.html]) 
 to Unix time stamp (in seconds), return 0 if fail: 
 unix_timestamp('2009-03-20', 'yyyy-MM-dd') = 1237532400.
 {code}
 See: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8175) from_unixtime function

2015-06-08 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-8175:
---
Summary: from_unixtime function  (was: from_unixtime expression)

 from_unixtime function
 --

 Key: SPARK-8175
 URL: https://issues.apache.org/jira/browse/SPARK-8175
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin

 from_unixtime(bigint unixtime[, string format]): string
 Converts the number of seconds from unix epoch (1970-01-01 00:00:00 UTC) to a 
 string representing the timestamp of that moment in the current system time 
 zone in the format of 1970-01-01 00:00:00.
 See https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF
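
 For reference, the behaviour described above can be mirrored with plain java.text.SimpleDateFormat; a sketch of the semantics, not the Spark implementation:

{code}
import java.text.SimpleDateFormat
import java.util.Date

// Behavioural sketch: seconds since the epoch -> string in the current system
// time zone; not the Spark implementation.
def fromUnixtime(unixtime: Long, format: String = "yyyy-MM-dd HH:mm:ss"): String =
  new SimpleDateFormat(format).format(new Date(unixtime * 1000L))

// fromUnixtime(0L) gives "1970-01-01 00:00:00" when the system time zone is UTC.
{code}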



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-8180) day / dayofmonth function

2015-06-08 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-8180:
--

 Summary: day / dayofmonth function
 Key: SPARK-8180
 URL: https://issues.apache.org/jira/browse/SPARK-8180
 Project: Spark
  Issue Type: Sub-task
Reporter: Reynold Xin


day(string date): int
dayofmonth(date): int

Returns the day part of a date or a timestamp string: day(1970-11-01 
00:00:00) = 1, day(1970-11-01) = 1.
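
The intended behaviour amounts to parsing the date and extracting the day-of-month field; a behavioural sketch with java.util.Calendar (not the Spark implementation), which also covers the year/month/hour/minute/second siblings by swapping the Calendar field:

{code}
import java.text.SimpleDateFormat
import java.util.Calendar

// Behavioural sketch: parse the leading "yyyy-MM-dd" portion and return the
// day-of-month field; not the Spark implementation.
def dayOfMonth(date: String): Int = {
  val cal = Calendar.getInstance()
  cal.setTime(new SimpleDateFormat("yyyy-MM-dd").parse(date))
  cal.get(Calendar.DAY_OF_MONTH)
}

// dayOfMonth("1970-11-01") == 1; dayOfMonth("1970-11-01 00:00:00") == 1
{code}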





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8180) day / dayofmonth function

2015-06-08 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-8180:
---
Description: 
day(string date): int
dayofmonth(date): int

Returns the day part of a date or a timestamp string: day(1970-11-01 
00:00:00) = 1, day(1970-11-01) = 1.

See https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF


  was:
day(string date): int
dayofmonth(date): int

Returns the day part of a date or a timestamp string: day(1970-11-01 
00:00:00) = 1, day(1970-11-01) = 1.




 day / dayofmonth function
 -

 Key: SPARK-8180
 URL: https://issues.apache.org/jira/browse/SPARK-8180
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin

 day(string date): int
 dayofmonth(date): int
 Returns the day part of a date or a timestamp string: day(1970-11-01 
 00:00:00) = 1, day(1970-11-01) = 1.
 See https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8178) quarter function

2015-06-08 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-8178:
---
Description: 
quarter(string|date|timestamp): int

Returns the quarter of the year for a date, timestamp, or string in the range 1 
to 4. Example: quarter('2015-04-08') = 2.

See https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF
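
The quarter can be derived from the month field; a behavioural sketch (not the Spark implementation):

{code}
import java.text.SimpleDateFormat
import java.util.Calendar

// Behavioural sketch: months 1-3 -> 1, 4-6 -> 2, 7-9 -> 3, 10-12 -> 4.
def quarter(date: String): Int = {
  val cal = Calendar.getInstance()
  cal.setTime(new SimpleDateFormat("yyyy-MM-dd").parse(date))
  cal.get(Calendar.MONTH) / 3 + 1   // Calendar.MONTH is zero-based
}

// quarter("2015-04-08") == 2
{code}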

  was:
quarter(date/timestamp/string): int

Returns the quarter of the year for a date, timestamp, or string in the range 1 
to 4. Example: quarter('2015-04-08') = 2.

See https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF


 quarter function
 

 Key: SPARK-8178
 URL: https://issues.apache.org/jira/browse/SPARK-8178
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin

 quarter(string|date|timestamp): int
 Returns the quarter of the year for a date, timestamp, or string in the range 
 1 to 4. Example: quarter('2015-04-08') = 2.
 See https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8177) year function

2015-06-08 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-8177:
---
Description: 
year(string|date|timestamp): int

Returns the year part of a date or a timestamp string: year(1970-01-01 
00:00:00) = 1970, year(1970-01-01) = 1970.

See https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF

  was:
year(string date): int

Returns the year part of a date or a timestamp string: year(1970-01-01 
00:00:00) = 1970, year(1970-01-01) = 1970.

See https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF


 year function
 -

 Key: SPARK-8177
 URL: https://issues.apache.org/jira/browse/SPARK-8177
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin

 year(string|date|timestamp): int
 Returns the year part of a date or a timestamp string: year(1970-01-01 
 00:00:00) = 1970, year(1970-01-01) = 1970.
 See https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8180) day / dayofmonth function

2015-06-08 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-8180:
---
Description: 
day(string|date|timestamp): int
dayofmonth(string|date|timestamp): int

Returns the day part of a date or a timestamp string: day(1970-11-01 
00:00:00) = 1, day(1970-11-01) = 1.

See https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF


  was:
day(string date): int
dayofmonth(date): int

Returns the day part of a date or a timestamp string: day(1970-11-01 
00:00:00) = 1, day(1970-11-01) = 1.

See https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF



 day / dayofmonth function
 -

 Key: SPARK-8180
 URL: https://issues.apache.org/jira/browse/SPARK-8180
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin

 day(string|date|timestamp): int
 dayofmonth(string|date|timestamp): int
 Returns the day part of a date or a timestamp string: day(1970-11-01 
 00:00:00) = 1, day(1970-11-01) = 1.
 See https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8181) hour function

2015-06-08 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-8181:
---
Description: 
hour(string|date|timestamp): int

Returns the hour of the timestamp: hour('2009-07-30 12:58:59') = 12, 
hour('12:58:59') = 12.


  was:
hour(string date): int

Returns the hour of the timestamp: hour('2009-07-30 12:58:59') = 12, 
hour('12:58:59') = 12.



 hour function
 -

 Key: SPARK-8181
 URL: https://issues.apache.org/jira/browse/SPARK-8181
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin

 hour(string|date|timestamp): int
 Returns the hour of the timestamp: hour('2009-07-30 12:58:59') = 12, 
 hour('12:58:59') = 12.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8179) month function

2015-06-08 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-8179:
---
Description: 
month(string|date|timestamp): int

Returns the month part of a date or a timestamp string: month(1970-11-01 
00:00:00) = 11, month(1970-11-01) = 11.

See https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF

  was:
month(string date): int

Returns the month part of a date or a timestamp string: month(1970-11-01 
00:00:00) = 11, month(1970-11-01) = 11.

See https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF


 month function
 --

 Key: SPARK-8179
 URL: https://issues.apache.org/jira/browse/SPARK-8179
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin

 month(string|date|timestamp): int
 Returns the month part of a date or a timestamp string: month(1970-11-01 
 00:00:00) = 11, month(1970-11-01) = 11.
 See https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8182) minute function

2015-06-08 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-8182:
---
Description: 
minute(string|date|timestamp): int

Returns the minute of the timestamp.

See https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF

  was:
minute(string date): int

Returns the minute of the timestamp.

See https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF


 minute function
 ---

 Key: SPARK-8182
 URL: https://issues.apache.org/jira/browse/SPARK-8182
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin

 minute(string|date|timestamp): int
 Returns the minute of the timestamp.
 See https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8176) to_date function

2015-06-08 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-8176:
---
Description: 
to_date(date|timestamp): date
to_date(string): string

Returns the date part of a timestamp string: to_date(1970-01-01 00:00:00) = 
1970-01-01.

See https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF
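
For the string case, the behaviour amounts to re-formatting the parsed timestamp with only its date part; a behavioural sketch (not the Spark implementation):

{code}
import java.text.SimpleDateFormat

// Behavioural sketch: keep only the date part of a timestamp string.
def toDate(timestamp: String): String = {
  val parsed = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss").parse(timestamp)
  new SimpleDateFormat("yyyy-MM-dd").format(parsed)
}

// toDate("1970-01-01 00:00:00") == "1970-01-01"
{code}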



  was:
to_date(string|date|timestamp): string

Returns the date part of a timestamp string: to_date(1970-01-01 00:00:00) = 
1970-01-01.

See https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF


 to_date function
 

 Key: SPARK-8176
 URL: https://issues.apache.org/jira/browse/SPARK-8176
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin

 to_date(date|timestamp): date
 to_date(string): string
 Returns the date part of a timestamp string: to_date(1970-01-01 00:00:00) = 
 1970-01-01.
 See https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-8183) second function

2015-06-08 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-8183:
--

 Summary: second function
 Key: SPARK-8183
 URL: https://issues.apache.org/jira/browse/SPARK-8183
 Project: Spark
  Issue Type: Sub-task
Reporter: Reynold Xin


second(string date): int

Returns the second of the timestamp.

See https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8183) second function

2015-06-08 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-8183:
---
Target Version/s: 1.5.0

 second function
 ---

 Key: SPARK-8183
 URL: https://issues.apache.org/jira/browse/SPARK-8183
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin

 second(string date): int
 Returns the second of the timestamp.
 See https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-8181) hour function

2015-06-08 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-8181:
--

 Summary: hour function
 Key: SPARK-8181
 URL: https://issues.apache.org/jira/browse/SPARK-8181
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin


hour(string date): int

Returns the hour of the timestamp: hour('2009-07-30 12:58:59') = 12, 
hour('12:58:59') = 12.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-8182) minute function

2015-06-08 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-8182:
--

 Summary: minute function
 Key: SPARK-8182
 URL: https://issues.apache.org/jira/browse/SPARK-8182
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin


minute(string date): int

Returns the minute of the timestamp.

See https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8186) date_add function

2015-06-08 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-8186:
---
Target Version/s: 1.5.0

 date_add function
 -

 Key: SPARK-8186
 URL: https://issues.apache.org/jira/browse/SPARK-8186
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin

 date_add(string startdate, int days): string
 date_add(date startdate, int days): date
 Adds a number of days to startdate: date_add('2008-12-31', 1) = '2009-01-01'.
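
 For reference, the intended semantics can be mirrored with java.util.Calendar; a behavioural sketch, not the Spark implementation:

{code}
import java.text.SimpleDateFormat
import java.util.Calendar

// Behavioural sketch: add a number of days to a "yyyy-MM-dd" date string.
def dateAdd(startdate: String, days: Int): String = {
  val fmt = new SimpleDateFormat("yyyy-MM-dd")
  val cal = Calendar.getInstance()
  cal.setTime(fmt.parse(startdate))
  cal.add(Calendar.DATE, days)
  fmt.format(cal.getTime)
}

// dateAdd("2008-12-31", 1) == "2009-01-01"
{code}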



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3577) Add task metric to report spill time

2015-06-08 Thread Ming Chen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14578240#comment-14578240
 ] 

Ming Chen commented on SPARK-3577:
--

Why hasn't this metric been added? I think it is rather important; it may 
affect the results of the research work at 
https://kayousterhout.github.io/trace-analysis/

 Add task metric to report spill time
 

 Key: SPARK-3577
 URL: https://issues.apache.org/jira/browse/SPARK-3577
 Project: Spark
  Issue Type: Bug
  Components: Shuffle, Spark Core
Affects Versions: 1.1.0
Reporter: Kay Ousterhout
Assignee: Sandy Ryza
Priority: Minor

 The {{ExternalSorter}} passes its own {{ShuffleWriteMetrics}} into 
 {{ExternalSorter}}.  The write time recorded in those metrics is never used.  
 We should probably add task metrics to report this spill time, since for 
 shuffles, this would have previously been reported as part of shuffle write 
 time (with the original hash-based sorter).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7886) Add built-in expressions to FunctionRegistry

2015-06-08 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7886?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14578335#comment-14578335
 ] 

Apache Spark commented on SPARK-7886:
-

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/6712

 Add built-in expressions to FunctionRegistry
 

 Key: SPARK-7886
 URL: https://issues.apache.org/jira/browse/SPARK-7886
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Reynold Xin
Priority: Blocker

 Once we do this, we no longer need to hardcode expressions into the parser 
 (both for internal SQL and Hive QL).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8174) unix_timestamp expression

2015-06-08 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-8174:
---
Summary: unix_timestamp expression  (was: unix_timestamp)

 unix_timestamp expression
 -

 Key: SPARK-8174
 URL: https://issues.apache.org/jira/browse/SPARK-8174
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin

 3 variants:
 {code}
 unix_timestamp(): long
 Gets current Unix timestamp in seconds.
 unix_timestamp(string date): long
 Converts time string in format yyyy-MM-dd HH:mm:ss to Unix timestamp (in 
 seconds), using the default timezone and the default locale, return 0 if 
 fail: unix_timestamp('2009-03-20 11:30:01') = 1237573801
 unix_timestamp(string date, string pattern): long
 Convert time string with given pattern (see 
 [http://docs.oracle.com/javase/tutorial/i18n/format/simpleDateFormat.html]) 
 to Unix time stamp (in seconds), return 0 if fail: 
 unix_timestamp('2009-03-20', 'yyyy-MM-dd') = 1237532400.
 {code}
 See: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8175) from_unixtime expression

2015-06-08 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-8175:
---
Summary: from_unixtime expression  (was: from_unixtime)

 from_unixtime expression
 

 Key: SPARK-8175
 URL: https://issues.apache.org/jira/browse/SPARK-8175
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin

 from_unixtime(bigint unixtime[, string format]): string
 Converts the number of seconds from unix epoch (1970-01-01 00:00:00 UTC) to a 
 string representing the timestamp of that moment in the current system time 
 zone in the format of 1970-01-01 00:00:00.
 See https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-8175) from_unixtime

2015-06-08 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-8175:
--

 Summary: from_unixtime
 Key: SPARK-8175
 URL: https://issues.apache.org/jira/browse/SPARK-8175
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin


from_unixtime(bigint unixtime[, string format]): string

Converts the number of seconds from unix epoch (1970-01-01 00:00:00 UTC) to a 
string representing the timestamp of that moment in the current system time 
zone in the format of 1970-01-01 00:00:00.

See https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF
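A minimal sketch of the intended Hive-compatible behaviour (assuming a UTC system time
zone; not part of the original ticket):

{code}
sqlContext.sql("SELECT from_unixtime(0)").collect()
// => Array([1970-01-01 00:00:00])
sqlContext.sql("SELECT from_unixtime(0, 'yyyy-MM-dd')").collect()
// => Array([1970-01-01])
{code}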



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8174) unix_timestamp function

2015-06-08 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-8174:
---
Description: 
3 variants:

{code}
unix_timestamp(): long
Gets current Unix timestamp in seconds.

unix_timestamp(string|date): long
Converts time string in format yyyy-MM-dd HH:mm:ss to Unix timestamp (in 
seconds), using the default timezone and the default locale, return 0 if fail: 
unix_timestamp('2009-03-20 11:30:01') = 1237573801


unix_timestamp(string date, string pattern): long
Convert time string with given pattern (see 
[http://docs.oracle.com/javase/tutorial/i18n/format/simpleDateFormat.html]) to 
Unix time stamp (in seconds), return 0 if fail: unix_timestamp('2009-03-20', 
'yyyy-MM-dd') = 1237532400.
{code}

See: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF

  was:
3 variants:

{code}
unix_timestamp(): long
Gets current Unix timestamp in seconds.

unix_timestamp(string date): long
Converts time string in format yyyy-MM-dd HH:mm:ss to Unix timestamp (in 
seconds), using the default timezone and the default locale, return 0 if fail: 
unix_timestamp('2009-03-20 11:30:01') = 1237573801


unix_timestamp(string date, string pattern): long
Convert time string with given pattern (see 
[http://docs.oracle.com/javase/tutorial/i18n/format/simpleDateFormat.html]) to 
Unix time stamp (in seconds), return 0 if fail: unix_timestamp('2009-03-20', 
'yyyy-MM-dd') = 1237532400.
{code}

See: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF


 unix_timestamp function
 ---

 Key: SPARK-8174
 URL: https://issues.apache.org/jira/browse/SPARK-8174
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin

 3 variants:
 {code}
 unix_timestamp(): long
 Gets current Unix timestamp in seconds.
 unix_timestamp(string|date): long
 Converts time string in format yyyy-MM-dd HH:mm:ss to Unix timestamp (in 
 seconds), using the default timezone and the default locale, return 0 if 
 fail: unix_timestamp('2009-03-20 11:30:01') = 1237573801
 unix_timestamp(string date, string pattern): long
 Convert time string with given pattern (see 
 [http://docs.oracle.com/javase/tutorial/i18n/format/simpleDateFormat.html]) 
 to Unix time stamp (in seconds), return 0 if fail: 
 unix_timestamp('2009-03-20', 'yyyy-MM-dd') = 1237532400.
 {code}
 See: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8174) unix_timestamp function

2015-06-08 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-8174:
---
Summary: unix_timestamp function  (was: unix_timestamp expression)

 unix_timestamp function
 ---

 Key: SPARK-8174
 URL: https://issues.apache.org/jira/browse/SPARK-8174
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin

 3 variants:
 {code}
 unix_timestamp(): long
 Gets current Unix timestamp in seconds.
 unix_timestamp(string date): long
 Converts time string in format yyyy-MM-dd HH:mm:ss to Unix timestamp (in 
 seconds), using the default timezone and the default locale, return 0 if 
 fail: unix_timestamp('2009-03-20 11:30:01') = 1237573801
 unix_timestamp(string date, string pattern): long
 Convert time string with given pattern (see 
 [http://docs.oracle.com/javase/tutorial/i18n/format/simpleDateFormat.html]) 
 to Unix time stamp (in seconds), return 0 if fail: 
 unix_timestamp('2009-03-20', 'yyyy-MM-dd') = 1237532400.
 {code}
 See: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-8176) to_date function

2015-06-08 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-8176:
--

 Summary: to_date function
 Key: SPARK-8176
 URL: https://issues.apache.org/jira/browse/SPARK-8176
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin


to_date(string timestamp): string

Returns the date part of a timestamp string: to_date("1970-01-01 00:00:00") = 
"1970-01-01".

See https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF
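As a sketch of the intended usage once the expression is added (value from the
description above):

{code}
sqlContext.sql("SELECT to_date('1970-01-01 00:00:00')").collect()
// => Array([1970-01-01])
{code}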



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-8177) year function

2015-06-08 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-8177:
--

 Summary: year function
 Key: SPARK-8177
 URL: https://issues.apache.org/jira/browse/SPARK-8177
 Project: Spark
  Issue Type: Sub-task
Reporter: Reynold Xin


year(string date): int

Returns the year part of a date or a timestamp string: year("1970-01-01 00:00:00") = 1970, 
year("1970-01-01") = 1970.

See https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF
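A sketch of the intended usage (values from the description above; the expression was not
yet implemented when this ticket was filed):

{code}
sqlContext.sql("SELECT year('1970-01-01 00:00:00'), year('1970-01-01')").collect()
// => Array([1970,1970])
{code}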



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-8168) Add Python friendly constructor to PipelineModel

2015-06-08 Thread DB Tsai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DB Tsai closed SPARK-8168.
--
Resolution: Fixed

Issue resolved by pull request https://github.com/apache/spark/pull/6709

 Add Python friendly constructor to PipelineModel
 

 Key: SPARK-8168
 URL: https://issues.apache.org/jira/browse/SPARK-8168
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 1.4.0
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng

 We are trying to migrate all Python implementations of Pipeline components to 
 Scala. As part of this effort, PipelineModel should have a Python-friendly 
 constructor.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-6820) Convert NAs to null type in SparkR DataFrames

2015-06-08 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman resolved SPARK-6820.
--
   Resolution: Fixed
Fix Version/s: 1.4.1
   1.5.0

Issue resolved by pull request 6190
[https://github.com/apache/spark/pull/6190]

 Convert NAs to null type in SparkR DataFrames
 -

 Key: SPARK-6820
 URL: https://issues.apache.org/jira/browse/SPARK-6820
 Project: Spark
  Issue Type: New Feature
  Components: SparkR
Reporter: Shivaram Venkataraman
Assignee: Qian Huang
Priority: Critical
 Fix For: 1.5.0, 1.4.1


 While converting RDD or local R DataFrame to a SparkR DataFrame we need to 
 handle missing values or NAs.
 We should convert NAs to SparkSQL's null type to handle the conversion 
 correctly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8177) year function

2015-06-08 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-8177:
---
Target Version/s: 1.5.0

 year function
 -

 Key: SPARK-8177
 URL: https://issues.apache.org/jira/browse/SPARK-8177
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin

 year(string date): int
 Returns the year part of a date or a timestamp string: year("1970-01-01 00:00:00") = 1970, 
 year("1970-01-01") = 1970.
 See https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-8179) month function

2015-06-08 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-8179:
--

 Summary: month function
 Key: SPARK-8179
 URL: https://issues.apache.org/jira/browse/SPARK-8179
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin


month(string date): int

Returns the month part of a date or a timestamp string: month("1970-11-01 00:00:00") = 11, 
month("1970-11-01") = 11.

See https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF
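A sketch of the intended usage (values from the description above):

{code}
sqlContext.sql("SELECT month('1970-11-01 00:00:00'), month('1970-11-01')").collect()
// => Array([11,11])
{code}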



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-8178) quarter function

2015-06-08 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-8178:
--

 Summary: quarter function
 Key: SPARK-8178
 URL: https://issues.apache.org/jira/browse/SPARK-8178
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin


quarter(date/timestamp/string): int

Returns the quarter of the year for a date, timestamp, or string in the range 1 
to 4. Example: quarter('2015-04-08') = 2.

See https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF
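A sketch of the intended usage (value from the description above):

{code}
sqlContext.sql("SELECT quarter('2015-04-08')").collect()
// => Array([2])
{code}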



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8176) to_date function

2015-06-08 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-8176:
---
Description: 
to_date(string|date|timestamp): string

Returns the date part of a timestamp string: to_date("1970-01-01 00:00:00") = 
"1970-01-01".

See https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF

  was:
to_date(string timestamp): string

Returns the date part of a timestamp string: to_date("1970-01-01 00:00:00") = 
"1970-01-01".

See https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF


 to_date function
 

 Key: SPARK-8176
 URL: https://issues.apache.org/jira/browse/SPARK-8176
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin

 to_date(string|date|timestamp): string
 Returns the date part of a timestamp string: to_date("1970-01-01 00:00:00") = 
 "1970-01-01".
 See https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-8184) weekofyear function

2015-06-08 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-8184:
--

 Summary: weekofyear function
 Key: SPARK-8184
 URL: https://issues.apache.org/jira/browse/SPARK-8184
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin


weekofyear(string|date|timestamp): int

Returns the week number of a timestamp string: weekofyear("1970-11-01 00:00:00") = 44, 
weekofyear("1970-11-01") = 44.
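A sketch of the intended usage (values from the description above):

{code}
sqlContext.sql("SELECT weekofyear('1970-11-01 00:00:00'), weekofyear('1970-11-01')").collect()
// => Array([44,44])
{code}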




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8183) second function

2015-06-08 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-8183:
---
Description: 
second(string|date|timestamp): int

Returns the second of the timestamp.

See https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF

  was:
second(string date): int

Returns the second of the timestamp.

See https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF


 second function
 ---

 Key: SPARK-8183
 URL: https://issues.apache.org/jira/browse/SPARK-8183
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin

 second(string|date|timestamp): int
 Returns the second of the timestamp.
 See https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-8185) datediff function

2015-06-08 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-8185:
--

 Summary: datediff function
 Key: SPARK-8185
 URL: https://issues.apache.org/jira/browse/SPARK-8185
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin


datediff(date enddate, date startdate): int

Returns the number of days from startdate to enddate: datediff('2009-03-01', 
'2009-02-27') = 2.


See https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF
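A sketch of the intended usage (value from the description above):

{code}
sqlContext.sql("SELECT datediff('2009-03-01', '2009-02-27')").collect()
// => Array([2])
{code}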




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-8186) date_add function

2015-06-08 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-8186:
--

 Summary: date_add function
 Key: SPARK-8186
 URL: https://issues.apache.org/jira/browse/SPARK-8186
 Project: Spark
  Issue Type: Sub-task
Reporter: Reynold Xin


date_add(string startdate, int days): string

date_add(date startdate, int days): date


Adds a number of days to startdate: date_add('2008-12-31', 1) = '2009-01-01'.
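A sketch of the intended usage (value from the description above):

{code}
sqlContext.sql("SELECT date_add('2008-12-31', 1)").collect()
// => Array([2009-01-01])
{code}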




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8155) support of proxy user not working in spaced username on windows

2015-06-08 Thread Kaveen Raajan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kaveen Raajan updated SPARK-8155:
-
Fix Version/s: (was: 1.3.0)

 support of proxy user not working in spaced username on windows
 ---

 Key: SPARK-8155
 URL: https://issues.apache.org/jira/browse/SPARK-8155
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core
Affects Versions: 1.3.1
 Environment: windows-8/7/server 2008
 Hadoop-2.5.2
 Java-1.7.51
 username - kaveen raajan
Reporter: Kaveen Raajan

 I'm using Spark 1.3.1 on a Windows machine whose username contains a space (current 
 username: kaveen raajan). I tried to run the following command:
  {code}spark-shell --master yarn-client --proxy-user SYSTEM {code}
 It runs successfully for a username without a space (the application also runs as the 
 SYSTEM user), but when I run it as the spaced user (kaveen raajan) it throws the 
 following error.
 {code}
 15/06/05 16:52:48 INFO spark.SecurityManager: Changing view acls to: SYSTEM
 15/06/05 16:52:48 INFO spark.SecurityManager: Changing modify acls to: SYSTEM
 15/06/05 16:52:48 INFO spark.SecurityManager: SecurityManager: authentication 
 di
 sabled; ui acls disabled; users with view permissions: Set(SYSTEM); users 
 with m
 odify permissions: Set(SYSTEM)
 15/06/05 16:52:49 INFO slf4j.Slf4jLogger: Slf4jLogger started
 15/06/05 16:52:49 INFO Remoting: Starting remoting
 15/06/05 16:52:49 INFO Remoting: Remoting started; listening on addresses 
 :[akka
 .tcp://sparkDriver@Master:52137]
 15/06/05 16:52:49 INFO util.Utils: Successfully started service 'sparkDriver' 
 on
  port 52137.
 15/06/05 16:52:49 INFO spark.SparkEnv: Registering MapOutputTracker
 15/06/05 16:52:49 INFO spark.SparkEnv: Registering BlockManagerMaster
 15/06/05 16:52:49 INFO storage.DiskBlockManager: Created local directory at 
 C:\U
 sers\KAVEEN~1\AppData\Local\Temp\spark-d5b43891-274c-457d-aa3a-d79a536fd536\bloc
 kmgr-e980101b-4f93-455a-8a05-9185dcab9f8e
 15/06/05 16:52:49 INFO storage.MemoryStore: MemoryStore started with capacity 
 26
 5.4 MB
 15/06/05 16:52:49 INFO spark.HttpFileServer: HTTP File server directory is 
 C:\Us
 ers\KAVEEN~1\AppData\Local\Temp\spark-a35e3f17-641c-4ae3-90f2-51eac901b799\httpd
 -ecea93ad-c285-4c62-9222-01a9d6ff24e4
 15/06/05 16:52:49 INFO spark.HttpServer: Starting HTTP Server
 15/06/05 16:52:49 INFO server.Server: jetty-8.y.z-SNAPSHOT
 15/06/05 16:52:49 INFO server.AbstractConnector: Started 
 SocketConnector@0.0.0.0
 :52138
 15/06/05 16:52:49 INFO util.Utils: Successfully started service 'HTTP file 
 serve
 r' on port 52138.
 15/06/05 16:52:49 INFO spark.SparkEnv: Registering OutputCommitCoordinator
 15/06/05 16:52:49 INFO server.Server: jetty-8.y.z-SNAPSHOT
 15/06/05 16:52:49 INFO server.AbstractConnector: Started 
 SelectChannelConnector@
 0.0.0.0:4040
 15/06/05 16:52:49 INFO util.Utils: Successfully started service 'SparkUI' on 
 por
 t 4040.
 15/06/05 16:52:49 INFO ui.SparkUI: Started SparkUI at http://Master:4040
 15/06/05 16:52:49 INFO client.RMProxy: Connecting to ResourceManager at 
 /0.0.0.0
 :8032
 java.lang.NullPointerException
 at org.apache.spark.sql.SQLContext.<init>(SQLContext.scala:145)
 at org.apache.spark.sql.hive.HiveContext.<init>(HiveContext.scala:49)
 at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
 at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
 at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
 at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
 at org.apache.spark.repl.SparkILoop.createSQLContext(SparkILoop.scala:1027)
 at $iwC$$iwC.<init>(<console>:9)
 at $iwC.<init>(<console>:18)
 at <init>(<console>:20)
 at .<init>(<console>:24)
 at .<clinit>(<console>)
 at .<init>(<console>:7)
 at .<clinit>(<console>)
 at $print(<console>)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.
 java:57)
 at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAcces
 sorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:606)
 at 
 org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:
 1065)
 at 
 org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:
 1338)
 at 
 org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840
 )
 at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
 at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
 at 
 org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:8
 56)
  

[jira] [Commented] (SPARK-8072) Better AnalysisException for writing DataFrame with identically named columns

2015-06-08 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14576639#comment-14576639
 ] 

Reynold Xin commented on SPARK-8072:


Yes - adding a rule in CheckAnalysis.scala to check the column names would work.
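For illustration, a rough sketch of what such a check could look like, as it would live
inside the org.apache.spark.sql package where AnalysisException's constructor is visible
(this is not the actual rule added to CheckAnalysis.scala; the helper name is hypothetical):

{code}
import org.apache.spark.sql.AnalysisException
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan

// Before writing, group the plan's output attributes by (case-insensitive) name and
// fail with an explicit message when any name occurs more than once.
def checkDuplicateColumns(plan: LogicalPlan): Unit = {
  val duplicated = plan.output
    .groupBy(_.name.toLowerCase)
    .collect { case (name, attrs) if attrs.size > 1 => name }
  if (duplicated.nonEmpty) {
    throw new AnalysisException(
      s"Found duplicate column(s) when writing: ${duplicated.mkString(", ")}")
  }
}
{code}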


 Better AnalysisException for writing DataFrame with identically named columns
 -

 Key: SPARK-8072
 URL: https://issues.apache.org/jira/browse/SPARK-8072
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Priority: Blocker

 We should check if there are duplicate columns, and if yes, throw an explicit 
 error message saying there are duplicate columns. See current error message 
 below. 
 {code}
 In [3]: df.withColumn('age', df.age)
 Out[3]: DataFrame[age: bigint, name: string, age: bigint]
 In [4]: df.withColumn('age', df.age).write.parquet('test-parquet.out')
 ---
 Py4JJavaError Traceback (most recent call last)
 <ipython-input-4-eecb85256898> in <module>()
 ----> 1 df.withColumn('age', df.age).write.parquet('test-parquet.out')
 /scratch/rxin/spark/python/pyspark/sql/readwriter.py in parquet(self, path, 
 mode)
 350  df.write.parquet(os.path.join(tempfile.mkdtemp(), 'data'))
 351 
 --> 352 self._jwrite.mode(mode).parquet(path)
 353 
 354 @since(1.4)
 /Users/rxin/anaconda/lib/python2.7/site-packages/py4j-0.8.1-py2.7.egg/py4j/java_gateway.pyc
  in __call__(self, *args)
 535 answer = self.gateway_client.send_command(command)
 536 return_value = get_return_value(answer, self.gateway_client,
 --> 537 self.target_id, self.name)
 538 
 539 for temp_arg in temp_args:
 /Users/rxin/anaconda/lib/python2.7/site-packages/py4j-0.8.1-py2.7.egg/py4j/protocol.pyc
  in get_return_value(answer, gateway_client, target_id, name)
 298 raise Py4JJavaError(
 299 'An error occurred while calling {0}{1}{2}.\n'.
 --> 300 format(target_id, '.', name), value)
 301 else:
 302 raise Py4JError(
 Py4JJavaError: An error occurred while calling o35.parquet.
 : org.apache.spark.sql.AnalysisException: Reference 'age' is ambiguous, could 
 be: age#0L, age#3L.;
   at 
 org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:279)
   at 
 org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveChildren(LogicalPlan.scala:116)
   at 
 org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$8$$anonfun$applyOrElse$4$$anonfun$16.apply(Analyzer.scala:350)
   at 
 org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$8$$anonfun$applyOrElse$4$$anonfun$16.apply(Analyzer.scala:350)
   at 
 org.apache.spark.sql.catalyst.analysis.package$.withPosition(package.scala:48)
   at 
 org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$8$$anonfun$applyOrElse$4.applyOrElse(Analyzer.scala:350)
   at 
 org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$8$$anonfun$applyOrElse$4.applyOrElse(Analyzer.scala:341)
   at 
 org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:286)
   at 
 org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:286)
   at 
 org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51)
   at 
 org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:285)
   at 
 org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$transformExpressionUp$1(QueryPlan.scala:108)
   at 
 org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2$$anonfun$apply$2.apply(QueryPlan.scala:123)
   at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
   at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
   at scala.collection.immutable.List.foreach(List.scala:318)
   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
   at 
 org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2.apply(QueryPlan.scala:122)
   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
   at 
 scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
   at 
 

[jira] [Resolved] (SPARK-8154) Remove Term/Code type aliases in code generation

2015-06-08 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-8154.

   Resolution: Fixed
Fix Version/s: 1.5.0

 Remove Term/Code type aliases in code generation
 

 Key: SPARK-8154
 URL: https://issues.apache.org/jira/browse/SPARK-8154
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Reynold Xin
Assignee: Reynold Xin
 Fix For: 1.5.0


 From my perspective as a code reviewer, I find them more confusing than using 
 String directly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-8153) Add configuration for disabling partial aggregation in runtime

2015-06-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8153:
---

Assignee: Apache Spark

 Add configuration for disabling partial aggregation in runtime
 --

 Key: SPARK-8153
 URL: https://issues.apache.org/jira/browse/SPARK-8153
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Navis
Assignee: Apache Spark
Priority: Trivial

 Same as hive.map.aggr.hash.min.reduction in Hive, which disables hash 
 aggregation when it is not sufficiently decreasing the output size.
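A sketch of the idea only; the threshold value and where the check would live inside the
aggregation operator are assumptions, mirroring Hive's hive.map.aggr.hash.min.reduction:

{code}
// Analogous to hive.map.aggr.hash.min.reduction (illustrative default).
val minReduction = 0.5

// Checked after the map-side (partial) aggregation has sampled some input rows:
// if the hash table holds nearly one group per input row, partial aggregation is
// not reducing the data, so rows would be forwarded directly to the final aggregation.
def keepPartialAggregation(rowsSeen: Long, distinctGroups: Long): Boolean =
  rowsSeen == 0 || distinctGroups.toDouble / rowsSeen < minReduction
{code}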



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-8096) how to convert dataframe field to LabelPoint

2015-06-08 Thread bofei.xiao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8096?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14573922#comment-14573922
 ] 

bofei.xiao edited comment on SPARK-8096 at 6/8/15 6:28 AM:
---

I'm sorry i haven't expressed my question clearly!


was (Author: bofei.xiao):
I'm sorry i haven't express my question clearly!

 how to convert dataframe field to LabelPoint
 

 Key: SPARK-8096
 URL: https://issues.apache.org/jira/browse/SPARK-8096
 Project: Spark
  Issue Type: Bug
Reporter: bofei.xiao

 How to convert a DataFrame to RDD[LabeledPoint]?
 The DataFrame has fields target, age, sex, height;
 I want to use target as the label and age, sex, height as the features vector.
 I faced this problem in the following circumstance:
 --
 Given a csv file data.csv:
 target,age,sex,height
 1,18,1,170
 0,25,1,165
 .
 Now I want to build a decision tree model.
 Step 1: load the csv data as a DataFrame
 val data = sqlContext.load("com.databricks.spark.csv", Map("path" -> "data.csv", "header" -> "true"))
 Step 2: build a decision tree model
 but DecisionTree needs an RDD[LabeledPoint] as input.
 Thanks!
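A minimal sketch of the conversion being asked about (column order assumed to be target,
age, sex, height; spark-csv loads every column as a string by default, hence the toDouble
calls):

{code}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// `data` is the DataFrame loaded above; DataFrame.map yields an RDD in Spark 1.3/1.4.
val labeled = data.map { row =>
  LabeledPoint(
    row.getString(0).toDouble,                 // target -> label
    Vectors.dense(
      row.getString(1).toDouble,               // age
      row.getString(2).toDouble,               // sex
      row.getString(3).toDouble))              // height
}
// labeled: RDD[LabeledPoint], usable as input to DecisionTree.trainClassifier / trainRegressor.
{code}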



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-8159) Improve expression coverage

2015-06-08 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-8159:
--

 Summary: Improve expression coverage
 Key: SPARK-8159
 URL: https://issues.apache.org/jira/browse/SPARK-8159
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Reynold Xin


This is an umbrella ticket to track new expressions we are adding to 
SQL/DataFrame. For each new expression, we should implement the code-generated 
version as well as comprehensive unit tests (for all the data types the 
expressions support).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4352) Incorporate locality preferences in dynamic allocation requests

2015-06-08 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14576721#comment-14576721
 ] 

Saisai Shao commented on SPARK-4352:


Hi [~sandyr], I re-implemented the allocation strategy according to the 
discussion. The new strategy will:

1. Allocate new containers using the locality distribution ratio for locality-required 
tasks; for locality-free tasks, let Yarn decide the location as in the previous code.
2. Combine the existing executor/host distribution with the newly required containers 
to better suit dynamic allocation.
3. Keep the code logic the same as before if dynamic allocation is not enabled or the 
preferred locality is empty.
4. Give locality-required executors high priority for new containers when the current 
containers are not enough to match all the locality-required tasks.

Please help review; any comment is greatly appreciated.

 Incorporate locality preferences in dynamic allocation requests
 ---

 Key: SPARK-4352
 URL: https://issues.apache.org/jira/browse/SPARK-4352
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core, YARN
Affects Versions: 1.2.0
Reporter: Sandy Ryza
Assignee: Saisai Shao
Priority: Critical
 Attachments: Supportpreferrednodelocationindynamicallocation.pdf


 Currently, achieving data locality in Spark is difficult unless an 
 application takes resources on every node in the cluster.  
 preferredNodeLocalityData provides a sort of hacky workaround that has been 
 broken since 1.0.
 With dynamic executor allocation, Spark requests executors in response to 
 demand from the application.  When this occurs, it would be useful to look at 
 the pending tasks and communicate their location preferences to the cluster 
 resource manager. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-8157) should use expression.unapply to match in HiveTypeCoercion

2015-06-08 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin closed SPARK-8157.
--
Resolution: Not A Problem

see https://github.com/apache/spark/pull/6698

 should use expression.unapply to match in HiveTypeCoercion
 --

 Key: SPARK-8157
 URL: https://issues.apache.org/jira/browse/SPARK-8157
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Adrian Wang

 This is a bug introduced by #6405



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-8157) should use expression.unapply to match in HiveTypeCoercion

2015-06-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8157:
---

Assignee: Apache Spark

 should use expression.unapply to match in HiveTypeCoercion
 --

 Key: SPARK-8157
 URL: https://issues.apache.org/jira/browse/SPARK-8157
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Adrian Wang
Assignee: Apache Spark

 This is a bug introduced by #6405



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-8157) should use expression.unapply to match in HiveTypeCoercion

2015-06-08 Thread Adrian Wang (JIRA)
Adrian Wang created SPARK-8157:
--

 Summary: should use expression.unapply to match in HiveTypeCoercion
 Key: SPARK-8157
 URL: https://issues.apache.org/jira/browse/SPARK-8157
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Adrian Wang


This is a bug introduced by #6405



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8158) HiveShim improvement

2015-06-08 Thread Adrian Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrian Wang updated SPARK-8158:
---
Component/s: SQL

 HiveShim improvement
 

 Key: SPARK-8158
 URL: https://issues.apache.org/jira/browse/SPARK-8158
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Adrian Wang

 1. explicitly import implicit conversion support.
 2. use .nonEmpty instead of .size > 0
 3. use val instead of var
 4. fix comment indentation



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7082) Binary processing external sort-merge join

2015-06-08 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-7082:
---
Summary: Binary processing external sort-merge join  (was: Binary 
processing sort-merge join)

 Binary processing external sort-merge join
 --

 Key: SPARK-7082
 URL: https://issues.apache.org/jira/browse/SPARK-7082
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Reporter: Reynold Xin





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-8160) Tungsten style external aggregation

2015-06-08 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-8160:
--

 Summary: Tungsten style external aggregation
 Key: SPARK-8160
 URL: https://issues.apache.org/jira/browse/SPARK-8160
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Reynold Xin


Support using external sorting to run aggregations so we can easily process 
aggregates where each partition is much larger than the available memory.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7792) HiveContext registerTempTable not thread safe

2015-06-08 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14576681#comment-14576681
 ] 

Apache Spark commented on SPARK-7792:
-

User 'navis' has created a pull request for this issue:
https://github.com/apache/spark/pull/6699

 HiveContext registerTempTable not thread safe
 -

 Key: SPARK-7792
 URL: https://issues.apache.org/jira/browse/SPARK-7792
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.1
Reporter: Yana Kadiyska

 {code:java}
 import java.util.List;
 import java.util.concurrent.ExecutorService;
 import java.util.concurrent.Executors;
 import java.util.concurrent.TimeUnit;
 import java.util.concurrent.atomic.AtomicLong;
 import com.google.common.base.Joiner;
 import org.apache.spark.SparkConf;
 import org.apache.spark.SparkContext;
 import org.apache.spark.sql.DataFrame;
 import org.apache.spark.sql.Row;

 public class ThreadRepro {
     public static void main(String[] args) throws Exception {
         new ThreadRepro().sparkPerfTest();
     }

     public void sparkPerfTest() {
         final AtomicLong counter = new AtomicLong();
         SparkConf conf = new SparkConf();
         conf.setAppName("My Application");
         conf.setMaster("local[7]");
         SparkContext sc = new SparkContext(conf);
         org.apache.spark.sql.hive.HiveContext hc =
             new org.apache.spark.sql.hive.HiveContext(sc);
         int poolSize = 10;
         ExecutorService pool = Executors.newFixedThreadPool(poolSize);
         for (int i = 0; i < poolSize; i++)
             pool.execute(new QueryJob(hc, i, counter));
         pool.shutdown();
         try {
             pool.awaitTermination(60, TimeUnit.MINUTES);
         } catch (Exception e) {
             System.out.println("Thread interrupted");
         }
         System.out.println("All jobs complete");
         System.out.println(" Counter is " + counter.get());
     }
 }

 class QueryJob implements Runnable {
     String threadId;
     org.apache.spark.sql.hive.HiveContext sqlContext;
     String key;
     AtomicLong counter;
     final AtomicLong local_counter = new AtomicLong();

     public QueryJob(org.apache.spark.sql.hive.HiveContext _sqlContext, int id, AtomicLong ctr) {
         threadId = "thread_" + id;
         this.sqlContext = _sqlContext;
         this.counter = ctr;
     }

     public void run() {
         for (int i = 0; i < 100; i++) {
             String tblName = threadId + "_" + i;
             DataFrame df = sqlContext.emptyDataFrame();
             df.registerTempTable(tblName);
             String _query = String.format("select count(*) from %s", tblName);
             System.out.println(String.format(" registered table %s; catalog (%s) ", tblName, debugTables()));
             List<Row> res;
             try {
                 res = sqlContext.sql(_query).collectAsList();
             } catch (Exception e) {
                 System.out.println("*Exception " + debugTables() + "**");
                 throw e;
             }
             sqlContext.dropTempTable(tblName);
             System.out.println(" dropped table " + tblName);
             try {
                 Thread.sleep(3000); // lets make this a not-so-tight loop
             } catch (Exception e) {
                 System.out.println("Thread interrupted");
             }
         }
     }

     private String debugTables() {
         String v = Joiner.on(',').join(sqlContext.tableNames());
         if (v == null) return ""; else return v;
     }
 }
 {code}
 this will periodically produce the following:
 {quote}
  registered table thread_0_50; catalog (thread_1_50)
  registered table thread_4_50; catalog (thread_4_50,thread_1_50)
  registered table thread_1_50; catalog (thread_1_50)
  dropped table thread_1_50
  dropped table thread_4_50
 *Exception **
 Exception in thread pool-6-thread-1 java.lang.Error: 
 org.apache.spark.sql.AnalysisException: no such table thread_0_50; line 1 pos 
 21
   at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1151)
   at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
   at java.lang.Thread.run(Thread.java:745)
 Caused by: org.apache.spark.sql.AnalysisException: no such table thread_0_50; 
 line 1 pos 21
   at 
 org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
   at 
 org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.getTable(Analyzer.scala:177)
   at 
 org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$6.applyOrElse(Analyzer.scala:186)
   at 
 org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$6.applyOrElse(Analyzer.scala:181)
   at 
 org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:188)
   at 
 org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:188)
   at 
 org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51)
   at 
 org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:187)
   at 
 org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:208)
   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
   at 

[jira] [Assigned] (SPARK-7792) HiveContext registerTempTable not thread safe

2015-06-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7792:
---

Assignee: Apache Spark

 HiveContext registerTempTable not thread safe
 -

 Key: SPARK-7792
 URL: https://issues.apache.org/jira/browse/SPARK-7792
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.1
Reporter: Yana Kadiyska
Assignee: Apache Spark

 {code:java}
 import java.util.List;
 import java.util.concurrent.ExecutorService;
 import java.util.concurrent.Executors;
 import java.util.concurrent.TimeUnit;
 import java.util.concurrent.atomic.AtomicLong;
 import com.google.common.base.Joiner;
 import org.apache.spark.SparkConf;
 import org.apache.spark.SparkContext;
 import org.apache.spark.sql.DataFrame;
 import org.apache.spark.sql.Row;

 public class ThreadRepro {
     public static void main(String[] args) throws Exception {
         new ThreadRepro().sparkPerfTest();
     }

     public void sparkPerfTest() {
         final AtomicLong counter = new AtomicLong();
         SparkConf conf = new SparkConf();
         conf.setAppName("My Application");
         conf.setMaster("local[7]");
         SparkContext sc = new SparkContext(conf);
         org.apache.spark.sql.hive.HiveContext hc =
             new org.apache.spark.sql.hive.HiveContext(sc);
         int poolSize = 10;
         ExecutorService pool = Executors.newFixedThreadPool(poolSize);
         for (int i = 0; i < poolSize; i++)
             pool.execute(new QueryJob(hc, i, counter));
         pool.shutdown();
         try {
             pool.awaitTermination(60, TimeUnit.MINUTES);
         } catch (Exception e) {
             System.out.println("Thread interrupted");
         }
         System.out.println("All jobs complete");
         System.out.println(" Counter is " + counter.get());
     }
 }

 class QueryJob implements Runnable {
     String threadId;
     org.apache.spark.sql.hive.HiveContext sqlContext;
     String key;
     AtomicLong counter;
     final AtomicLong local_counter = new AtomicLong();

     public QueryJob(org.apache.spark.sql.hive.HiveContext _sqlContext, int id, AtomicLong ctr) {
         threadId = "thread_" + id;
         this.sqlContext = _sqlContext;
         this.counter = ctr;
     }

     public void run() {
         for (int i = 0; i < 100; i++) {
             String tblName = threadId + "_" + i;
             DataFrame df = sqlContext.emptyDataFrame();
             df.registerTempTable(tblName);
             String _query = String.format("select count(*) from %s", tblName);
             System.out.println(String.format(" registered table %s; catalog (%s) ", tblName, debugTables()));
             List<Row> res;
             try {
                 res = sqlContext.sql(_query).collectAsList();
             } catch (Exception e) {
                 System.out.println("*Exception " + debugTables() + "**");
                 throw e;
             }
             sqlContext.dropTempTable(tblName);
             System.out.println(" dropped table " + tblName);
             try {
                 Thread.sleep(3000); // lets make this a not-so-tight loop
             } catch (Exception e) {
                 System.out.println("Thread interrupted");
             }
         }
     }

     private String debugTables() {
         String v = Joiner.on(',').join(sqlContext.tableNames());
         if (v == null) return ""; else return v;
     }
 }
 {code}
 this will periodically produce the following:
 {quote}
  registered table thread_0_50; catalog (thread_1_50)
  registered table thread_4_50; catalog (thread_4_50,thread_1_50)
  registered table thread_1_50; catalog (thread_1_50)
  dropped table thread_1_50
  dropped table thread_4_50
 *Exception **
 Exception in thread pool-6-thread-1 java.lang.Error: 
 org.apache.spark.sql.AnalysisException: no such table thread_0_50; line 1 pos 
 21
   at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1151)
   at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
   at java.lang.Thread.run(Thread.java:745)
 Caused by: org.apache.spark.sql.AnalysisException: no such table thread_0_50; 
 line 1 pos 21
   at 
 org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
   at 
 org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.getTable(Analyzer.scala:177)
   at 
 org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$6.applyOrElse(Analyzer.scala:186)
   at 
 org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$6.applyOrElse(Analyzer.scala:181)
   at 
 org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:188)
   at 
 org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:188)
   at 
 org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51)
   at 
 org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:187)
   at 
 org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:208)
   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
   at 

[jira] [Created] (SPARK-8158) HiveShim improvement

2015-06-08 Thread Adrian Wang (JIRA)
Adrian Wang created SPARK-8158:
--

 Summary: HiveShim improvement
 Key: SPARK-8158
 URL: https://issues.apache.org/jira/browse/SPARK-8158
 Project: Spark
  Issue Type: Improvement
Reporter: Adrian Wang


1. explicitly import implicit conversion support.
2. use .nonEmpty instead of .size > 0
3. use val instead of var
4. fix comment indentation



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-8011) DecimalType is not a datatype

2015-06-08 Thread Bipin Roshan Nag (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bipin Roshan Nag resolved SPARK-8011.
-
Resolution: Not A Problem

 DecimalType is not a datatype
 -

 Key: SPARK-8011
 URL: https://issues.apache.org/jira/browse/SPARK-8011
 Project: Spark
  Issue Type: Bug
  Components: Java API, Spark Core
Affects Versions: 1.3.1
Reporter: Bipin Roshan Nag

 When I run the following in spark-shell:
  StructType(StructField("ID", IntegerType, true), StructField("Value", DecimalType, true))
 I get
 <console>:50: error: type mismatch;
  found   : org.apache.spark.sql.types.DecimalType.type
  required: org.apache.spark.sql.types.DataType
        StructType(StructField("ID", IntegerType, true), StructField("Value", DecimalType, true))



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8011) DecimalType is not a datatype

2015-06-08 Thread Bipin Roshan Nag (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14576709#comment-14576709
 ] 

Bipin Roshan Nag commented on SPARK-8011:
-

It works. Thanks.
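For reference, the fix is presumably to use a parameterized DecimalType instance (which is
a DataType); the precision and scale below are chosen for illustration only:

{code}
import org.apache.spark.sql.types._

// DecimalType by itself is the companion object, not a DataType; an instance such as
// DecimalType(10, 2) is what the schema needs.
val schema = StructType(Seq(
  StructField("ID", IntegerType, true),
  StructField("Value", DecimalType(10, 2), true)))
{code}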

 DecimalType is not a datatype
 -

 Key: SPARK-8011
 URL: https://issues.apache.org/jira/browse/SPARK-8011
 Project: Spark
  Issue Type: Bug
  Components: Java API, Spark Core
Affects Versions: 1.3.1
Reporter: Bipin Roshan Nag

 When I run the following in spark-shell:
  StructType(StructField("ID", IntegerType, true), StructField("Value", DecimalType, true))
 I get
 <console>:50: error: type mismatch;
  found   : org.apache.spark.sql.types.DecimalType.type
  required: org.apache.spark.sql.types.DataType
        StructType(StructField("ID", IntegerType, true), StructField("Value", DecimalType, true))



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


