[jira] [Commented] (SPARK-8279) udf_round_3 test fails
[ https://issues.apache.org/jira/browse/SPARK-8279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14585743#comment-14585743 ] Yijie Shen commented on SPARK-8279: --- Seems this has been fixed in master branch? udf_round_3 test fails -- Key: SPARK-8279 URL: https://issues.apache.org/jira/browse/SPARK-8279 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Priority: Blocker query {code} select round(cast(negative(pow(2, 31)) as INT)), round(cast((pow(2, 31) - 1) as INT)), round(-32769), round(32768) from src tablesample (1 rows); {code} {code} [info] - udf_round_3 *** FAILED *** (4 seconds, 803 milliseconds) [info] Failed to execute query using catalyst: [info] Error: java.lang.Integer cannot be cast to java.lang.Double [info] java.lang.ClassCastException: java.lang.Integer cannot be cast to java.lang.Double [info]at scala.runtime.BoxesRunTime.unboxToDouble(BoxesRunTime.java:119) [info]at org.apache.spark.sql.catalyst.expressions.BinaryMathExpression.eval(math.scala:86) [info]at org.apache.spark.sql.hive.HiveInspectors$class.toInspector(HiveInspectors.scala:628) [info]at org.apache.spark.sql.hive.HiveGenericUdf.toInspector(hiveUdfs.scala:148) [info]at org.apache.spark.sql.hive.HiveGenericUdf$$anonfun$argumentInspectors$1.apply(hiveUdfs.scala:160) [info]at org.apache.spark.sql.hive.HiveGenericUdf$$anonfun$argumentInspectors$1.apply(hiveUdfs.scala:160) [info]at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) [info]at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) [info]at scala.collection.immutable.List.foreach(List.scala:318) [info]at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) [info]at scala.collection.AbstractTraversable.map(Traversable.scala:105) [info]at org.apache.spark.sql.hive.HiveGenericUdf.argumentInspectors$lzycompute(hiveUdfs.scala:160) [info]at org.apache.spark.sql.hive.HiveGenericUdf.argumentInspectors(hiveUdfs.scala:160) [info]at org.apache.spark.sql.hive.HiveGenericUdf.returnInspector$lzycompute(hiveUdfs.scala:164) [info]at org.apache.spark.sql.hive.HiveGenericUdf.returnInspector(hiveUdfs.scala:163) [info]at org.apache.spark.sql.hive.HiveGenericUdf.dataType$lzycompute(hiveUdfs.scala:180) [info]at org.apache.spark.sql.hive.HiveGenericUdf.dataType(hiveUdfs.scala:180) [info]at org.apache.spark.sql.catalyst.expressions.Cast.resolved$lzycompute(Cast.scala:31) [info]at org.apache.spark.sql.catalyst.expressions.Cast.resolved(Cast.scala:31) [info]at org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$childrenResolved$1.apply(Expression.scala:121) [info]at org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$childrenResolved$1.apply(Expression.scala:121) [info]at scala.collection.LinearSeqOptimized$class.forall(LinearSeqOptimized.scala:70) [info]at scala.collection.immutable.List.forall(List.scala:84) [info]at org.apache.spark.sql.catalyst.expressions.Expression.childrenResolved(Expression.scala:121) [info]at org.apache.spark.sql.catalyst.expressions.Expression.resolved$lzycompute(Expression.scala:109) [info]at org.apache.spark.sql.catalyst.expressions.Expression.resolved(Expression.scala:109) [info]at org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$childrenResolved$1.apply(Expression.scala:121) [info]at org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$childrenResolved$1.apply(Expression.scala:121) [info]at scala.collection.LinearSeqOptimized$class.forall(LinearSeqOptimized.scala:70) 
[info]at scala.collection.immutable.List.forall(List.scala:84) [info]at org.apache.spark.sql.catalyst.expressions.Expression.childrenResolved(Expression.scala:121) [info]at org.apache.spark.sql.catalyst.analysis.HiveTypeCoercion$ConvertNaNs$$anonfun$apply$2$$anonfun$applyOrElse$2.applyOrElse(HiveTypeCoercion.scala:138) [info]at org.apache.spark.sql.catalyst.analysis.HiveTypeCoercion$ConvertNaNs$$anonfun$apply$2$$anonfun$applyOrElse$2.applyOrElse(HiveTypeCoercion.scala:136) [info]at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:222) [info]at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:222) [info]at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51) [info]at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:221) [info]at
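The top frames of the (truncated) trace show BinaryMathExpression.eval unboxing one of its arguments as a Double while the value is actually a boxed Integer, such as the integer literals in pow(2, 31). A minimal, Spark-free sketch of that unboxing failure, using only plain Scala:

{code}
// Illustration of the failure mode in the trace (not Spark code): unboxing a value
// that was boxed as java.lang.Integer via unboxToDouble throws ClassCastException.
object UnboxRepro {
  def main(args: Array[String]): Unit = {
    val arg: Any = 2                  // boxed as java.lang.Integer, like a literal in pow(2, 31)
    val d = arg.asInstanceOf[Double]  // compiles to BoxesRunTime.unboxToDouble(arg) and throws
                                      // java.lang.ClassCastException: java.lang.Integer cannot be cast to java.lang.Double
    println(d)
  }
}
{code}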
[jira] [Assigned] (SPARK-8283) udf_struct test failure
[ https://issues.apache.org/jira/browse/SPARK-8283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8283: --- Assignee: Apache Spark udf_struct test failure --- Key: SPARK-8283 URL: https://issues.apache.org/jira/browse/SPARK-8283 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Apache Spark Priority: Blocker {code} [info] - udf_struct *** FAILED *** (704 milliseconds) [info] Failed to execute query using catalyst: [info] Error: org.apache.spark.sql.catalyst.expressions.Literal cannot be cast to org.apache.spark.sql.catalyst.expressions.NamedExpression [info] java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.Literal cannot be cast to org.apache.spark.sql.catalyst.expressions.NamedExpression [info]at org.apache.spark.sql.catalyst.expressions.CreateStruct$$anonfun$1.apply(complexTypes.scala:64) [info]at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) [info]at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) [info]at scala.collection.immutable.List.foreach(List.scala:318) [info]at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) [info]at scala.collection.AbstractTraversable.map(Traversable.scala:105) [info]at org.apache.spark.sql.catalyst.expressions.CreateStruct.dataType$lzycompute(complexTypes.scala:64) [info]at org.apache.spark.sql.catalyst.expressions.CreateStruct.dataType(complexTypes.scala:61) [info]at org.apache.spark.sql.catalyst.expressions.CreateStruct.dataType(complexTypes.scala:55) [info]at org.apache.spark.sql.catalyst.expressions.ExtractValue$.apply(ExtractValue.scala:43) [info]at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$8$$anonfun$applyOrElse$4.applyOrElse(Analyzer.scala:353) [info]at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$8$$anonfun$applyOrElse$4.applyOrElse(Analyzer.scala:340) [info]at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:286) [info]at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:286) [info]at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51) [info]at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:285) [info]at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:299) [info]at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) [info]at scala.collection.Iterator$class.foreach(Iterator.scala:727) [info]at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) [info]at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) [info]at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103) [info]at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47) [info]at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273) [info]at scala.collection.AbstractIterator.to(Iterator.scala:1157) [info]at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
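The top frame points at CreateStruct deriving its field names by casting each child to NamedExpression, which fails as soon as a child is a plain Literal. A hypothetical, self-contained sketch of that failure mode and a cast-free alternative; the types below are stand-ins, not Spark's expression classes:

{code}
// Stand-in types only: shows why casting every child to a named expression fails
// on a literal child and how pattern matching with a fallback name avoids it.
sealed trait Expr
case class Literal(value: Any) extends Expr
case class Alias(name: String, child: Expr) extends Expr

def fieldNamesByCast(children: Seq[Expr]): Seq[String] =
  children.map(_.asInstanceOf[Alias].name)      // ClassCastException on a Literal child

def fieldNamesByMatch(children: Seq[Expr]): Seq[String] =
  children.zipWithIndex.map {
    case (Alias(name, _), _) => name
    case (_, i)              => s"col$i"        // fall back to a positional name
  }
{code}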
[jira] [Commented] (SPARK-8283) udf_struct test failure
[ https://issues.apache.org/jira/browse/SPARK-8283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14585661#comment-14585661 ] Apache Spark commented on SPARK-8283: - User 'yijieshen' has created a pull request for this issue: https://github.com/apache/spark/pull/6828 udf_struct test failure --- Key: SPARK-8283 URL: https://issues.apache.org/jira/browse/SPARK-8283 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Priority: Blocker {code} [info] - udf_struct *** FAILED *** (704 milliseconds) [info] Failed to execute query using catalyst: [info] Error: org.apache.spark.sql.catalyst.expressions.Literal cannot be cast to org.apache.spark.sql.catalyst.expressions.NamedExpression [info] java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.Literal cannot be cast to org.apache.spark.sql.catalyst.expressions.NamedExpression [info]at org.apache.spark.sql.catalyst.expressions.CreateStruct$$anonfun$1.apply(complexTypes.scala:64) [info]at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) [info]at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) [info]at scala.collection.immutable.List.foreach(List.scala:318) [info]at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) [info]at scala.collection.AbstractTraversable.map(Traversable.scala:105) [info]at org.apache.spark.sql.catalyst.expressions.CreateStruct.dataType$lzycompute(complexTypes.scala:64) [info]at org.apache.spark.sql.catalyst.expressions.CreateStruct.dataType(complexTypes.scala:61) [info]at org.apache.spark.sql.catalyst.expressions.CreateStruct.dataType(complexTypes.scala:55) [info]at org.apache.spark.sql.catalyst.expressions.ExtractValue$.apply(ExtractValue.scala:43) [info]at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$8$$anonfun$applyOrElse$4.applyOrElse(Analyzer.scala:353) [info]at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$8$$anonfun$applyOrElse$4.applyOrElse(Analyzer.scala:340) [info]at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:286) [info]at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:286) [info]at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51) [info]at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:285) [info]at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:299) [info]at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) [info]at scala.collection.Iterator$class.foreach(Iterator.scala:727) [info]at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) [info]at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) [info]at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103) [info]at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47) [info]at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273) [info]at scala.collection.AbstractIterator.to(Iterator.scala:1157) [info]at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8283) udf_struct test failure
[ https://issues.apache.org/jira/browse/SPARK-8283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8283: --- Assignee: (was: Apache Spark) udf_struct test failure --- Key: SPARK-8283 URL: https://issues.apache.org/jira/browse/SPARK-8283 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Priority: Blocker {code} [info] - udf_struct *** FAILED *** (704 milliseconds) [info] Failed to execute query using catalyst: [info] Error: org.apache.spark.sql.catalyst.expressions.Literal cannot be cast to org.apache.spark.sql.catalyst.expressions.NamedExpression [info] java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.Literal cannot be cast to org.apache.spark.sql.catalyst.expressions.NamedExpression [info]at org.apache.spark.sql.catalyst.expressions.CreateStruct$$anonfun$1.apply(complexTypes.scala:64) [info]at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) [info]at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) [info]at scala.collection.immutable.List.foreach(List.scala:318) [info]at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) [info]at scala.collection.AbstractTraversable.map(Traversable.scala:105) [info]at org.apache.spark.sql.catalyst.expressions.CreateStruct.dataType$lzycompute(complexTypes.scala:64) [info]at org.apache.spark.sql.catalyst.expressions.CreateStruct.dataType(complexTypes.scala:61) [info]at org.apache.spark.sql.catalyst.expressions.CreateStruct.dataType(complexTypes.scala:55) [info]at org.apache.spark.sql.catalyst.expressions.ExtractValue$.apply(ExtractValue.scala:43) [info]at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$8$$anonfun$applyOrElse$4.applyOrElse(Analyzer.scala:353) [info]at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$8$$anonfun$applyOrElse$4.applyOrElse(Analyzer.scala:340) [info]at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:286) [info]at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:286) [info]at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51) [info]at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:285) [info]at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:299) [info]at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) [info]at scala.collection.Iterator$class.foreach(Iterator.scala:727) [info]at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) [info]at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) [info]at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103) [info]at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47) [info]at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273) [info]at scala.collection.AbstractIterator.to(Iterator.scala:1157) [info]at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-8375) BinaryClassificationMetrics in ML Lib has odd API
sam created SPARK-8375: -- Summary: BinaryClassificationMetrics in ML Lib has odd API Key: SPARK-8375 URL: https://issues.apache.org/jira/browse/SPARK-8375 Project: Spark Issue Type: Bug Components: MLlib Reporter: sam According to https://spark.apache.org/docs/1.0.0/api/scala/index.html#org.apache.spark.mllib.evaluation.BinaryClassificationMetrics The constructor takes `RDD[(Double, Double)]` which does not make sense it should be `RDD[(Double, T)]` or at least `RDD[(Double, Int)]`. In scikit I believe they use the number of unique scores to determine the number of thresholds and the ROC. I assume this is what BinaryClassificationMetrics is doing since it makes no mention of buckets. In a Big Data context this does not make as the number of unique scores may be huge. Rather user should be able to either specify the number of buckets, or the number of data points in each bucket. E.g. `def roc(numPtsPerBucket: Int)` Finally it would then be good if either the ROC output type was changed or another method was added that returned confusion matricies, so that the hard integer values can be obtained. E.g. ``` case class Confusion(tp: Int, fp: Int, fn: Int, tn: Int) { // bunch of methods for each of the things in the table here https://en.wikipedia.org/wiki/Receiver_operating_characteristic } ... def confusions(numPtsPerBucket: Int): RDD[Confusion] ``` -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
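As a concrete reading of the proposal, here is a hedged sketch of what a confusions(numPtsPerBucket: Int)-style method could compute from (score, label) pairs. It uses plain Scala collections rather than RDDs and a naive bucketing rule; none of this is existing MLlib API:

{code}
// Sketch only: bucket the sorted scores, take one threshold per bucket, and emit
// the integer confusion counts at each threshold. Labels are assumed to be Booleans.
case class Confusion(tp: Int, fp: Int, fn: Int, tn: Int)

def confusions(scored: Seq[(Double, Boolean)], numPtsPerBucket: Int): Seq[Confusion] = {
  val thresholds = scored.map(_._1).sorted
    .grouped(math.max(1, numPtsPerBucket))
    .map(_.head)                                // one representative threshold per bucket
    .toSeq
  thresholds.map { t =>
    val tp = scored.count { case (s, y) => s >= t && y }
    val fp = scored.count { case (s, y) => s >= t && !y }
    val fn = scored.count { case (s, y) => s < t && y }
    val tn = scored.count { case (s, y) => s < t && !y }
    Confusion(tp, fp, fn, tn)
  }
}

// confusions(Seq((0.9, true), (0.4, false), (0.2, true)), numPtsPerBucket = 1)
// yields one Confusion per distinct bucket threshold.
{code}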
[jira] [Commented] (SPARK-2898) Failed to connect to daemon
[ https://issues.apache.org/jira/browse/SPARK-2898?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14585809#comment-14585809 ] Peter Taylor commented on SPARK-2898: - FYI java.io.IOException: Cannot run program python: error=316, Unknown error: 316 I have seen this error to occur on mac because lib/jspawnhelper is missing execute permissions in your jre. Failed to connect to daemon --- Key: SPARK-2898 URL: https://issues.apache.org/jira/browse/SPARK-2898 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.1.0 Reporter: Davies Liu Assignee: Davies Liu Fix For: 1.1.0 There is a deadlock in handle_sigchld() because of logging Java options: -Dspark.storage.memoryFraction=0.66 -Dspark.serializer=org.apache.spark.serializer.JavaSerializer -Dspark.executor.memory=3g -Dspark.locality.wait=6000 Options: SchedulerThroughputTest --num-tasks=1 --num-trials=4 --inter-trial-wait=1 14/08/06 22:09:41 WARN JettyUtils: Failed to create UI on port 4040. Trying again on port 4041. - Failure(java.net.BindException: Address already in use) worker 50114 crashed abruptly with exit status 1 14/08/06 22:10:37 ERROR Executor: Exception in task 1476.0 in stage 1.0 (TID 11476) org.apache.spark.SparkException: Python worker exited unexpectedly (crashed) at org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:150) at org.apache.spark.api.python.PythonRDD$$anon$1.init(PythonRDD.scala:154) at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:87) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62) at org.apache.spark.scheduler.Task.run(Task.scala:54) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:199) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) Caused by: java.io.EOFException at java.io.DataInputStream.readInt(DataInputStream.java:392) at org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:101) ... 
10 more 14/08/06 22:10:37 WARN PythonWorkerFactory: Failed to open socket to Python daemon: java.net.ConnectException: Connection refused at java.net.PlainSocketImpl.socketConnect(Native Method) at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339) at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200) at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182) at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392) at java.net.Socket.connect(Socket.java:579) at java.net.Socket.connect(Socket.java:528) at java.net.Socket.init(Socket.java:425) at java.net.Socket.init(Socket.java:241) at org.apache.spark.api.python.PythonWorkerFactory.createSocket$1(PythonWorkerFactory.scala:68) at org.apache.spark.api.python.PythonWorkerFactory.liftedTree1$1(PythonWorkerFactory.scala:83) at org.apache.spark.api.python.PythonWorkerFactory.createThroughDaemon(PythonWorkerFactory.scala:82) at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:55) at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:101) at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:66) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62) at org.apache.spark.scheduler.Task.run(Task.scala:54) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:199) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) 14/08/06 22:10:37 ERROR Executor: Exception in task 1478.0 in stage 1.0 (TID 11478) java.io.EOFException at java.io.DataInputStream.readInt(DataInputStream.java:392) at org.apache.spark.api.python.PythonWorkerFactory.createSocket$1(PythonWorkerFactory.scala:69) at org.apache.spark.api.python.PythonWorkerFactory.liftedTree1$1(PythonWorkerFactory.scala:83) at
[jira] [Created] (SPARK-8374) Job frequently hangs after YARN preemption
Shay Rojansky created SPARK-8374: Summary: Job frequently hangs after YARN preemption Key: SPARK-8374 URL: https://issues.apache.org/jira/browse/SPARK-8374 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.4.0 Environment: YARN 2.7.0, Spark 1.4.0, Ubuntu 14.04 Reporter: Shay Rojansky Priority: Critical After upgrading to Spark 1.4.0, jobs that get preempted very frequently will not reacquire executors and will therefore hang. To reproduce: 1. I run Spark job A that acquires all grid resources 2. I run Spark job B in a higher-priority queue that acquires all grid resources. Job A is fully preempted. 3. Kill job B, releasing all resources 4. Job A should at this point reacquire all grid resources, but occasionally doesn't. Repeating the preemption scenario makes the bad behavior occur within a few attempts. (see logs at bottom). Note issue SPARK-7451 that was supposed to fix some Spark YARN preemption issues, maybe the work there is related to the new issues. The 1.4.0 preemption situation is considerably worse than 1.3.1 (we've downgraded to 1.3.1 just because of this issue). Logs -- When job B (the preemptor first acquires an application master, the following is logged by job A (the preemptee): {noformat} ERROR YarnScheduler: Lost executor 447 on g023.grid.eaglerd.local: remote Rpc client disassociated INFO TaskSetManager: Re-queueing tasks for 447 from TaskSet 0.0 WARN ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkexecu...@g023.grid.eaglerd.local:54167] has failed, address is now gated for [5000] ms. Reason is: [Disassociated]. WARN TaskSetManager: Lost task 15.0 in stage 0.0 (TID 15, g023.grid.eaglerd.local): ExecutorLostFailure (executor 447 lost) INFO DAGScheduler: Executor lost: 447 (epoch 0) INFO BlockManagerMasterEndpoint: Trying to remove executor 447 from BlockManagerMaster. INFO BlockManagerMasterEndpoint: Removing block manager BlockManagerId(447, g023.grid.eaglerd.local, 41406) INFO BlockManagerMaster: Removed 447 successfully in removeExecutor {noformat} (It's strange for errors/warnings to be logged for preemption) Later, when job B's AM starts requesting its resources, I get lots of the following in job A: {noformat} ERROR YarnScheduler: Lost executor 415 on g033.grid.eaglerd.local: remote Rpc client disassociated INFO TaskSetManager: Re-queueing tasks for 415 from TaskSet 0.0 WARN TaskSetManager: Lost task 231.0 in stage 0.0 (TID 231, g033.grid.eaglerd.local): ExecutorLostFailure (executor 415 lost) WARN ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkexecu...@g023.grid.eaglerd.local:34357] has failed, address is now gated for [5000] ms. Reason is: [Disassociated]. {noformat} Finally, when I kill job B, job A emits lots of the following: {noformat} INFO YarnClientSchedulerBackend: Requesting to kill executor(s) 31 WARN YarnClientSchedulerBackend: Executor to kill 31 does not exist! {noformat} And finally after some time: {noformat} WARN HeartbeatReceiver: Removing executor 466 with no recent heartbeats: 165964 ms exceeds timeout 12 ms ERROR YarnScheduler: Lost an executor 466 (already removed): Executor heartbeat timed out after 165964 ms {noformat} At this point the job never requests/acquires more resources and hangs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8375) BinaryClassificationMetrics in ML Lib has odd API
[ https://issues.apache.org/jira/browse/SPARK-8375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sam updated SPARK-8375: --- Description: According to https://spark.apache.org/docs/1.0.0/api/scala/index.html#org.apache.spark.mllib.evaluation.BinaryClassificationMetrics The constructor takes `RDD[(Double, Double)]` which does not make sense it should be `RDD[(Double, T)]` or at least `RDD[(Double, Int)]`. In scikit I believe they use the number of unique scores to determine the number of thresholds and the ROC. I assume this is what BinaryClassificationMetrics is doing since it makes no mention of buckets. In a Big Data context this does not make sense as the number of unique scores may be huge. Rather user should be able to either specify the number of buckets, or the number of data points in each bucket. E.g. `def roc(numPtsPerBucket: Int)` Finally it would then be good if either the ROC output type was changed or another method was added that returned confusion matricies, so that the hard integer values can be obtained. E.g. ``` case class Confusion(tp: Int, fp: Int, fn: Int, tn: Int) { // bunch of methods for each of the things in the table here https://en.wikipedia.org/wiki/Receiver_operating_characteristic } ... def confusions(numPtsPerBucket: Int): RDD[Confusion] ``` was: According to https://spark.apache.org/docs/1.0.0/api/scala/index.html#org.apache.spark.mllib.evaluation.BinaryClassificationMetrics The constructor takes `RDD[(Double, Double)]` which does not make sense it should be `RDD[(Double, T)]` or at least `RDD[(Double, Int)]`. In scikit I believe they use the number of unique scores to determine the number of thresholds and the ROC. I assume this is what BinaryClassificationMetrics is doing since it makes no mention of buckets. In a Big Data context this does not make as the number of unique scores may be huge. Rather user should be able to either specify the number of buckets, or the number of data points in each bucket. E.g. `def roc(numPtsPerBucket: Int)` Finally it would then be good if either the ROC output type was changed or another method was added that returned confusion matricies, so that the hard integer values can be obtained. E.g. ``` case class Confusion(tp: Int, fp: Int, fn: Int, tn: Int) { // bunch of methods for each of the things in the table here https://en.wikipedia.org/wiki/Receiver_operating_characteristic } ... def confusions(numPtsPerBucket: Int): RDD[Confusion] ``` BinaryClassificationMetrics in ML Lib has odd API - Key: SPARK-8375 URL: https://issues.apache.org/jira/browse/SPARK-8375 Project: Spark Issue Type: Bug Components: MLlib Reporter: sam According to https://spark.apache.org/docs/1.0.0/api/scala/index.html#org.apache.spark.mllib.evaluation.BinaryClassificationMetrics The constructor takes `RDD[(Double, Double)]` which does not make sense it should be `RDD[(Double, T)]` or at least `RDD[(Double, Int)]`. In scikit I believe they use the number of unique scores to determine the number of thresholds and the ROC. I assume this is what BinaryClassificationMetrics is doing since it makes no mention of buckets. In a Big Data context this does not make sense as the number of unique scores may be huge. Rather user should be able to either specify the number of buckets, or the number of data points in each bucket. E.g. 
`def roc(numPtsPerBucket: Int)` Finally it would then be good if either the ROC output type was changed or another method was added that returned confusion matricies, so that the hard integer values can be obtained. E.g. ``` case class Confusion(tp: Int, fp: Int, fn: Int, tn: Int) { // bunch of methods for each of the things in the table here https://en.wikipedia.org/wiki/Receiver_operating_characteristic } ... def confusions(numPtsPerBucket: Int): RDD[Confusion] ``` -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-8376) Commons Lang 3 is one of the required JAR of Spark Flume Sink but is missing in the docs
Shixiong Zhu created SPARK-8376: --- Summary: Commons Lang 3 is one of the required JAR of Spark Flume Sink but is missing in the docs Key: SPARK-8376 URL: https://issues.apache.org/jira/browse/SPARK-8376 Project: Spark Issue Type: Bug Components: Documentation Reporter: Shixiong Zhu Priority: Minor Commons Lang 3 was added as one of the dependencies of Spark Flume Sink in https://github.com/apache/spark/pull/5703. However, the docs have not yet been updated. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
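For reference, the dependency the docs would need to mention is the standard Commons Lang 3 artifact. In sbt it would look roughly like the line below; the version shown is only a placeholder assumption, the correct one is whatever Spark's build pins for the Flume sink:

{code}
// sbt, illustrative only - align the version with the one Spark's Flume sink build uses.
libraryDependencies += "org.apache.commons" % "commons-lang3" % "3.3.2"
{code}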
[jira] [Commented] (SPARK-8373) When an RDD has no partition, Python sum will throw Can not reduce() empty RDD
[ https://issues.apache.org/jira/browse/SPARK-8373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14585884#comment-14585884 ] Sean Owen commented on SPARK-8373: -- Really the same as SPARK-6878 https://github.com/apache/spark/commit/51b306b930cfe03ad21af72a3a6ef31e6e626235 When an RDD has no partition, Python sum will throw Can not reduce() empty RDD Key: SPARK-8373 URL: https://issues.apache.org/jira/browse/SPARK-8373 Project: Spark Issue Type: Bug Components: PySpark Reporter: Shixiong Zhu The issue is because sum uses reduce. Replacing it with fold will fix it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
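The same distinction exists in the Scala RDD API and makes the suggested fix easy to see; a small sketch, assuming an active SparkContext named sc:

{code}
// reduce has no identity element, so it throws on an RDD with no elements (or no
// partitions); fold starts from the supplied zero value and simply returns it.
val empty = sc.emptyRDD[Int]
// empty.reduce(_ + _)            // java.lang.UnsupportedOperationException: empty collection
val total = empty.fold(0)(_ + _)  // 0 - a sum built on fold handles the empty case
{code}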
[jira] [Commented] (SPARK-6666) org.apache.spark.sql.jdbc.JDBCRDD does not escape/quote column names
[ https://issues.apache.org/jira/browse/SPARK-6666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14585883#comment-14585883 ] Santiago M. Mola commented on SPARK-6666: - I opened SPARK-8377 to track the general case, since I have this problem with other data sources, not just JDBC. org.apache.spark.sql.jdbc.JDBCRDD does not escape/quote column names - Key: SPARK-6666 URL: https://issues.apache.org/jira/browse/SPARK-6666 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0 Environment: Reporter: John Ferguson Priority: Critical Is there a way to have JDBC DataFrames use quoted/escaped column names? Right now, it looks like it sees the names correctly in the schema created but does not escape them in the SQL it creates when they are not compliant: org.apache.spark.sql.jdbc.JDBCRDD private val columnList: String = { val sb = new StringBuilder() columns.foreach(x => sb.append(",").append(x)) if (sb.length == 0) "1" else sb.substring(1) } If you see value in this, I would take a shot at adding the quoting (escaping) of column names here. If you don't do it, some drivers... like postgresql's will simply lower-case all names when parsing the query. As you can see in the TL;DR below, that means they won't match the schema I am given. TL;DR: I am able to connect to a Postgres database in the shell (with driver referenced): val jdbcDf = sqlContext.jdbc("jdbc:postgresql://localhost/sparkdemo?user=dbuser", "sp500") In fact when I run: jdbcDf.registerTempTable("sp500") val avgEPSNamed = sqlContext.sql("SELECT AVG(`Earnings/Share`) as AvgCPI FROM sp500") and val avgEPSProg = jsonDf.agg(avg(jsonDf.col("Earnings/Share"))) The values come back as expected. However, if I try: jdbcDf.show Or if I try val all = sqlContext.sql("SELECT * FROM sp500") all.show I get errors about column names not being found. In fact the error includes a mention of column names all lower cased. For now I will change my schema to be more restrictive. Right now it is, per a Stack Overflow poster, not ANSI compliant, by doing things that are allowed by double quotes in pgsql, MySQL and SQLServer. BTW, our users are giving us tables like this... because various tools they already use support non-compliant names. In fact, this is mild compared to what we've had to support. Currently the schema in question uses mixed-case, quoted names with special characters and spaces: CREATE TABLE sp500 ( "Symbol" text, "Name" text, "Sector" text, "Price" double precision, "Dividend Yield" double precision, "Price/Earnings" double precision, "Earnings/Share" double precision, "Book Value" double precision, "52 week low" double precision, "52 week high" double precision, "Market Cap" double precision, "EBITDA" double precision, "Price/Sales" double precision, "Price/Book" double precision, "SEC Filings" text ) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
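A hedged sketch of the quoting the reporter volunteers to add - illustrative only, not the actual JDBCRDD change, and it hard-codes the double-quote identifier delimiter rather than asking the JDBC dialect:

{code}
// Quote each column name (doubling any embedded quote) before joining them into the SELECT list.
def columnList(columns: Seq[String]): String =
  if (columns.isEmpty) "1"
  else columns.map(c => "\"" + c.replace("\"", "\"\"") + "\"").mkString(",")

// columnList(Seq("Earnings/Share", "52 week low")) produces the SQL text:
//   "Earnings/Share","52 week low"
{code}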
[jira] [Commented] (SPARK-4644) Implement skewed join
[ https://issues.apache.org/jira/browse/SPARK-4644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14586076#comment-14586076 ] Nathan McCarthy commented on SPARK-4644: Something like this to make working with skewed data in Spark easier would be very helpful. Implement skewed join - Key: SPARK-4644 URL: https://issues.apache.org/jira/browse/SPARK-4644 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Shixiong Zhu Attachments: Skewed Join Design Doc.pdf Skewed data is not rare. For example, a book recommendation site may have several books which are liked by most of the users. Running ALS on such skewed data will raise an OutOfMemory error if some book has too many users to fit into memory. To solve it, we propose a skewed join implementation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
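One common mitigation - stated here only as a generic technique, not necessarily what the attached design doc proposes - is to salt the hot keys so one key's rows spread over several partitions, then drop the salt after the join. A plain-Scala sketch over in-memory collections:

{code}
import scala.util.Random

// Left rows get a random salt; right rows are replicated once per salt value so every
// salted left key still finds its matches. Suitable when the right side is small enough
// to replicate a few times.
def saltedJoin[K, V, W](left: Seq[(K, V)], right: Seq[(K, W)], salts: Int): Seq[(K, (V, W))] = {
  val saltedLeft      = left.map { case (k, v) => ((k, Random.nextInt(salts)), v) }
  val replicatedRight = right.flatMap { case (k, w) => (0 until salts).map(s => ((k, s), w)) }
  val rightBySaltedKey = replicatedRight.groupBy(_._1)
  saltedLeft.flatMap { case (saltedKey @ (k, _), v) =>
    rightBySaltedKey.getOrElse(saltedKey, Nil).map { case (_, w) => (k, (v, w)) }
  }
}
{code}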
[jira] [Created] (SPARK-8377) Identifiers caseness information should be available at any time
Santiago M. Mola created SPARK-8377: --- Summary: Identifiers caseness information should be available at any time Key: SPARK-8377 URL: https://issues.apache.org/jira/browse/SPARK-8377 Project: Spark Issue Type: Improvement Components: SQL Reporter: Santiago M. Mola Currently, we have the option of having a case sensitive catalog or not. A case insensitive catalog just lowercases all identifiers. However, when pushing down to a data source, we lose the information about if an identifier should be case insensitive or strictly lowercase. Ideally, we would be able to distinguish a case insensitive identifier from a case sensitive one. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
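A hypothetical shape for this - not Spark's API - would be to carry the caseness flag with the identifier itself instead of lowercasing eagerly, so a data source can still decide how to match:

{code}
// Sketch only: an identifier that remembers whether it must be matched exactly.
case class Identifier(name: String, caseSensitive: Boolean) {
  def matches(other: String): Boolean =
    if (caseSensitive) name == other else name.equalsIgnoreCase(other)
}

// Identifier("Earnings", caseSensitive = false).matches("earnings")  // true
// Identifier("Earnings", caseSensitive = true).matches("earnings")   // false
{code}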
[jira] [Resolved] (SPARK-8375) BinaryClassificationMetrics in ML Lib has odd API
[ https://issues.apache.org/jira/browse/SPARK-8375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-8375. -- Resolution: Invalid @sam This is a discussion for the mailing list rather than a JIRA. https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark You're looking at an API from 4 versions ago, too. https://spark.apache.org/docs/1.4.0/api/scala/index.html#org.apache.spark.mllib.evaluation.BinaryClassificationMetrics The input are scores and ground-truth labels. I agree with the problem of many distinct values, but, this is part of the newer API. BinaryClassificationMetrics in ML Lib has odd API - Key: SPARK-8375 URL: https://issues.apache.org/jira/browse/SPARK-8375 Project: Spark Issue Type: Bug Components: MLlib Reporter: sam According to https://spark.apache.org/docs/1.0.0/api/scala/index.html#org.apache.spark.mllib.evaluation.BinaryClassificationMetrics The constructor takes `RDD[(Double, Double)]` which does not make sense it should be `RDD[(Double, T)]` or at least `RDD[(Double, Int)]`. In scikit I believe they use the number of unique scores to determine the number of thresholds and the ROC. I assume this is what BinaryClassificationMetrics is doing since it makes no mention of buckets. In a Big Data context this does not make sense as the number of unique scores may be huge. Rather user should be able to either specify the number of buckets, or the number of data points in each bucket. E.g. `def roc(numPtsPerBucket: Int)` Finally it would then be good if either the ROC output type was changed or another method was added that returned confusion matricies, so that the hard integer values can be obtained. E.g. ``` case class Confusion(tp: Int, fp: Int, fn: Int, tn: Int) { // bunch of methods for each of the things in the table here https://en.wikipedia.org/wiki/Receiver_operating_characteristic } ... def confusions(numPtsPerBucket: Int): RDD[Confusion] ``` -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8373) When an RDD has no partition, Python sum will throw Can not reduce() empty RDD
[ https://issues.apache.org/jira/browse/SPARK-8373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-8373: - Priority: Minor (was: Major) When an RDD has no partition, Python sum will throw Can not reduce() empty RDD Key: SPARK-8373 URL: https://issues.apache.org/jira/browse/SPARK-8373 Project: Spark Issue Type: Bug Components: PySpark Reporter: Shixiong Zhu Priority: Minor The issue is because sum uses reduce. Replacing it with fold will fix it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8253) string function: ltrim
[ https://issues.apache.org/jira/browse/SPARK-8253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8253: --- Assignee: Apache Spark (was: Cheng Hao) string function: ltrim -- Key: SPARK-8253 URL: https://issues.apache.org/jira/browse/SPARK-8253 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Apache Spark ltrim(string A): string Returns the string resulting from trimming spaces from the beginning(left hand side) of A. For example, ltrim(' foobar ') results in 'foobar '. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8260) string function: rtrim
[ https://issues.apache.org/jira/browse/SPARK-8260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14585934#comment-14585934 ] Apache Spark commented on SPARK-8260: - User 'chenghao-intel' has created a pull request for this issue: https://github.com/apache/spark/pull/6762 string function: rtrim -- Key: SPARK-8260 URL: https://issues.apache.org/jira/browse/SPARK-8260 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Cheng Hao rtrim(string A): string Returns the string resulting from trimming spaces from the end(right hand side) of A. For example, rtrim(' foobar ') results in ' foobar'. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8267) string function: trim
[ https://issues.apache.org/jira/browse/SPARK-8267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8267: --- Assignee: Apache Spark (was: Cheng Hao) string function: trim - Key: SPARK-8267 URL: https://issues.apache.org/jira/browse/SPARK-8267 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Apache Spark trim(string A): string Returns the string resulting from trimming spaces from both ends of A. For example, trim(' foobar ') results in 'foobar' -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
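The semantics described in these ltrim/rtrim/trim sub-tasks are easy to pin down with a small plain-Scala sketch; this is illustrative only, not the Spark SQL implementation:

{code}
// Trim only space characters, matching the examples in the sub-task descriptions.
def ltrim(s: String): String = s.dropWhile(_ == ' ')
def rtrim(s: String): String = s.reverse.dropWhile(_ == ' ').reverse
def trim(s: String): String  = ltrim(rtrim(s))

// ltrim(" foobar ") == "foobar "
// rtrim(" foobar ") == " foobar"
// trim(" foobar ")  == "foobar"
{code}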
[jira] [Assigned] (SPARK-8260) string function: rtrim
[ https://issues.apache.org/jira/browse/SPARK-8260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8260: --- Assignee: Apache Spark (was: Cheng Hao) string function: rtrim -- Key: SPARK-8260 URL: https://issues.apache.org/jira/browse/SPARK-8260 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Apache Spark rtrim(string A): string Returns the string resulting from trimming spaces from the end(right hand side) of A. For example, rtrim(' foobar ') results in ' foobar'. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8267) string function: trim
[ https://issues.apache.org/jira/browse/SPARK-8267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8267: --- Assignee: Cheng Hao (was: Apache Spark) string function: trim - Key: SPARK-8267 URL: https://issues.apache.org/jira/browse/SPARK-8267 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Cheng Hao trim(string A): string Returns the string resulting from trimming spaces from both ends of A. For example, trim(' foobar ') results in 'foobar' -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8253) string function: ltrim
[ https://issues.apache.org/jira/browse/SPARK-8253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14585933#comment-14585933 ] Apache Spark commented on SPARK-8253: - User 'chenghao-intel' has created a pull request for this issue: https://github.com/apache/spark/pull/6762 string function: ltrim -- Key: SPARK-8253 URL: https://issues.apache.org/jira/browse/SPARK-8253 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Cheng Hao ltrim(string A): string Returns the string resulting from trimming spaces from the beginning(left hand side) of A. For example, ltrim(' foobar ') results in 'foobar '. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8253) string function: ltrim
[ https://issues.apache.org/jira/browse/SPARK-8253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8253: --- Assignee: Cheng Hao (was: Apache Spark) string function: ltrim -- Key: SPARK-8253 URL: https://issues.apache.org/jira/browse/SPARK-8253 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Cheng Hao ltrim(string A): string Returns the string resulting from trimming spaces from the beginning(left hand side) of A. For example, ltrim(' foobar ') results in 'foobar '. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8267) string function: trim
[ https://issues.apache.org/jira/browse/SPARK-8267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14585935#comment-14585935 ] Apache Spark commented on SPARK-8267: - User 'chenghao-intel' has created a pull request for this issue: https://github.com/apache/spark/pull/6762 string function: trim - Key: SPARK-8267 URL: https://issues.apache.org/jira/browse/SPARK-8267 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Cheng Hao trim(string A): string Returns the string resulting from trimming spaces from both ends of A. For example, trim(' foobar ') results in 'foobar' -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8260) string function: rtrim
[ https://issues.apache.org/jira/browse/SPARK-8260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8260: --- Assignee: Cheng Hao (was: Apache Spark) string function: rtrim -- Key: SPARK-8260 URL: https://issues.apache.org/jira/browse/SPARK-8260 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Cheng Hao rtrim(string A): string Returns the string resulting from trimming spaces from the end(right hand side) of A. For example, rtrim(' foobar ') results in ' foobar'. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8376) Commons Lang 3 is one of the required JAR of Spark Flume Sink but is missing in the docs
[ https://issues.apache.org/jira/browse/SPARK-8376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14585833#comment-14585833 ] Apache Spark commented on SPARK-8376: - User 'zsxwing' has created a pull request for this issue: https://github.com/apache/spark/pull/6829 Commons Lang 3 is one of the required JAR of Spark Flume Sink but is missing in the docs Key: SPARK-8376 URL: https://issues.apache.org/jira/browse/SPARK-8376 Project: Spark Issue Type: Bug Components: Documentation Reporter: Shixiong Zhu Priority: Minor Commons Lang 3 is added as one of the dependencies of Spark Flume Sink since https://github.com/apache/spark/pull/5703. However, the docs has not yet updated. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8376) Commons Lang 3 is one of the required JAR of Spark Flume Sink but is missing in the docs
[ https://issues.apache.org/jira/browse/SPARK-8376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8376: --- Assignee: Apache Spark Commons Lang 3 is one of the required JAR of Spark Flume Sink but is missing in the docs Key: SPARK-8376 URL: https://issues.apache.org/jira/browse/SPARK-8376 Project: Spark Issue Type: Bug Components: Documentation Reporter: Shixiong Zhu Assignee: Apache Spark Priority: Minor Commons Lang 3 is added as one of the dependencies of Spark Flume Sink since https://github.com/apache/spark/pull/5703. However, the docs has not yet updated. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8376) Commons Lang 3 is one of the required JAR of Spark Flume Sink but is missing in the docs
[ https://issues.apache.org/jira/browse/SPARK-8376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8376: --- Assignee: (was: Apache Spark) Commons Lang 3 is one of the required JAR of Spark Flume Sink but is missing in the docs Key: SPARK-8376 URL: https://issues.apache.org/jira/browse/SPARK-8376 Project: Spark Issue Type: Bug Components: Documentation Reporter: Shixiong Zhu Priority: Minor Commons Lang 3 is added as one of the dependencies of Spark Flume Sink since https://github.com/apache/spark/pull/5703. However, the docs has not yet updated. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8378) Add Spark Flume Python API
[ https://issues.apache.org/jira/browse/SPARK-8378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8378: --- Assignee: (was: Apache Spark) Add Spark Flume Python API -- Key: SPARK-8378 URL: https://issues.apache.org/jira/browse/SPARK-8378 Project: Spark Issue Type: Improvement Components: Streaming Reporter: Shixiong Zhu -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8378) Add Spark Flume Python API
[ https://issues.apache.org/jira/browse/SPARK-8378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8378: --- Assignee: Apache Spark Add Spark Flume Python API -- Key: SPARK-8378 URL: https://issues.apache.org/jira/browse/SPARK-8378 Project: Spark Issue Type: Improvement Components: Streaming Reporter: Shixiong Zhu Assignee: Apache Spark -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-8350) R unit tests output should be logged to unit-tests.log
[ https://issues.apache.org/jira/browse/SPARK-8350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shivaram Venkataraman resolved SPARK-8350. -- Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 6807 [https://github.com/apache/spark/pull/6807] R unit tests output should be logged to unit-tests.log Key: SPARK-8350 URL: https://issues.apache.org/jira/browse/SPARK-8350 Project: Spark Issue Type: Bug Components: R Affects Versions: 1.4.0 Reporter: Andrew Or Assignee: Andrew Or Priority: Minor Fix For: 1.5.0 Right now it's logged to R-unit-tests.log. Jenkins currently only archives files named unit-tests.log, and this is what all other modules (e.g. SQL, network, REPL) use. 1. We should be consistent 2. I don't want to reconfigure Jenkins to accept a different file -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-5081) Shuffle write increases
[ https://issues.apache.org/jira/browse/SPARK-5081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Roi Reshef updated SPARK-5081: -- Comment: was deleted (was: Hi Guys, Was this issue already solved by any chance? I'm using Spark 1.3.1 for training algorithm with an iterative fashion. Since implementing a ranking measure (that ultimately uses sortBy) i'm experiencing similar problems. It seems that my cache explodes after ~100 iterations, and crushes the server with a There is insufficient memory for the Java Runtime Environment to continue message. Note that it isn't supposed to persist the sorted vectors nor to use them in the following iterations. So I wonder why memory consumption keeps growing with each iteration.) Shuffle write increases --- Key: SPARK-5081 URL: https://issues.apache.org/jira/browse/SPARK-5081 Project: Spark Issue Type: Bug Components: Shuffle Affects Versions: 1.2.0 Reporter: Kevin Jung Priority: Critical Attachments: Spark_Debug.pdf, diff.txt The size of shuffle write showing in spark web UI is much different when I execute same spark job with same input data in both spark 1.1 and spark 1.2. At sortBy stage, the size of shuffle write is 98.1MB in spark 1.1 but 146.9MB in spark 1.2. I set spark.shuffle.manager option to hash because it's default value is changed but spark 1.2 still writes shuffle output more than spark 1.1. It can increase disk I/O overhead exponentially as the input file gets bigger and it causes the jobs take more time to complete. In the case of about 100GB input, for example, the size of shuffle write is 39.7GB in spark 1.1 but 91.0GB in spark 1.2. spark 1.1 ||Stage Id||Description||Input||Shuffle Read||Shuffle Write|| |9|saveAsTextFile| |1169.4KB| | |12|combineByKey| |1265.4KB|1275.0KB| |6|sortByKey| |1276.5KB| | |8|mapPartitions| |91.0MB|1383.1KB| |4|apply| |89.4MB| | |5|sortBy|155.6MB| |98.1MB| |3|sortBy|155.6MB| | | |1|collect| |2.1MB| | |2|mapValues|155.6MB| |2.2MB| |0|first|184.4KB| | | spark 1.2 ||Stage Id||Description||Input||Shuffle Read||Shuffle Write|| |12|saveAsTextFile| |1170.2KB| | |11|combineByKey| |1264.5KB|1275.0KB| |8|sortByKey| |1273.6KB| | |7|mapPartitions| |134.5MB|1383.1KB| |5|zipWithIndex| |132.5MB| | |4|sortBy|155.6MB| |146.9MB| |3|sortBy|155.6MB| | | |2|collect| |2.0MB| | |1|mapValues|155.6MB| |2.2MB| |0|first|184.4KB| | | -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8378) Add Spark Flume Python API
[ https://issues.apache.org/jira/browse/SPARK-8378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14586144#comment-14586144 ] Apache Spark commented on SPARK-8378: - User 'zsxwing' has created a pull request for this issue: https://github.com/apache/spark/pull/6830 Add Spark Flume Python API -- Key: SPARK-8378 URL: https://issues.apache.org/jira/browse/SPARK-8378 Project: Spark Issue Type: Improvement Components: Streaming Reporter: Shixiong Zhu -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8335) DecisionTreeModel.predict() return type not convenient!
[ https://issues.apache.org/jira/browse/SPARK-8335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14586248#comment-14586248 ] Sebastian Walz commented on SPARK-8335: --- Yeah, I am sure that it is really a scala.Double. I just looked it up again on GitHub. So the problem still exists on the current master branch. DecisionTreeModel.predict() return type not convenient! --- Key: SPARK-8335 URL: https://issues.apache.org/jira/browse/SPARK-8335 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.3.1 Reporter: Sebastian Walz Priority: Minor Labels: easyfix, machine_learning Original Estimate: 10m Remaining Estimate: 10m org.apache.spark.mllib.tree.model.DecisionTreeModel has a predict method: def predict(features: JavaRDD[Vector]): JavaRDD[Double] The problem here is the generic type of the return type JavaRDD[Double], because it is a scala.Double and I would expect a java.lang.Double (to be consistent, e.g., with org.apache.spark.mllib.classification.ClassificationModel). I wanted to extend the DecisionTreeModel and use it only for binary classification and wanted to implement the trait org.apache.spark.mllib.classification.ClassificationModel. But it is not possible because the ClassificationModel already defines the predict method, but with a return type JavaRDD[java.lang.Double]. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
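The incompatibility the reporter describes can be shown without Spark at all; a hypothetical sketch in which java.util.List stands in for JavaRDD and the traits are stand-ins, not the MLlib classes:

{code}
// scala.Double and java.lang.Double are different type arguments, and the container is
// invariant, so the inherited predict cannot satisfy the trait's abstract method.
trait ClassificationLike {
  def predict(features: java.util.List[java.lang.Double]): java.util.List[java.lang.Double]
}

class TreeLike {
  def predict(features: java.util.List[java.lang.Double]): java.util.List[scala.Double] = ???
}

// class BinaryTree extends TreeLike with ClassificationLike
// fails to compile: the inherited predict returns java.util.List[scala.Double],
// which does not conform to the java.util.List[java.lang.Double] required by the trait.
{code}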
[jira] [Created] (SPARK-8378) Add Spark Flume Python API
Shixiong Zhu created SPARK-8378: --- Summary: Add Spark Flume Python API Key: SPARK-8378 URL: https://issues.apache.org/jira/browse/SPARK-8378 Project: Spark Issue Type: Improvement Components: Streaming Reporter: Shixiong Zhu -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-8379) LeaseExpiredException when using dynamic partition with speculative execution
jeanlyn created SPARK-8379: -- Summary: LeaseExpiredException when using dynamic partition with speculative execution Key: SPARK-8379 URL: https://issues.apache.org/jira/browse/SPARK-8379 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.4.0, 1.3.1, 1.3.0 Reporter: jeanlyn When inserting into a table using dynamic partitions with *spark.speculation=true*, skewed data in some partitions can trigger speculative tasks, and the insert then throws an exception like {code} org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException): Lease mismatch on /tmp/hive-jeanlyn/hive_2015-06-15_15-20-44_734_8801220787219172413-1/-ext-1/ds=2015-06-15/type=2/part-00301.lzo owned by DFSClient_attempt_201506031520_0011_m_000189_0_-1513487243_53 but is accessed by DFSClient_attempt_201506031520_0011_m_42_0_-1275047721_57 {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7104) Support model save/load in Python's Word2Vec
[ https://issues.apache.org/jira/browse/SPARK-7104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14585488#comment-14585488 ] Yu Ishikawa commented on SPARK-7104: It would be nice to refactor Python's Word2Vec, though that would fit better in another issue: we can call Scala's model API directly, instead of going through {{Word2VecModelWrapper}}'s API, for better maintainability. Support model save/load in Python's Word2Vec Key: SPARK-7104 URL: https://issues.apache.org/jira/browse/SPARK-7104 Project: Spark Issue Type: Sub-task Components: MLlib, PySpark Reporter: Joseph K. Bradley Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7550) Support setting the right schema serde when writing to Hive metastore
[ https://issues.apache.org/jira/browse/SPARK-7550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14585489#comment-14585489 ] Apache Spark commented on SPARK-7550: - User 'chenghao-intel' has created a pull request for this issue: https://github.com/apache/spark/pull/5733 Support setting the right schema serde when writing to Hive metastore --- Key: SPARK-7550 URL: https://issues.apache.org/jira/browse/SPARK-7550 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.4.0 Reporter: Reynold Xin Assignee: Cheng Hao As of 1.4, Spark SQL does not properly set the table schema and serde when writing a table to Hive's metastore. Would be great to do that properly so users can use non-Spark SQL systems to read those tables. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8379) LeaseExpiredException when using dynamic partition with speculative execution
[ https://issues.apache.org/jira/browse/SPARK-8379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8379: --- Assignee: Apache Spark LeaseExpiredException when using dynamic partition with speculative execution - Key: SPARK-8379 URL: https://issues.apache.org/jira/browse/SPARK-8379 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0, 1.3.1, 1.4.0 Reporter: jeanlyn Assignee: Apache Spark when inserting to table using dynamic partitions with *spark.speculation=true* and there is a skew data of some partitions trigger the speculative tasks ,it will throws the exception like {code} org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException): Lease mismatch on /tmp/hive-jeanlyn/hive_2015-06-15_15-20-44_734_8801220787219172413-1/-ext-1/ds=2015-06-15/type=2/part-00301.lzo owned by DFSClient_attempt_201506031520_0011_m_000189_0_-1513487243_53 but is accessed by DFSClient_attempt_201506031520_0011_m_42_0_-1275047721_57 {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8379) LeaseExpiredException when using dynamic partition with speculative execution
[ https://issues.apache.org/jira/browse/SPARK-8379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8379: --- Assignee: (was: Apache Spark) LeaseExpiredException when using dynamic partition with speculative execution - Key: SPARK-8379 URL: https://issues.apache.org/jira/browse/SPARK-8379 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0, 1.3.1, 1.4.0 Reporter: jeanlyn when inserting to table using dynamic partitions with *spark.speculation=true* and there is a skew data of some partitions trigger the speculative tasks ,it will throws the exception like {code} org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException): Lease mismatch on /tmp/hive-jeanlyn/hive_2015-06-15_15-20-44_734_8801220787219172413-1/-ext-1/ds=2015-06-15/type=2/part-00301.lzo owned by DFSClient_attempt_201506031520_0011_m_000189_0_-1513487243_53 but is accessed by DFSClient_attempt_201506031520_0011_m_42_0_-1275047721_57 {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8379) LeaseExpiredException when using dynamic partition with speculative execution
[ https://issues.apache.org/jira/browse/SPARK-8379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14587322#comment-14587322 ] Apache Spark commented on SPARK-8379: - User 'jeanlyn' has created a pull request for this issue: https://github.com/apache/spark/pull/6833 LeaseExpiredException when using dynamic partition with speculative execution - Key: SPARK-8379 URL: https://issues.apache.org/jira/browse/SPARK-8379 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0, 1.3.1, 1.4.0 Reporter: jeanlyn when inserting to table using dynamic partitions with *spark.speculation=true* and there is a skew data of some partitions trigger the speculative tasks ,it will throws the exception like {code} org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException): Lease mismatch on /tmp/hive-jeanlyn/hive_2015-06-15_15-20-44_734_8801220787219172413-1/-ext-1/ds=2015-06-15/type=2/part-00301.lzo owned by DFSClient_attempt_201506031520_0011_m_000189_0_-1513487243_53 but is accessed by DFSClient_attempt_201506031520_0011_m_42_0_-1275047721_57 {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-8386) DataFrame and JDBC regression
Peter Haumer created SPARK-8386: --- Summary: DataFrame and JDBC regression Key: SPARK-8386 URL: https://issues.apache.org/jira/browse/SPARK-8386 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.4.0 Environment: RHEL 7.1 Reporter: Peter Haumer Priority: Critical I have an ETL app that appends new results to a JDBC table at each run. In 1.3.1 I did this: testResultsDF.insertIntoJDBC(CONNECTION_URL, TABLE_NAME, false); When I do this now in 1.4 it complains that the object 'TABLE_NAME' already exists; I get this even if I switch the overwrite flag to true. I also tried this now: testResultsDF.write().mode(SaveMode.Append).jdbc(CONNECTION_URL, TABLE_NAME, connectionProperties); and get the same error. It works the first time it is run, creating the new table and adding data successfully, but on the second run the JDBC driver tells me that the table already exists. Even SaveMode.Overwrite gives me the same error. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-6724) Model import/export for FPGrowth
[ https://issues.apache.org/jira/browse/SPARK-6724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14583139#comment-14583139 ] Hrishikesh edited comment on SPARK-6724 at 6/16/15 4:22 AM: [~josephkb], please assign this ticket to me. was (Author: hrishikesh91): [~josephkb], please assign this ticket to me. Model import/export for FPGrowth Key: SPARK-6724 URL: https://issues.apache.org/jira/browse/SPARK-6724 Project: Spark Issue Type: Sub-task Components: MLlib Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Priority: Minor Note: experimental model API -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6932) A Prototype of Parameter Server
[ https://issues.apache.org/jira/browse/SPARK-6932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14587299#comment-14587299 ] zhangyouhua commented on SPARK-6932: @Qiping Li in your idea the PS client runs on the slave nodes, but where will the PS server run or be deployed? A Prototype of Parameter Server --- Key: SPARK-6932 URL: https://issues.apache.org/jira/browse/SPARK-6932 Project: Spark Issue Type: New Feature Components: ML, MLlib, Spark Core Reporter: Qiping Li h2. Introduction As specified in [SPARK-4590|https://issues.apache.org/jira/browse/SPARK-4590], it would be very helpful to integrate a parameter server into Spark for machine learning algorithms, especially those with ultra-high-dimensional features. After carefully studying the design doc of [Parameter Servers|https://docs.google.com/document/d/1SX3nkmF41wFXAAIr9BgqvrHSS5mW362fJ7roBXJm06o/edit?usp=sharing] and the paper on [Factorbird|http://stanford.edu/~rezab/papers/factorbird.pdf], we propose a prototype of Parameter Server on Spark (Ps-on-Spark), with several key design concerns: * *User friendly interface* We carefully investigated most existing Parameter Server systems (including [petuum|http://petuum.github.io], [parameter server|http://parameterserver.org], [paracel|https://github.com/douban/paracel]) and designed a user-friendly interface by absorbing the essence of all these systems. * *Prototype of distributed array* IndexRDD (see [SPARK-4590|https://issues.apache.org/jira/browse/SPARK-4590]) doesn't seem to be a good option for the distributed array, because in most cases the number of key updates per second is not very high. So we implement a distributed HashMap to store the parameters, which can easily be extended for better performance. * *Minimal code change* Quite a lot of effort is spent on avoiding changes to Spark core. Tasks which need the parameter server are still created and scheduled by Spark's scheduler, and they communicate with the parameter server through a client object, over *akka* or *netty*. With all these concerns we propose the following architecture: h2. Architecture !https://cloud.githubusercontent.com/assets/1285855/7158179/f2d25cc4-e3a9-11e4-835e-89681596c478.jpg! Data is stored in RDDs and is partitioned across workers. During each iteration, each worker gets parameters from the parameter server, then computes new parameters based on the old parameters and the data in its partition. Finally, each worker pushes the updated parameters back to the parameter server. A worker communicates with the parameter server through a parameter server client, which is initialized in the `TaskContext` of that worker. The current implementation is based on YARN cluster mode, but it should not be a problem to transplant it to other modes. h3. Interface We referred to existing parameter server systems (petuum, parameter server, paracel) when designing the interface of the parameter server. *`PSClient` provides the following interface for workers to use:* {code} // get parameter indexed by key from parameter server def get[T](key: String): T // get multiple parameters from parameter server def multiGet[T](keys: Array[String]): Array[T] // add `delta` to the parameter indexed by `key`; // if there are multiple `delta`s to apply to the same parameter, // use `reduceFunc` to reduce these `delta`s first. def update[T](key: String, delta: T, reduceFunc: (T, T) => T): Unit // update multiple parameters at the same time, using the same `reduceFunc`.
def multiUpdate[T](keys: Array[String], delta: Array[T], reduceFunc: (T, T) => T): Unit // advance clock to indicate that the current iteration is finished. def clock(): Unit // block until all workers have reached this line of code. def sync(): Unit {code} *`PSContext` provides the following functions to use on the driver:* {code} // load parameters from an existing RDD. def loadPSModel[T](model: RDD[(String, T)]): Unit // fetch parameters from the parameter server to construct a model. def fetchPSModel[T](keys: Array[String]): Array[T] {code} *A new function has been added to `RDD` to run parameter server tasks:* {code} // run the provided `func` on each partition of this RDD. // This function can use the data of this partition (the first argument) // and a parameter server client (the second argument). // See the following Logistic Regression for an example. def runWithPS[U: ClassTag](func: (Array[T], PSClient) => U): Array[U] {code} h2. Example Here is an example of using our prototype to implement logistic regression: {code:title=LogisticRegression.scala|borderStyle=solid} def train( sc: SparkContext, input: RDD[LabeledPoint], numIterations: Int, stepSize: Double, miniBatchFraction: Double): LogisticRegressionModel = { //
[jira] [Updated] (SPARK-8368) ClassNotFoundException in closure for map
[ https://issues.apache.org/jira/browse/SPARK-8368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] CHEN Zhiwei updated SPARK-8368: --- Description: After upgrading the cluster from Spark 1.3.0 to 1.4.0 (rc4), I encountered the following exception: ==begin exception {quote} Exception in thread "main" java.lang.ClassNotFoundException: com.yhd.ycache.magic.Model$$anonfun$9$$anonfun$10 at java.net.URLClassLoader$1.run(URLClassLoader.java:366) at java.net.URLClassLoader$1.run(URLClassLoader.java:355) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:354) at java.lang.ClassLoader.loadClass(ClassLoader.java:425) at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308) at java.lang.ClassLoader.loadClass(ClassLoader.java:358) at java.lang.Class.forName0(Native Method) at java.lang.Class.forName(Class.java:278) at org.apache.spark.util.InnerClosureFinder$$anon$4.visitMethodInsn(ClosureCleaner.scala:455) at com.esotericsoftware.reflectasm.shaded.org.objectweb.asm.ClassReader.accept(Unknown Source) at com.esotericsoftware.reflectasm.shaded.org.objectweb.asm.ClassReader.accept(Unknown Source) at org.apache.spark.util.ClosureCleaner$.getInnerClosureClasses(ClosureCleaner.scala:101) at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:197) at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:132) at org.apache.spark.SparkContext.clean(SparkContext.scala:1891) at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:294) at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:293) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:148) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:109) at org.apache.spark.rdd.RDD.withScope(RDD.scala:286) at org.apache.spark.rdd.RDD.map(RDD.scala:293) at org.apache.spark.sql.DataFrame.map(DataFrame.scala:1210) at com.yhd.ycache.magic.Model$.main(SSExample.scala:239) at com.yhd.ycache.magic.Model.main(SSExample.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:664) at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:169) at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:192) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:111) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) {quote} ===end exception=== I simplified the code that causes this issue to the following: ==begin code== {noformat} object Model extends Serializable{ def main(args: Array[String]) { val Array(sql) = args val sparkConf = new SparkConf().setAppName("Mode Example") val sc = new SparkContext(sparkConf) val hive = new HiveContext(sc) //get data by hive sql val rows = hive.sql(sql) val data = rows.map(r => { val arr = r.toSeq.toArray val label = 1.0 def fmap = ( input: Any ) => 1.0 val feature = arr.map(_ => 1.0) LabeledPoint(label, Vectors.dense(feature)) }) data.count() } } {noformat} =end code=== This code runs fine in spark-shell, but errors out when submitted to a Spark cluster (standalone or local mode). I tried the same code on Spark 1.3.0 (local mode), and no exception is encountered. 
was: After upgraded the cluster from spark 1.3.0 to 1.4.0(rc4), I encountered the following exception: ==begin exception {quote} Exception in thread main java.lang.ClassNotFoundException: com.yhd.ycache.magic.Model$$anonfun$9$$anonfun$10 at java.net.URLClassLoader$1.run(URLClassLoader.java:366) at java.net.URLClassLoader$1.run(URLClassLoader.java:355) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:354) at java.lang.ClassLoader.loadClass(ClassLoader.java:425) at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308) at java.lang.ClassLoader.loadClass(ClassLoader.java:358) at java.lang.Class.forName0(Native Method) at java.lang.Class.forName(Class.java:278) at org.apache.spark.util.InnerClosureFinder$$anon$4.visitMethodInsn(ClosureCleaner.scala:455) at
[jira] [Commented] (SPARK-8275) HistoryServer caches incomplete App UIs
[ https://issues.apache.org/jira/browse/SPARK-8275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14587360#comment-14587360 ] Carson Wang commented on SPARK-8275: This seems to be the same issue as SPARK-7889 HistoryServer caches incomplete App UIs --- Key: SPARK-8275 URL: https://issues.apache.org/jira/browse/SPARK-8275 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 1.3.1 Reporter: Steve Loughran The history server caches applications retrieved from the {{ApplicationHistoryProvider.getAppUI()}} call for performance: it's expensive to rebuild. However, this cache also includes incomplete applications as well as completed ones, and it never attempts to refresh the incomplete application. As a result, if you do a GET of the history of a running application, even after the application is finished you'll still get the web UI/history as it was when that first GET was issued. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
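A minimal sketch of the caching behavior described above, using a plain mutable map and hypothetical types rather than the actual HistoryServer code:
{code}
import scala.collection.mutable

// Hypothetical stand-ins for the provider and the rendered UI.
case class AppUI(appId: String, complete: Boolean)
trait HistoryProvider { def getAppUI(appId: String): AppUI }

class NaiveUiCache(provider: HistoryProvider) {
  private val cache = mutable.Map.empty[String, AppUI]

  // The first GET builds and caches the UI; later GETs return the cached copy,
  // even if the app was still running (incomplete) when it was first cached.
  def get(appId: String): AppUI =
    cache.getOrElseUpdate(appId, provider.getAppUI(appId))
}
{code}
A fix along the lines the description implies would either avoid caching incomplete applications or invalidate and rebuild their cached entries once the application completes.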
[jira] [Assigned] (SPARK-8387) [SPARK][Web-UI] Only show 4096 bytes content for executor log instead all
[ https://issues.apache.org/jira/browse/SPARK-8387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8387: --- Assignee: (was: Apache Spark) [SPARK][Web-UI] Only show 4096 bytes content for executor log instead all - Key: SPARK-8387 URL: https://issues.apache.org/jira/browse/SPARK-8387 Project: Spark Issue Type: Improvement Components: Web UI Affects Versions: 1.4.0 Reporter: SuYan Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8387) [SPARK][Web-UI] Only show 4096 bytes content for executor log instead all
[ https://issues.apache.org/jira/browse/SPARK-8387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14587421#comment-14587421 ] Apache Spark commented on SPARK-8387: - User 'suyanNone' has created a pull request for this issue: https://github.com/apache/spark/pull/6834 [SPARK][Web-UI] Only show 4096 bytes content for executor log instead all - Key: SPARK-8387 URL: https://issues.apache.org/jira/browse/SPARK-8387 Project: Spark Issue Type: Improvement Components: Web UI Affects Versions: 1.4.0 Reporter: SuYan Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-8385) java.lang.UnsupportedOperationException: Not implemented by the TFS FileSystem implementation
Peter Haumer created SPARK-8385: --- Summary: java.lang.UnsupportedOperationException: Not implemented by the TFS FileSystem implementation Key: SPARK-8385 URL: https://issues.apache.org/jira/browse/SPARK-8385 Project: Spark Issue Type: Bug Components: Input/Output Affects Versions: 1.4.0 Environment: RHEL 7.1 Reporter: Peter Haumer I used to be able to debug my Spark apps in Eclipse. With Spark 1.3.1 I created a launch and just set the vm var -Dspark.master=local[4]. With 1.4 this stopped working when reading files from the OS filesystem. Running the same apps with spark-submit works fine. Loosing the ability to debug that way has a major impact on the usability of Spark. The following exception is thrown: Exception in thread main java.lang.UnsupportedOperationException: Not implemented by the TFS FileSystem implementation at org.apache.hadoop.fs.FileSystem.getScheme(FileSystem.java:213) at org.apache.hadoop.fs.FileSystem.loadFileSystems(FileSystem.java:2401) at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2411) at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2428) at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:88) at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2467) at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2449) at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:367) at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:166) at org.apache.hadoop.mapred.JobConf.getWorkingDirectory(JobConf.java:653) at org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:389) at org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:362) at org.apache.spark.SparkContext$$anonfun$28.apply(SparkContext.scala:762) at org.apache.spark.SparkContext$$anonfun$28.apply(SparkContext.scala:762) at org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$6.apply(HadoopRDD.scala:172) at org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$6.apply(HadoopRDD.scala:172) at scala.Option.map(Option.scala:145) at org.apache.spark.rdd.HadoopRDD.getJobConf(HadoopRDD.scala:172) at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:196) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:217) at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:217) at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:217) at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:217) at org.apache.spark.SparkContext.runJob(SparkContext.scala:1535) at org.apache.spark.rdd.RDD.reduce(RDD.scala:900) at 
org.apache.spark.api.java.JavaRDDLike$class.reduce(JavaRDDLike.scala:357) at org.apache.spark.api.java.AbstractJavaRDDLike.reduce(JavaRDDLike.scala:46) at com.databricks.apps.logs.LogAnalyzer.main(LogAnalyzer.java:60) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8280) udf7 failed due to null vs nan semantics
[ https://issues.apache.org/jira/browse/SPARK-8280?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8280: --- Assignee: Apache Spark udf7 failed due to null vs nan semantics Key: SPARK-8280 URL: https://issues.apache.org/jira/browse/SPARK-8280 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Apache Spark Priority: Blocker To execute {code} sbt/sbt -Phive -Dspark.hive.whitelist=udf7.* hive/test-only org.apache.spark.sql.hive.execution.HiveCompatibilitySuite {code} If we want to be consistent with Hive, we need to special case our log function. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8281) udf_asin and udf_acos test failure
[ https://issues.apache.org/jira/browse/SPARK-8281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14587462#comment-14587462 ] Apache Spark commented on SPARK-8281: - User 'yijieshen' has created a pull request for this issue: https://github.com/apache/spark/pull/6835 udf_asin and udf_acos test failure -- Key: SPARK-8281 URL: https://issues.apache.org/jira/browse/SPARK-8281 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Priority: Blocker acos/asin in Hive returns NaN for not a number, whereas we always return null. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
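A hedged illustration of the semantic difference, run as SQL from Scala; the {{src}} table is the standard Hive test table used by these compatibility tests, and the results in the comments restate the description rather than a verified run:
{code}
// Per the description: Hive evaluates acos/asin of an out-of-range argument to NaN,
// while Spark SQL currently returns NULL for the same expressions.
val df = sqlContext.sql("SELECT acos(2.0), asin(2.0) FROM src LIMIT 1")
df.show()
// Hive:      NaN   NaN
// Spark SQL: NULL  NULL
{code}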
[jira] [Assigned] (SPARK-8281) udf_asin and udf_acos test failure
[ https://issues.apache.org/jira/browse/SPARK-8281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8281: --- Assignee: (was: Apache Spark) udf_asin and udf_acos test failure -- Key: SPARK-8281 URL: https://issues.apache.org/jira/browse/SPARK-8281 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Priority: Blocker acos/asin in Hive returns NaN for not a number, whereas we always return null. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8280) udf7 failed due to null vs nan semantics
[ https://issues.apache.org/jira/browse/SPARK-8280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14587461#comment-14587461 ] Apache Spark commented on SPARK-8280: - User 'yijieshen' has created a pull request for this issue: https://github.com/apache/spark/pull/6835 udf7 failed due to null vs nan semantics Key: SPARK-8280 URL: https://issues.apache.org/jira/browse/SPARK-8280 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Priority: Blocker To execute {code} sbt/sbt -Phive -Dspark.hive.whitelist=udf7.* hive/test-only org.apache.spark.sql.hive.execution.HiveCompatibilitySuite {code} If we want to be consistent with Hive, we need to special case our log function. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8280) udf7 failed due to null vs nan semantics
[ https://issues.apache.org/jira/browse/SPARK-8280?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8280: --- Assignee: (was: Apache Spark) udf7 failed due to null vs nan semantics Key: SPARK-8280 URL: https://issues.apache.org/jira/browse/SPARK-8280 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Priority: Blocker To execute {code} sbt/sbt -Phive -Dspark.hive.whitelist=udf7.* hive/test-only org.apache.spark.sql.hive.execution.HiveCompatibilitySuite {code} If we want to be consistent with Hive, we need to special case our log function. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8281) udf_asin and udf_acos test failure
[ https://issues.apache.org/jira/browse/SPARK-8281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8281: --- Assignee: Apache Spark udf_asin and udf_acos test failure -- Key: SPARK-8281 URL: https://issues.apache.org/jira/browse/SPARK-8281 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Apache Spark Priority: Blocker acos/asin in Hive returns NaN for not a number, whereas we always return null. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8280) udf7 failed due to null vs nan semantics
[ https://issues.apache.org/jira/browse/SPARK-8280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14587335#comment-14587335 ] Yijie Shen commented on SPARK-8280: --- I'll take this udf7 failed due to null vs nan semantics Key: SPARK-8280 URL: https://issues.apache.org/jira/browse/SPARK-8280 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Priority: Blocker To execute {code} sbt/sbt -Phive -Dspark.hive.whitelist=udf7.* hive/test-only org.apache.spark.sql.hive.execution.HiveCompatibilitySuite {code} If we want to be consistent with Hive, we need to special case our log function. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8281) udf_asin and udf_acos test failure
[ https://issues.apache.org/jira/browse/SPARK-8281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14587333#comment-14587333 ] Yijie Shen commented on SPARK-8281: --- I'll take this udf_asin and udf_acos test failure -- Key: SPARK-8281 URL: https://issues.apache.org/jira/browse/SPARK-8281 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Priority: Blocker acos/asin in Hive returns NaN for not a number, whereas we always return null. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7888) Be able to disable intercept in Linear Regression in ML package
[ https://issues.apache.org/jira/browse/SPARK-7888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] DB Tsai updated SPARK-7888: --- Assignee: holdenk Be able to disable intercept in Linear Regression in ML package --- Key: SPARK-7888 URL: https://issues.apache.org/jira/browse/SPARK-7888 Project: Spark Issue Type: New Feature Components: ML Reporter: DB Tsai Assignee: holdenk -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8206) math function: round
[ https://issues.apache.org/jira/browse/SPARK-8206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14587509#comment-14587509 ] Apache Spark commented on SPARK-8206: - User 'zhichao-li' has created a pull request for this issue: https://github.com/apache/spark/pull/6836 math function: round Key: SPARK-8206 URL: https://issues.apache.org/jira/browse/SPARK-8206 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: zhichao-li round(double a): double Returns the rounded BIGINT value of a. round(double a, INT d): double Returns a rounded to d decimal places. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
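A hedged usage sketch of the two forms of {{round}} being added; the {{src}} table is just a placeholder, and the expected values follow from the Hive-style semantics quoted above:
{code}
// round(double a): round to the nearest integer value.
sqlContext.sql("SELECT round(3.567) FROM src LIMIT 1").show()     // expected: 4

// round(double a, int d): round to d decimal places.
sqlContext.sql("SELECT round(3.567, 2) FROM src LIMIT 1").show()  // expected: 3.57
{code}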
[jira] [Commented] (SPARK-7633) Streaming Logistic Regression- Python bindings
[ https://issues.apache.org/jira/browse/SPARK-7633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14587286#comment-14587286 ] Mike Dusenberry commented on SPARK-7633: I can work on this one! Streaming Logistic Regression- Python bindings -- Key: SPARK-7633 URL: https://issues.apache.org/jira/browse/SPARK-7633 Project: Spark Issue Type: Sub-task Components: MLlib, PySpark Affects Versions: 1.4.0 Reporter: Yanbo Liang Add Python API for StreamingLogisticRegressionWithSGD -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7674) R-like stats for ML models
[ https://issues.apache.org/jira/browse/SPARK-7674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14587376#comment-14587376 ] holdenk commented on SPARK-7674: I'd love to help with this if thats cool :) R-like stats for ML models -- Key: SPARK-7674 URL: https://issues.apache.org/jira/browse/SPARK-7674 Project: Spark Issue Type: New Feature Components: ML Reporter: Joseph K. Bradley Assignee: Joseph K. Bradley Priority: Critical This is an umbrella JIRA for supporting ML model summaries and statistics, following the example of R's summary() and plot() functions. [Design doc|https://docs.google.com/document/d/1oswC_Neqlqn5ElPwodlDY4IkSaHAi0Bx6Guo_LvhHK8/edit?usp=sharing] From the design doc: {quote} R and its well-established packages provide extensive functionality for inspecting a model and its results. This inspection is critical to interpreting, debugging and improving models. R is arguably a gold standard for a statistics/ML library, so this doc largely attempts to imitate it. The challenge we face is supporting similar functionality, but on big (distributed) data. Data size makes both efficient computation and meaningful displays/summaries difficult. R model and result summaries generally take 2 forms: * summary(model): Display text with information about the model and results on data * plot(model): Display plots about the model and results We aim to provide both of these types of information. Visualization for the plottable results will not be supported in MLlib itself, but we can provide results in a form which can be plotted easily with other tools. {quote} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8387) [SPARK][Web-UI] Only show 4096 bytes content for executor log instead all
[ https://issues.apache.org/jira/browse/SPARK-8387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8387: --- Assignee: Apache Spark [SPARK][Web-UI] Only show 4096 bytes content for executor log instead all - Key: SPARK-8387 URL: https://issues.apache.org/jira/browse/SPARK-8387 Project: Spark Issue Type: Improvement Components: Web UI Affects Versions: 1.4.0 Reporter: SuYan Assignee: Apache Spark Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8368) ClassNotFoundException in closure for map
[ https://issues.apache.org/jira/browse/SPARK-8368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14587459#comment-14587459 ] Yin Huai commented on SPARK-8368: - @CHEN Zhiwei How was the application submitted? ClassNotFoundException in closure for map -- Key: SPARK-8368 URL: https://issues.apache.org/jira/browse/SPARK-8368 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.4.0 Environment: Centos 6.5, java 1.7.0_67, scala 2.10.4. Build the project on Windows 7 and run in a spark standalone cluster(or local) mode on Centos 6.X. Reporter: CHEN Zhiwei After upgraded the cluster from spark 1.3.0 to 1.4.0(rc4), I encountered the following exception: ==begin exception {quote} Exception in thread main java.lang.ClassNotFoundException: com.yhd.ycache.magic.Model$$anonfun$9$$anonfun$10 at java.net.URLClassLoader$1.run(URLClassLoader.java:366) at java.net.URLClassLoader$1.run(URLClassLoader.java:355) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:354) at java.lang.ClassLoader.loadClass(ClassLoader.java:425) at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308) at java.lang.ClassLoader.loadClass(ClassLoader.java:358) at java.lang.Class.forName0(Native Method) at java.lang.Class.forName(Class.java:278) at org.apache.spark.util.InnerClosureFinder$$anon$4.visitMethodInsn(ClosureCleaner.scala:455) at com.esotericsoftware.reflectasm.shaded.org.objectweb.asm.ClassReader.accept(Unknown Source) at com.esotericsoftware.reflectasm.shaded.org.objectweb.asm.ClassReader.accept(Unknown Source) at org.apache.spark.util.ClosureCleaner$.getInnerClosureClasses(ClosureCleaner.scala:101) at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:197) at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:132) at org.apache.spark.SparkContext.clean(SparkContext.scala:1891) at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:294) at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:293) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:148) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:109) at org.apache.spark.rdd.RDD.withScope(RDD.scala:286) at org.apache.spark.rdd.RDD.map(RDD.scala:293) at org.apache.spark.sql.DataFrame.map(DataFrame.scala:1210) at com.yhd.ycache.magic.Model$.main(SSExample.scala:239) at com.yhd.ycache.magic.Model.main(SSExample.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:664) at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:169) at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:192) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:111) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) {quote} ===end exception=== I simplify the code that cause this issue, as following: ==begin code== {noformat} object Model extends Serializable{ def main(args: Array[String]) { val Array(sql) = args val sparkConf = new SparkConf().setAppName(Mode Example) val sc = new SparkContext(sparkConf) val hive = new HiveContext(sc) //get data by hive sql 
val rows = hive.sql(sql) val data = rows.map(r = { val arr = r.toSeq.toArray val label = 1.0 def fmap = ( input: Any ) = 1.0 val feature = arr.map(_=1.0) LabeledPoint(label, Vectors.dense(feature)) }) data.count() } } {noformat} =end code=== This code can run pretty well on spark-shell, but error when submit it to spark cluster (standalone or local mode). I try the same code on spark 1.3.0(local mode), and no exception is encountered. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-8368) ClassNotFoundException in closure for map
[ https://issues.apache.org/jira/browse/SPARK-8368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14587459#comment-14587459 ] Yin Huai edited comment on SPARK-8368 at 6/16/15 4:35 AM: -- [~zwChan] How was the application submitted? was (Author: yhuai): @CHEN Zhiwei How was the application submitted? ClassNotFoundException in closure for map -- Key: SPARK-8368 URL: https://issues.apache.org/jira/browse/SPARK-8368 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.4.0 Environment: Centos 6.5, java 1.7.0_67, scala 2.10.4. Build the project on Windows 7 and run in a spark standalone cluster(or local) mode on Centos 6.X. Reporter: CHEN Zhiwei After upgraded the cluster from spark 1.3.0 to 1.4.0(rc4), I encountered the following exception: ==begin exception {quote} Exception in thread main java.lang.ClassNotFoundException: com.yhd.ycache.magic.Model$$anonfun$9$$anonfun$10 at java.net.URLClassLoader$1.run(URLClassLoader.java:366) at java.net.URLClassLoader$1.run(URLClassLoader.java:355) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:354) at java.lang.ClassLoader.loadClass(ClassLoader.java:425) at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308) at java.lang.ClassLoader.loadClass(ClassLoader.java:358) at java.lang.Class.forName0(Native Method) at java.lang.Class.forName(Class.java:278) at org.apache.spark.util.InnerClosureFinder$$anon$4.visitMethodInsn(ClosureCleaner.scala:455) at com.esotericsoftware.reflectasm.shaded.org.objectweb.asm.ClassReader.accept(Unknown Source) at com.esotericsoftware.reflectasm.shaded.org.objectweb.asm.ClassReader.accept(Unknown Source) at org.apache.spark.util.ClosureCleaner$.getInnerClosureClasses(ClosureCleaner.scala:101) at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:197) at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:132) at org.apache.spark.SparkContext.clean(SparkContext.scala:1891) at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:294) at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:293) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:148) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:109) at org.apache.spark.rdd.RDD.withScope(RDD.scala:286) at org.apache.spark.rdd.RDD.map(RDD.scala:293) at org.apache.spark.sql.DataFrame.map(DataFrame.scala:1210) at com.yhd.ycache.magic.Model$.main(SSExample.scala:239) at com.yhd.ycache.magic.Model.main(SSExample.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:664) at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:169) at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:192) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:111) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) {quote} ===end exception=== I simplify the code that cause this issue, as following: ==begin code== {noformat} object Model extends Serializable{ def main(args: Array[String]) { val Array(sql) = args val sparkConf = new SparkConf().setAppName(Mode Example) 
val sc = new SparkContext(sparkConf) val hive = new HiveContext(sc) //get data by hive sql val rows = hive.sql(sql) val data = rows.map(r = { val arr = r.toSeq.toArray val label = 1.0 def fmap = ( input: Any ) = 1.0 val feature = arr.map(_=1.0) LabeledPoint(label, Vectors.dense(feature)) }) data.count() } } {noformat} =end code=== This code can run pretty well on spark-shell, but error when submit it to spark cluster (standalone or local mode). I try the same code on spark 1.3.0(local mode), and no exception is encountered. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7837) NPE when save as parquet in speculative tasks
[ https://issues.apache.org/jira/browse/SPARK-7837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14587491#comment-14587491 ] Yin Huai commented on SPARK-7837: - Seems https://www.mail-archive.com/user@spark.apache.org/msg30327.html is about the same issue. NPE when save as parquet in speculative tasks - Key: SPARK-7837 URL: https://issues.apache.org/jira/browse/SPARK-7837 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.4.0 Reporter: Yin Huai Priority: Critical The query is like {{df.orderBy(...).saveAsTable(...)}}. When there is no partitioning columns and there is a skewed key, I found the following exception in speculative tasks. After these failures, seems we could not call {{SparkHadoopMapRedUtil.commitTask}} correctly. {code} java.lang.NullPointerException at parquet.hadoop.InternalParquetRecordWriter.flushRowGroupToStore(InternalParquetRecordWriter.java:146) at parquet.hadoop.InternalParquetRecordWriter.close(InternalParquetRecordWriter.java:112) at parquet.hadoop.ParquetRecordWriter.close(ParquetRecordWriter.java:73) at org.apache.spark.sql.parquet.ParquetOutputWriter.close(newParquet.scala:115) at org.apache.spark.sql.sources.DefaultWriterContainer.abortTask(commands.scala:385) at org.apache.spark.sql.sources.InsertIntoHadoopFsRelation.org$apache$spark$sql$sources$InsertIntoHadoopFsRelation$$writeRows$1(commands.scala:150) at org.apache.spark.sql.sources.InsertIntoHadoopFsRelation$$anonfun$insert$1.apply(commands.scala:122) at org.apache.spark.sql.sources.InsertIntoHadoopFsRelation$$anonfun$insert$1.apply(commands.scala:122) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63) at org.apache.spark.scheduler.Task.run(Task.scala:70) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8381) reuse typeConvert when convert Seq[Row] to catalyst type
[ https://issues.apache.org/jira/browse/SPARK-8381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lianhui Wang updated SPARK-8381: Description: This method CatalystTypeConverters.convertToCatalyst is slow, so for batch conversion we should be using converter produced by createToCatalystConverter. (was: This method CatalystTypeConverters.convertToCatalyst is slow, and for batch conversion we should be using converter produced by createToCatalystConverter.) reuse typeConvert when convert Seq[Row] to catalyst type Key: SPARK-8381 URL: https://issues.apache.org/jira/browse/SPARK-8381 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.4.0 Reporter: Lianhui Wang This method CatalystTypeConverters.convertToCatalyst is slow, so for batch conversion we should be using converter produced by createToCatalystConverter. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
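A sketch of the pattern the description advocates: build one converter per schema up front and reuse it across rows, instead of calling the per-value convertToCatalyst in a loop. The method names follow the description; treat the exact signatures as approximate, since this is an internal API:
{code}
import org.apache.spark.sql.Row
import org.apache.spark.sql.catalyst.CatalystTypeConverters
import org.apache.spark.sql.types.StructType

def convertRows(rows: Seq[Row], schema: StructType): Seq[Any] = {
  // Created once: resolves the converter chain for the schema a single time.
  val toCatalyst = CatalystTypeConverters.createToCatalystConverter(schema)
  // Reused for every row, avoiding the slower per-value convertToCatalyst path.
  rows.map(toCatalyst)
}
{code}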
[jira] [Assigned] (SPARK-8381) reuse typeConvert when convert Seq[Row] to catalyst type
[ https://issues.apache.org/jira/browse/SPARK-8381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8381: --- Assignee: Apache Spark reuse typeConvert when convert Seq[Row] to catalyst type Key: SPARK-8381 URL: https://issues.apache.org/jira/browse/SPARK-8381 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.4.0 Reporter: Lianhui Wang Assignee: Apache Spark This method CatalystTypeConverters.convertToCatalyst is slow, and for batch conversion we should be using converter produced by createToCatalystConverter. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8381) reuse typeConvert when convert Seq[Row] to catalyst type
[ https://issues.apache.org/jira/browse/SPARK-8381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8381: --- Assignee: (was: Apache Spark) reuse typeConvert when convert Seq[Row] to catalyst type Key: SPARK-8381 URL: https://issues.apache.org/jira/browse/SPARK-8381 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.4.0 Reporter: Lianhui Wang This method CatalystTypeConverters.convertToCatalyst is slow, and for batch conversion we should be using converter produced by createToCatalystConverter. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-8382) Improve Analysis Unit test framework
Michael Armbrust created SPARK-8382: --- Summary: Improve Analysis Unit test framework Key: SPARK-8382 URL: https://issues.apache.org/jira/browse/SPARK-8382 Project: Spark Issue Type: Bug Components: SQL Reporter: Michael Armbrust We have some nice frameworks for doing various unit tests: {{checkAnswer}}, {{comparePlan}}, {{checkEvaluation}}, etc. However, {{AnalysisSuite}} is kind of sloppy, with each test using assertions in different ways. I'd like a function that looks something like the following: {code} def checkAnalysis( inputPlan: LogicalPlan, expectedPlan: LogicalPlan = null, caseInsensitiveOnly: Boolean = false, expectedErrors: Seq[String] = Nil) {code} This function should construct tests that check that the Analyzer works as expected and provide useful error messages when any failures are encountered. We should then rewrite the existing tests and beef up our coverage here. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
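A hedged sketch of how tests might look with such a helper; only the {{checkAnalysis}} signature comes from the description above, while the test relation, DSL calls, and error strings are hypothetical and follow the style of existing catalyst test suites:
{code}
// Hypothetical usage of the proposed helper.
// Assume: testRelation = LocalRelation('a.int, 'b.string), in catalyst DSL style.

// Resolution succeeds: a case-insensitive reference to 'A should resolve to 'a.
checkAnalysis(
  inputPlan = testRelation.select('A),
  expectedPlan = testRelation.select('a),
  caseInsensitiveOnly = true)

// Resolution fails: the analyzer should report the unresolved attribute,
// and the failure message should contain these substrings.
checkAnalysis(
  inputPlan = testRelation.select('doesNotExist),
  expectedErrors = Seq("cannot resolve", "doesNotExist"))
{code}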
[jira] [Resolved] (SPARK-6583) Support aggregated function in order by
[ https://issues.apache.org/jira/browse/SPARK-6583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-6583. - Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 6816 [https://github.com/apache/spark/pull/6816] Support aggregated function in order by --- Key: SPARK-6583 URL: https://issues.apache.org/jira/browse/SPARK-6583 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.3.0 Reporter: Yadong Qi Assignee: Yadong Qi Fix For: 1.5.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6220) Allow extended EC2 options to be passed through spark-ec2
[ https://issues.apache.org/jira/browse/SPARK-6220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14586276#comment-14586276 ] Daniel LaBar commented on SPARK-6220: - [~nchammas], I also need IAM support and [made a few changes to spark_ec2.py|https://github.com/dnlbrky/spark/commit/5d4a9c65728245dc501c2a7c479ca27b6f685bd8], including an {{--instance-profile-name}} option. These modifications let me successfully create security groups and the master/slaves without specifying an access key and secret, but I'm still having issues getting Hadoop/Yarn setup so it may require further changes. Please let me know if you have suggestions. This would be my first time contributing to an Apache project and I'm new to Spark/Python, so please forgive my greenness... Should I create another JIRA specifically to add instance profile support, or can I reference this JIRA when submitting a pull request? Allow extended EC2 options to be passed through spark-ec2 - Key: SPARK-6220 URL: https://issues.apache.org/jira/browse/SPARK-6220 Project: Spark Issue Type: Improvement Components: EC2 Reporter: Nicholas Chammas Priority: Minor There are many EC2 options exposed by the boto library that spark-ec2 uses. Over time, many of these EC2 options have been bubbled up here and there to become spark-ec2 options. Examples: * spot prices * placement groups * VPC, subnet, and security group assignments It's likely that more and more EC2 options will trickle up like this to become spark-ec2 options. While major options are well suited to this type of promotion, we should probably allow users to pass through EC2 options they want to use through spark-ec2 in some generic way. Let's add two options: * {{--ec2-instance-option}} - [{{boto::run}}|http://boto.readthedocs.org/en/latest/ref/ec2.html#boto.ec2.image.Image.run] * {{--ec2-spot-instance-option}} - [{{boto::request_spot_instances}}|http://boto.readthedocs.org/en/latest/ref/ec2.html#boto.ec2.connection.EC2Connection.request_spot_instances] Each option can be specified multiple times and is simply passed directly to the underlying boto call. For example: {code} spark-ec2 \ ... --ec2-instance-option instance_initiated_shutdown_behavior=terminate \ --ec2-instance-option ebs_optimized=True {code} I'm not sure about the exact syntax of the extended options, but something like this will do the trick as long as it can be made to pass the options correctly to boto in most cases. I followed the example of {{ssh}}, which supports multiple extended options similarly. {code} ssh -o LogLevel=ERROR -o UserKnowHostsFile=/dev/null ... {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8381) reuse typeConvert when convert Seq[Row] to catalyst type
[ https://issues.apache.org/jira/browse/SPARK-8381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14586330#comment-14586330 ] Apache Spark commented on SPARK-8381: - User 'lianhuiwang' has created a pull request for this issue: https://github.com/apache/spark/pull/6831 reuse typeConvert when convert Seq[Row] to catalyst type Key: SPARK-8381 URL: https://issues.apache.org/jira/browse/SPARK-8381 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.4.0 Reporter: Lianhui Wang This method CatalystTypeConverters.convertToCatalyst is slow, and for batch conversion we should be using converter produced by createToCatalystConverter. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7721) Generate test coverage report from Python
[ https://issues.apache.org/jira/browse/SPARK-7721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14586280#comment-14586280 ] Josh Rosen commented on SPARK-7721: --- We now have the Jenkins HTML publisher plugin installed, so we can now easily publish HTML reports from tools from coverage.py (https://wiki.jenkins-ci.org/display/JENKINS/HTML+Publisher+Plugin). I might give this a try on NewSparkPullRequestBuilder today. Generate test coverage report from Python - Key: SPARK-7721 URL: https://issues.apache.org/jira/browse/SPARK-7721 Project: Spark Issue Type: Test Components: PySpark, Tests Reporter: Reynold Xin Would be great to have test coverage report for Python. Compared with Scala, it is tricker to understand the coverage without coverage reports in Python because we employ both docstring tests and unit tests in test files. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-8380) SparkR mis-counts
Rick Moritz created SPARK-8380: -- Summary: SparkR mis-counts Key: SPARK-8380 URL: https://issues.apache.org/jira/browse/SPARK-8380 Project: Spark Issue Type: Bug Components: SparkR Affects Versions: 1.4.0 Reporter: Rick Moritz On my dataset of ~9 million rows x 30 columns, queried via Hive, I can perform count operations on the entirety of the dataset and get the correct value, as double-checked against the same code in Scala. When I start to add conditions or even do a simple partial ascending histogram, I get discrepancies. In particular, there are missing values in SparkR, and massively so: a top-6 count of a certain feature in my dataset produces numbers an order of magnitude smaller than I get via Scala. The following logic, which I consider equivalent, is the basis for this report: counts <- summarize(groupBy(df, df$col_name), count = n(tdf$col_name)) head(arrange(counts, desc(counts$count))) versus: val table = sql("SELECT col_name, count(col_name) as value from df group by col_name order by value desc") The first, in particular, is taken directly from the SparkR programming guide. Since summarize isn't documented from what I can see, I'd hope it does what the programming guide indicates. In that case this would be a pretty serious logic bug (no errors are thrown). Otherwise, there's the possibility that a lack of documentation and a badly worded example in the guide are behind my misperception of SparkR's functionality. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-8322) EC2 script not fully updated for 1.4.0 release
[ https://issues.apache.org/jira/browse/SPARK-8322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Smith closed SPARK-8322. - Thanks for making my first PR so painless, guys.

EC2 script not fully updated for 1.4.0 release
Key: SPARK-8322 URL: https://issues.apache.org/jira/browse/SPARK-8322 Project: Spark Issue Type: Bug Components: EC2 Affects Versions: 1.4.0 Reporter: Mark Smith Assignee: Mark Smith Labels: easyfix Fix For: 1.4.1, 1.5.0

In the spark_ec2.py script, the 1.4.0 Spark version hasn't been added to the VALID_SPARK_VERSIONS map or the SPARK_TACHYON_MAP, causing the script to break for the latest release.
[jira] [Commented] (SPARK-8380) SparkR mis-counts
[ https://issues.apache.org/jira/browse/SPARK-8380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14586262#comment-14586262 ] Rick Moritz commented on SPARK-8380: I will attempt to reproduce this with an alternate dataset ASAP, but getting large-volume datasets into this cluster is difficult.

SparkR mis-counts
Key: SPARK-8380 URL: https://issues.apache.org/jira/browse/SPARK-8380 Project: Spark Issue Type: Bug Components: SparkR Affects Versions: 1.4.0 Reporter: Rick Moritz

On my dataset of ~9 million rows x 30 columns, queried via Hive, I can perform count operations on the entirety of the dataset and get the correct value, as double-checked against the same code in Scala. When I start to add conditions or even do a simple partial ascending histogram, I get discrepancies. In particular, values are missing in SparkR, and massively so: a top-6 count of a certain feature in my dataset yields numbers an order of magnitude smaller than those I get via Scala. The following logic, which I consider equivalent, is the basis for this report:

{code}
counts <- summarize(groupBy(df, df$col_name), count = n(df$col_name))
head(arrange(counts, desc(counts$count)))
{code}

versus:

{code}
val table = sql("SELECT col_name, count(col_name) as value FROM df GROUP BY col_name ORDER BY value desc")
{code}

The first, in particular, is taken directly from the SparkR programming guide. Since summarize isn't documented from what I can see, I'd hope it does what the programming guide indicates. In that case this would be a pretty serious logic bug (no errors are thrown). Otherwise, a lack of documentation and a badly worded example in the guide may be behind my misperception of SparkR's functionality.
[jira] [Commented] (SPARK-4879) Missing output partitions after job completes with speculative execution
[ https://issues.apache.org/jira/browse/SPARK-4879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14586266#comment-14586266 ] Igor Berman commented on SPARK-4879: I'm experiencing this issue. Sometimes an RDD with 4 partitions is written with only 3 part files, yet the _SUCCESS marker is present.

Missing output partitions after job completes with speculative execution
Key: SPARK-4879 URL: https://issues.apache.org/jira/browse/SPARK-4879 Project: Spark Issue Type: Bug Components: Input/Output, Spark Core Affects Versions: 1.0.2, 1.1.1, 1.2.0, 1.3.0 Reporter: Josh Rosen Assignee: Josh Rosen Priority: Critical Labels: backport-needed Fix For: 1.3.0 Attachments: speculation.txt, speculation2.txt

When speculative execution is enabled ({{spark.speculation=true}}), jobs that save output files may report that they have completed successfully even though some output partitions written by speculative tasks may be missing.

h3. Reproduction

This symptom was reported to me by a Spark user and I've been doing my own investigation to try to come up with an in-house reproduction. I'm still working on a reliable local reproduction for this issue, which is a little tricky because Spark won't schedule speculated tasks on the same host as the original task, so you need an actual (or containerized) multi-host cluster to test speculation. Here's a simple reproduction of some of the symptoms on EC2, which can be run in {{spark-shell}} with {{--conf spark.speculation=true}}:

{code}
// Rig a job such that all but one of the tasks complete instantly
// and one task runs for 20 seconds on its first attempt and instantly
// on its second attempt:
val numTasks = 100
sc.parallelize(1 to numTasks, numTasks).repartition(2).mapPartitionsWithContext { case (ctx, iter) =>
  if (ctx.partitionId == 0) {
    // If this is the one task that should run really slowly
    if (ctx.attemptId == 0) {
      // If this is the first attempt, run slow
      Thread.sleep(20 * 1000)
    }
  }
  iter
}.map(x => (x, x)).saveAsTextFile("/test4")
{code}

When I run this, I end up with a job that completes quickly (due to speculation) but reports failures from the speculated task:

{code}
[...]
14/12/11 01:41:13 INFO scheduler.TaskSetManager: Finished task 37.1 in stage 3.0 (TID 411) in 131 ms on ip-172-31-8-164.us-west-2.compute.internal (100/100)
14/12/11 01:41:13 INFO scheduler.DAGScheduler: Stage 3 (saveAsTextFile at <console>:22) finished in 0.856 s
14/12/11 01:41:13 INFO spark.SparkContext: Job finished: saveAsTextFile at <console>:22, took 0.885438374 s
14/12/11 01:41:13 INFO scheduler.TaskSetManager: Ignoring task-finished event for 70.1 in stage 3.0 because task 70 has already completed successfully
scala> 14/12/11 01:41:13 WARN scheduler.TaskSetManager: Lost task 49.1 in stage 3.0 (TID 413, ip-172-31-8-164.us-west-2.compute.internal): java.io.IOException: Failed to save output of task: attempt_201412110141_0003_m_49_413
org.apache.hadoop.mapred.FileOutputCommitter.moveTaskOutputs(FileOutputCommitter.java:160)
org.apache.hadoop.mapred.FileOutputCommitter.moveTaskOutputs(FileOutputCommitter.java:172)
org.apache.hadoop.mapred.FileOutputCommitter.commitTask(FileOutputCommitter.java:132)
org.apache.spark.SparkHadoopWriter.commit(SparkHadoopWriter.scala:109)
org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:991)
org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:974)
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
org.apache.spark.scheduler.Task.run(Task.scala:54)
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
java.lang.Thread.run(Thread.java:745)
{code}

One interesting thing to note about this stack trace: if we look at {{FileOutputCommitter.java:160}} ([link|http://grepcode.com/file/repository.cloudera.com/content/repositories/releases/org.apache.hadoop/hadoop-core/2.5.0-mr1-cdh5.2.0/org/apache/hadoop/mapred/FileOutputCommitter.java#160]), this point in the execution seems to correspond to a case where a task completes, attempts to commit its output, fails for some reason, then deletes the destination file, tries again, and fails:

{code}
if (fs.isFile(taskOutput)) {
  Path finalOutputPath = getFinalPath(jobOutputDir, taskOutput,
{code}
[jira] [Commented] (SPARK-8380) SparkR mis-counts
[ https://issues.apache.org/jira/browse/SPARK-8380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14586277#comment-14586277 ] Shivaram Venkataraman commented on SPARK-8380: -- [~RPCMoritz] A couple of things would be interesting to see: 1. Does the `sql` command in SparkR work correctly? 2. Can you try the DataFrame statements in Scala and see what results you get? cc [~rxin]

SparkR mis-counts
Key: SPARK-8380 URL: https://issues.apache.org/jira/browse/SPARK-8380 Project: Spark Issue Type: Bug Components: SparkR Affects Versions: 1.4.0 Reporter: Rick Moritz

On my dataset of ~9 million rows x 30 columns, queried via Hive, I can perform count operations on the entirety of the dataset and get the correct value, as double-checked against the same code in Scala. When I start to add conditions or even do a simple partial ascending histogram, I get discrepancies. In particular, values are missing in SparkR, and massively so: a top-6 count of a certain feature in my dataset yields numbers an order of magnitude smaller than those I get via Scala. The following logic, which I consider equivalent, is the basis for this report:

{code}
counts <- summarize(groupBy(df, df$col_name), count = n(df$col_name))
head(arrange(counts, desc(counts$count)))
{code}

versus:

{code}
val table = sql("SELECT col_name, count(col_name) as value FROM df GROUP BY col_name ORDER BY value desc")
{code}

The first, in particular, is taken directly from the SparkR programming guide. Since summarize isn't documented from what I can see, I'd hope it does what the programming guide indicates. In that case this would be a pretty serious logic bug (no errors are thrown). Otherwise, a lack of documentation and a badly worded example in the guide may be behind my misperception of SparkR's functionality.
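For reference, a Scala DataFrame equivalent of the SparkR snippet, roughly what the suggested cross-check could look like (df and col_name are taken from the report; written against the 1.4-era functions API):

{code}
import org.apache.spark.sql.functions.{count, desc}

// df is the DataFrame from the report; group, count per group, show the top 6.
val counts = df.groupBy("col_name").agg(count("col_name").as("count"))
counts.orderBy(desc("count")).show(6)
{code}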
[jira] [Created] (SPARK-8381) reuse-typeConvert when convert Seq[Row] to CatalystType
Lianhui Wang created SPARK-8381: --- Summary: reuse-typeConvert when convert Seq[Row] to CatalystType
Key: SPARK-8381 URL: https://issues.apache.org/jira/browse/SPARK-8381 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.4.0 Reporter: Lianhui Wang

The method CatalystTypeConverters.convertToCatalyst is slow; for batch conversion we should instead use the converter produced by createToCatalystConverter.
[jira] [Updated] (SPARK-8381) reuse typeConvert when convert Seq[Row] to catalyst type
[ https://issues.apache.org/jira/browse/SPARK-8381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lianhui Wang updated SPARK-8381: Summary: reuse typeConvert when convert Seq[Row] to catalyst type (was: reuse-typeConvert when convert Seq[Row] to catalyst type)

reuse typeConvert when convert Seq[Row] to catalyst type
Key: SPARK-8381 URL: https://issues.apache.org/jira/browse/SPARK-8381 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.4.0 Reporter: Lianhui Wang

The method CatalystTypeConverters.convertToCatalyst is slow; for batch conversion we should instead use the converter produced by createToCatalystConverter.
[jira] [Updated] (SPARK-8381) reuse-typeConvert when convert Seq[Row] to catalyst type
[ https://issues.apache.org/jira/browse/SPARK-8381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lianhui Wang updated SPARK-8381: Summary: reuse-typeConvert when convert Seq[Row] to catalyst type (was: reuse-typeConvert when convert Seq[Row] to CatalystType)

reuse-typeConvert when convert Seq[Row] to catalyst type
Key: SPARK-8381 URL: https://issues.apache.org/jira/browse/SPARK-8381 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.4.0 Reporter: Lianhui Wang

The method CatalystTypeConverters.convertToCatalyst is slow; for batch conversion we should instead use the converter produced by createToCatalystConverter.
[jira] [Commented] (SPARK-6220) Allow extended EC2 options to be passed through spark-ec2
[ https://issues.apache.org/jira/browse/SPARK-6220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14586334#comment-14586334 ] Nicholas Chammas commented on SPARK-6220: - "please forgive my greenness" No need. Greenness is not a crime around these parts. :) I suggest creating a new JIRA for that specific feature. In the JIRA you can reference this issue here as related. By the way, I took a look at your commit. If I understood correctly, your change associates launched instances with an IAM profile (allowing the launched cluster to, for example, access S3 without credentials), but the machine you are running spark-ec2 from still needs AWS keys to launch them. That seems fine to me, but it doesn't sound exactly like what you intended from your comment.

Allow extended EC2 options to be passed through spark-ec2
Key: SPARK-6220 URL: https://issues.apache.org/jira/browse/SPARK-6220 Project: Spark Issue Type: Improvement Components: EC2 Reporter: Nicholas Chammas Priority: Minor

There are many EC2 options exposed by the boto library that spark-ec2 uses. Over time, many of these EC2 options have bubbled up here and there to become spark-ec2 options. Examples:
* spot prices
* placement groups
* VPC, subnet, and security group assignments

It's likely that more and more EC2 options will trickle up like this to become spark-ec2 options. While major options are well suited to this type of promotion, we should probably allow users to pass through any EC2 options they want through spark-ec2 in some generic way. Let's add two options:
* {{--ec2-instance-option}} - [{{boto::run}}|http://boto.readthedocs.org/en/latest/ref/ec2.html#boto.ec2.image.Image.run]
* {{--ec2-spot-instance-option}} - [{{boto::request_spot_instances}}|http://boto.readthedocs.org/en/latest/ref/ec2.html#boto.ec2.connection.EC2Connection.request_spot_instances]

Each option can be specified multiple times and is simply passed directly to the underlying boto call. For example:

{code}
spark-ec2 \
  ...
  --ec2-instance-option instance_initiated_shutdown_behavior=terminate \
  --ec2-instance-option ebs_optimized=True
{code}

I'm not sure about the exact syntax of the extended options, but something like this will do the trick as long as it can be made to pass the options correctly to boto in most cases. I followed the example of {{ssh}}, which supports multiple extended options similarly:

{code}
ssh -o LogLevel=ERROR -o UserKnownHostsFile=/dev/null ...
{code}
[jira] [Commented] (SPARK-8383) Spark History Server shows Last Updated as 1969/12/31 when SparkPI application completed
[ https://issues.apache.org/jira/browse/SPARK-8383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14586648#comment-14586648 ] Irina Easterling commented on SPARK-8383: -

Steps to reproduce:
1. Install Spark through the Ambari Wizard.
2. After installation, run the SparkPi example.
3. Navigate to your Spark directory and submit the job:
{code}
baron1:~ # cd /usr/hdp/current/spark-client/
baron1:/usr/hdp/current/spark-client # su spark
spark@baron1:/usr/hdp/current/spark-client> spark-submit --verbose --class org.apache.spark.examples.SparkPi --master yarn-cluster --num-executors 3 --driver-memory 512m --executor-memory 512m --executor-cores 1 lib/spark-examples*.jar 10
{code}
4. Wait for the job to complete.
5. Open the Spark History Server UI from Ambari.
6. Click on the 'Show incomplete applications' link.
7. View the result for the completed job.
Result: the Last Updated column shows the date/time as 1969/12/31 19:00:00 (screenshot attached).
8. Verify that the Spark job completed in YARN (screenshot attached).

There is also a discrepancy between the Spark History Server web UI and the YARN ResourceManager web UI: the Spark job completed and is shown as such in the YARN ResourceManager web UI, but the Spark History Server web UI shows it as incomplete. See attached screenshots.

Spark History Server shows Last Updated as 1969/12/31 when SparkPI application completed
Key: SPARK-8383 URL: https://issues.apache.org/jira/browse/SPARK-8383 Project: Spark Issue Type: Bug Components: Spark Core, Web UI Affects Versions: 1.3.1 Environment: Spark 1.3.1.2.3 Reporter: Irina Easterling Attachments: Spark_WrongLastUpdatedDate.png, YARN_SparkJobCompleted.PNG

Spark History Server shows Last Updated as 1969/12/31 when the SparkPi application completed and the Started date is 2015/06/10.
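For what it's worth, the reported value is exactly what an unset timestamp of 0 ms since the Unix epoch looks like when rendered in a UTC-5 timezone; a small Scala illustration (the timezone is an assumption, not something stated in the report):

{code}
import java.text.SimpleDateFormat
import java.util.{Date, TimeZone}

// 0 ms since the Unix epoch, displayed in US Eastern time (UTC-5 in winter),
// prints as the evening of the day before the epoch.
val fmt = new SimpleDateFormat("yyyy/MM/dd HH:mm:ss")
fmt.setTimeZone(TimeZone.getTimeZone("America/New_York"))
println(fmt.format(new Date(0L)))  // 1969/12/31 19:00:00
{code}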
[jira] [Created] (SPARK-8383) Spark History Server shows Last Updated as 1969/12/31 when SparkPI application completed
Irina Easterling created SPARK-8383: --- Summary: Spark History Server shows Last Updated as 1969/12/31 when SparkPI application completed
Key: SPARK-8383 URL: https://issues.apache.org/jira/browse/SPARK-8383 Project: Spark Issue Type: Bug Components: Spark Core, Web UI Affects Versions: 1.3.1 Environment: Spark 1.3.1.2.3 Reporter: Irina Easterling

Spark History Server shows Last Updated as 1969/12/31 when SparkPI application completed and Started Date is 2015/06/10
[jira] [Updated] (SPARK-8383) Spark History Server shows Last Updated as 1969/12/31 when SparkPI application completed
[ https://issues.apache.org/jira/browse/SPARK-8383?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Irina Easterling updated SPARK-8383: Attachment: Spark_WrongLastUpdatedDate.png

Spark History Server shows Last Updated as 1969/12/31 when SparkPI application completed
Key: SPARK-8383 URL: https://issues.apache.org/jira/browse/SPARK-8383 Project: Spark Issue Type: Bug Components: Spark Core, Web UI Affects Versions: 1.3.1 Environment: Spark1.3.1.2.3 Reporter: Irina Easterling Attachments: Spark_WrongLastUpdatedDate.png

Spark History Server shows Last Updated as 1969/12/31 when SparkPI application completed and Started Date is 2015/06/10
[jira] [Commented] (SPARK-6220) Allow extended EC2 options to be passed through spark-ec2
[ https://issues.apache.org/jira/browse/SPARK-6220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14586616#comment-14586616 ] Daniel LaBar commented on SPARK-6220: - Ok, I'll create a new JIRA with a reference to this one. Thanks for checking the commit. Our IT security team only gives us AWS keys for a service account, but we don't have access to EC2, EMR, S3, etc. from this account. In order to do anything useful we have to switch roles using the service account credentials and MFA. But the spark-ec2 script doesn't seem to work with anything other than the AWS key/secret. So I use the service account credentials to create an EC2 instance with an IAM profile that can do useful things, SSH into that EC2 instance, and then launch the Spark EC2 cluster from there using the modified spark_ec2.py script.

Allow extended EC2 options to be passed through spark-ec2
Key: SPARK-6220 URL: https://issues.apache.org/jira/browse/SPARK-6220 Project: Spark Issue Type: Improvement Components: EC2 Reporter: Nicholas Chammas Priority: Minor

There are many EC2 options exposed by the boto library that spark-ec2 uses. Over time, many of these EC2 options have bubbled up here and there to become spark-ec2 options. Examples:
* spot prices
* placement groups
* VPC, subnet, and security group assignments

It's likely that more and more EC2 options will trickle up like this to become spark-ec2 options. While major options are well suited to this type of promotion, we should probably allow users to pass through any EC2 options they want through spark-ec2 in some generic way. Let's add two options:
* {{--ec2-instance-option}} - [{{boto::run}}|http://boto.readthedocs.org/en/latest/ref/ec2.html#boto.ec2.image.Image.run]
* {{--ec2-spot-instance-option}} - [{{boto::request_spot_instances}}|http://boto.readthedocs.org/en/latest/ref/ec2.html#boto.ec2.connection.EC2Connection.request_spot_instances]

Each option can be specified multiple times and is simply passed directly to the underlying boto call. For example:

{code}
spark-ec2 \
  ...
  --ec2-instance-option instance_initiated_shutdown_behavior=terminate \
  --ec2-instance-option ebs_optimized=True
{code}

I'm not sure about the exact syntax of the extended options, but something like this will do the trick as long as it can be made to pass the options correctly to boto in most cases. I followed the example of {{ssh}}, which supports multiple extended options similarly:

{code}
ssh -o LogLevel=ERROR -o UserKnownHostsFile=/dev/null ...
{code}
[jira] [Updated] (SPARK-8383) Spark History Server shows Last Updated as 1969/12/31 when SparkPI application completed
[ https://issues.apache.org/jira/browse/SPARK-8383?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Irina Easterling updated SPARK-8383: Attachment: YARN_SparkJobCompleted.PNG

Spark History Server shows Last Updated as 1969/12/31 when SparkPI application completed
Key: SPARK-8383 URL: https://issues.apache.org/jira/browse/SPARK-8383 Project: Spark Issue Type: Bug Components: Spark Core, Web UI Affects Versions: 1.3.1 Environment: Spark1.3.1.2.3 Reporter: Irina Easterling Attachments: Spark_WrongLastUpdatedDate.png, YARN_SparkJobCompleted.PNG

Spark History Server shows Last Updated as 1969/12/31 when SparkPI application completed and Started Date is 2015/06/10
[jira] [Commented] (SPARK-5680) Sum function on all null values, should return zero
[ https://issues.apache.org/jira/browse/SPARK-5680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14587473#comment-14587473 ] Venkata Ramana G commented on SPARK-5680: - Holman, you are right that a column with all NULL values should return NULL. My motivation was to fix udaf_number_format.q: select sum('a') from src returns 0 in Hive and MySQL, while select cast('a' as double) from src returns NULL in Hive. I wrongly analysed this as "a sum of all NULLs returns 0", and that introduced the problem. I apologize for this and will submit a patch to revert that fix. Why select sum('a') from src returns 0 in Hive and MySQL is still not clear to me.

Sum function on all null values, should return zero
Key: SPARK-5680 URL: https://issues.apache.org/jira/browse/SPARK-5680 Project: Spark Issue Type: Bug Components: SQL Reporter: Venkata Ramana G Assignee: Venkata Ramana G Priority: Minor Fix For: 1.3.1, 1.4.0

SELECT sum('a'), avg('a'), variance('a'), std('a') FROM src;
Current output: NULL NULL NULL NULL
Expected output: 0.0 NULL NULL NULL
This fixes hive udaf_number_format.q
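A small illustration of the behavior under discussion (table name from the ticket; the Hive/MySQL results are those reported above, not re-verified here):

{code}
// In spark-shell, assuming sqlContext is a HiveContext and the src table exists (as in the ticket).
// cast('a' as double) is NULL for every row, so the aggregate input is all NULLs.
val result = sqlContext.sql("SELECT sum(cast('a' AS double)) FROM src")
result.show()  // after the revert discussed above, this should print NULL rather than 0.0
{code}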
[jira] [Commented] (SPARK-8335) DecisionTreeModel.predict() return type not convenient!
[ https://issues.apache.org/jira/browse/SPARK-8335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14587013#comment-14587013 ] Sean Owen commented on SPARK-8335: -- Go ahead and propose a PR. The sticky issue here is whether it's OK to change an experimental API at this point. I think so.

DecisionTreeModel.predict() return type not convenient!
Key: SPARK-8335 URL: https://issues.apache.org/jira/browse/SPARK-8335 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.3.1 Reporter: Sebastian Walz Priority: Minor Labels: easyfix, machine_learning Original Estimate: 10m Remaining Estimate: 10m

org.apache.spark.mllib.tree.model.DecisionTreeModel has a predict method:

{code}
def predict(features: JavaRDD[Vector]): JavaRDD[Double]
{code}

The problem here is the generic type of the return type, JavaRDD[Double], because it is a scala.Double where I would expect a java.lang.Double (to be consistent with, e.g., org.apache.spark.mllib.classification.ClassificationModel). I wanted to extend DecisionTreeModel, use it only for binary classification, and implement the trait org.apache.spark.mllib.classification.ClassificationModel. But that is not possible, because ClassificationModel already defines the predict method, with a return type of JavaRDD[java.lang.Double].
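A minimal sketch of the clash the reporter describes, using simplified stand-in traits rather than the real MLlib classes:

{code}
import org.apache.spark.api.java.JavaRDD
import org.apache.spark.mllib.linalg.Vector

// Stand-ins for the two signatures in question:
trait TreeLike   { def predict(features: JavaRDD[Vector]): JavaRDD[Double] }           // scala.Double
trait Classifier { def predict(features: JavaRDD[Vector]): JavaRDD[java.lang.Double] } // boxed Double

// A single class cannot mix in both: the two predict methods take the same
// parameters but differ in return type, so neither can override the other.
// class Both extends TreeLike with Classifier  // does not compile
{code}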
[jira] [Created] (SPARK-8370) Add API for data sources to register databases
Santiago M. Mola created SPARK-8370: --- Summary: Add API for data sources to register databases
Key: SPARK-8370 URL: https://issues.apache.org/jira/browse/SPARK-8370 Project: Spark Issue Type: New Feature Reporter: Santiago M. Mola

This API would allow registering a database with a data source, instead of just a table. Registering a data source database would register all of its tables and keep the catalog updated. The catalog could delegate to the data source lookups of tables in a database registered with this API.
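To make the proposal concrete, a purely hypothetical sketch of what such a provider interface might look like; every name below is invented for illustration and is not an actual Spark API:

{code}
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.sources.BaseRelation

// Hypothetical: a data source implementing this could register a whole database,
// letting the catalog delegate table lookups to it.
trait DatabaseProvider {
  /** Names of the tables currently present in the external database. */
  def tableNames(sqlContext: SQLContext): Seq[String]

  /** Resolve a table by name, or None if the database does not contain it. */
  def lookupTable(sqlContext: SQLContext, table: String): Option[BaseRelation]
}
{code}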
[jira] [Created] (SPARK-8371) improve unit test for MaxOf and MinOf
Wenchen Fan created SPARK-8371: -- Summary: improve unit test for MaxOf and MinOf
Key: SPARK-8371 URL: https://issues.apache.org/jira/browse/SPARK-8371 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Wenchen Fan
[jira] [Updated] (SPARK-8370) Add API for data sources to register databases
[ https://issues.apache.org/jira/browse/SPARK-8370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Santiago M. Mola updated SPARK-8370: Component/s: SQL

Add API for data sources to register databases
Key: SPARK-8370 URL: https://issues.apache.org/jira/browse/SPARK-8370 Project: Spark Issue Type: New Feature Components: SQL Reporter: Santiago M. Mola

This API would allow registering a database with a data source, instead of just a table. Registering a data source database would register all of its tables and keep the catalog updated. The catalog could delegate to the data source lookups of tables in a database registered with this API.
[jira] [Commented] (SPARK-8348) Add in operator to DataFrame Column
[ https://issues.apache.org/jira/browse/SPARK-8348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14585613#comment-14585613 ] Apache Spark commented on SPARK-8348: - User 'yu-iskw' has created a pull request for this issue: https://github.com/apache/spark/pull/6824

Add in operator to DataFrame Column
Key: SPARK-8348 URL: https://issues.apache.org/jira/browse/SPARK-8348 Project: Spark Issue Type: New Feature Components: SparkR, SQL Reporter: Xiangrui Meng

It would be convenient to add an in operator to Column, so we can filter on values in a set.

{code}
df.filter(col("brand").in("dell", "sony"))
{code}

In R, the operator should be `%in%`.
[jira] [Assigned] (SPARK-8348) Add in operator to DataFrame Column
[ https://issues.apache.org/jira/browse/SPARK-8348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8348: --- Assignee: (was: Apache Spark)

Add in operator to DataFrame Column
Key: SPARK-8348 URL: https://issues.apache.org/jira/browse/SPARK-8348 Project: Spark Issue Type: New Feature Components: SparkR, SQL Reporter: Xiangrui Meng

It would be convenient to add an in operator to Column, so we can filter on values in a set.

{code}
df.filter(col("brand").in("dell", "sony"))
{code}

In R, the operator should be `%in%`.
[jira] [Assigned] (SPARK-8348) Add in operator to DataFrame Column
[ https://issues.apache.org/jira/browse/SPARK-8348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8348: --- Assignee: Apache Spark

Add in operator to DataFrame Column
Key: SPARK-8348 URL: https://issues.apache.org/jira/browse/SPARK-8348 Project: Spark Issue Type: New Feature Components: SparkR, SQL Reporter: Xiangrui Meng Assignee: Apache Spark

It would be convenient to add an in operator to Column, so we can filter on values in a set.

{code}
df.filter(col("brand").in("dell", "sony"))
{code}

In R, the operator should be `%in%`.
[jira] [Assigned] (SPARK-8371) improve unit test for MaxOf and MinOf
[ https://issues.apache.org/jira/browse/SPARK-8371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8371: --- Assignee: Apache Spark

improve unit test for MaxOf and MinOf
Key: SPARK-8371 URL: https://issues.apache.org/jira/browse/SPARK-8371 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Wenchen Fan Assignee: Apache Spark
[jira] [Commented] (SPARK-8371) improve unit test for MaxOf and MinOf
[ https://issues.apache.org/jira/browse/SPARK-8371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14585615#comment-14585615 ] Apache Spark commented on SPARK-8371: - User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/6825

improve unit test for MaxOf and MinOf
Key: SPARK-8371 URL: https://issues.apache.org/jira/browse/SPARK-8371 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Wenchen Fan
[jira] [Assigned] (SPARK-8371) improve unit test for MaxOf and MinOf
[ https://issues.apache.org/jira/browse/SPARK-8371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8371: --- Assignee: (was: Apache Spark)

improve unit test for MaxOf and MinOf
Key: SPARK-8371 URL: https://issues.apache.org/jira/browse/SPARK-8371 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Wenchen Fan
[jira] [Created] (SPARK-8372) History server shows incorrect information for application not started
Carson Wang created SPARK-8372: -- Summary: History server shows incorrect information for application not started
Key: SPARK-8372 URL: https://issues.apache.org/jira/browse/SPARK-8372 Project: Spark Issue Type: Bug Components: Deploy, Web UI Affects Versions: 1.4.0 Reporter: Carson Wang Priority: Minor

The history server may show an incorrect App ID, like "App ID.inprogress", for an incomplete application. This app info never disappears, even after the app has completed.
[jira] [Created] (SPARK-8373) When an RDD has no partition, Python sum will throw Can not reduce() empty RDD
Shixiong Zhu created SPARK-8373: --- Summary: When an RDD has no partition, Python sum will throw "Can not reduce() empty RDD"
Key: SPARK-8373 URL: https://issues.apache.org/jira/browse/SPARK-8373 Project: Spark Issue Type: Bug Components: PySpark Reporter: Shixiong Zhu

The issue is that sum is implemented with reduce, which fails on an empty RDD. Replacing it with fold will fix it.
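The fix belongs in PySpark's rdd.py, but the semantics are easiest to see in the Scala RDD API, which mirrors them; an analogy for spark-shell, not the patch itself:

{code}
val empty = sc.emptyRDD[Int]          // an RDD with no partitions

// empty.reduce(_ + _)                // throws UnsupportedOperationException: empty collection

// fold starts from the zero value even when there is nothing to combine,
// so a sum built on fold returns 0 instead of failing:
val total = empty.fold(0)(_ + _)      // 0
{code}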
[jira] [Updated] (SPARK-8372) History server shows incorrect information for application not started
[ https://issues.apache.org/jira/browse/SPARK-8372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Carson Wang updated SPARK-8372: --- Attachment: IncorrectAppInfo.png

History server shows incorrect information for application not started
Key: SPARK-8372 URL: https://issues.apache.org/jira/browse/SPARK-8372 Project: Spark Issue Type: Bug Components: Deploy, Web UI Affects Versions: 1.4.0 Reporter: Carson Wang Priority: Minor Attachments: IncorrectAppInfo.png

The history server may show an incorrect App ID, like "App ID.inprogress", for an incomplete application. This app info never disappears, even after the app has completed.