[jira] [Commented] (SPARK-2172) PySpark cannot import mllib modules in YARN-client mode

2014-08-25 Thread Joao Salcedo (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14108783#comment-14108783
 ] 

Joao Salcedo commented on SPARK-2172:
-

[~piotrszul] Until the fix is available, is there a workaround that I can use in my Python 
script?

 PySpark cannot import mllib modules in YARN-client mode
 ---

 Key: SPARK-2172
 URL: https://issues.apache.org/jira/browse/SPARK-2172
 Project: Spark
  Issue Type: Bug
  Components: MLlib, PySpark, Spark Core, YARN
Affects Versions: 1.0.0, 1.1.0
 Environment: Ubuntu 14.04
 Java 7
 Python 2.7
 CDH 5.0.2 (Hadoop 2.3.0): HDFS, YARN
 Spark 1.0.0 and git master
Reporter: Vlad Frolov
  Labels: mllib, python
 Fix For: 1.0.1, 1.1.0


 Here is the simple reproduce code:
 {noformat}
 $ HADOOP_CONF_DIR=/etc/hadoop/conf MASTER=yarn-client ./bin/pyspark
 {noformat}
 {code:title=issue.py|borderStyle=solid}
  from pyspark.mllib.regression import LabeledPoint
  sc.parallelize([1,2,3]).map(lambda x: LabeledPoint(1, [2])).count()
 {code}
 Note: The same issue occurs with .collect() instead of .count()
 {code:title=TraceBack|borderStyle=solid}
 Py4JJavaError: An error occurred while calling o110.collect.
 : org.apache.spark.SparkException: Job aborted due to stage failure: Task 
 8.0:0 failed 4 times, most recent failure: Exception failure in TID 52 on 
 host ares: org.apache.spark.api.python.PythonException: Traceback (most 
 recent call last):
   File 
 "/mnt/storage/bigisle/yarn/1/yarn/local/usercache/blb/filecache/18/spark-assembly-1.0.0-hadoop2.2.0.jar/pyspark/worker.py",
  line 73, in main
 command = pickleSer._read_with_length(infile)
   File 
 "/mnt/storage/bigisle/yarn/1/yarn/local/usercache/blb/filecache/18/spark-assembly-1.0.0-hadoop2.2.0.jar/pyspark/serializers.py",
  line 146, in _read_with_length
 return self.loads(obj)
 ImportError: No module named mllib.regression
 
 org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:115)
 
 org.apache.spark.api.python.PythonRDD$$anon$1.init(PythonRDD.scala:145)
 org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:78)
 org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
 org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
 org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111)
 org.apache.spark.scheduler.Task.run(Task.scala:51)
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 java.lang.Thread.run(Thread.java:745)
 Driver stacktrace:
 at 
 org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1033)
 at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1017)
 at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1015)
 at 
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
 at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
 at 
 org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1015)
 at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:633)
 at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:633)
 at scala.Option.foreach(Option.scala:236)
 at 
 org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:633)
 at 
 org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1207)
 at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
 at akka.actor.ActorCell.invoke(ActorCell.scala:456)
 at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
 at akka.dispatch.Mailbox.run(Mailbox.scala:219)
 at 
 akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
 at 
 scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
 at 
 scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
 at 
 scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
 at 
 scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
 {code}
 However, this code works as expected:
 {code:title=noissue.py|borderStyle=solid}
  from pyspark.mllib.regression import LabeledPoint
  sc.parallelize([1,2,3]).map(lambda x: LabeledPoint(1, [2])).first()
 {code}

[jira] [Commented] (SPARK-3190) Creation of large graph (> 2.15 B nodes) seems to be broken: possible overflow somewhere

2014-08-25 Thread npanj (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14108794#comment-14108794
 ] 

npanj commented on SPARK-3190:
--

Thanks Ankur for the patch. I can confirm that this pull request fixed the issue.

 Creation of large graph (> 2.15 B nodes) seems to be broken: possible overflow 
 somewhere 
 ---

 Key: SPARK-3190
 URL: https://issues.apache.org/jira/browse/SPARK-3190
 Project: Spark
  Issue Type: Bug
  Components: GraphX
Affects Versions: 1.0.3
 Environment: Standalone mode running on EC2 . Using latest code from 
 master branch upto commit #db56f2df1b8027171da1b8d2571d1f2ef1e103b6 .
Reporter: npanj
Assignee: Ankur Dave
Priority: Critical

 While creating a graph with 6B nodes and 12B edges, I noticed that the 
 'numVertices' api returns an incorrect result; 'numEdges' reports the correct 
 number. A few times (with different datasets of > 2.5B nodes) I have also 
 noticed that numVertices is returned as a negative number, so I suspect that 
 there is some overflow (maybe we are using Int for some field?).
 Here are some details of the experiments I have done so far: 
 1. Input: numNodes=6101995593 ; noEdges=12163784626
Graph returns: numVertices=1807028297 ;  numEdges=12163784626
 2. Input : numNodes=2157586441 ; noEdges=2747322705
Graph Returns: numVertices=-2137380855 ;  numEdges=2747322705
 3. Input: numNodes=1725060105 ; noEdges=204176821
Graph: numVertices=1725060105 ;  numEdges=2041768213
 You can find the code to generate this bug here: 
 https://gist.github.com/npanj/92e949d86d08715bf4bf
 Note: Nodes are labeled 1...6B.
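 The reported values are consistent with 64-bit counts being truncated to 32 bits 
 somewhere. A minimal Scala sketch (illustrative only, not the GraphX code) that 
 reproduces the reported numbers by forcing the input counts through an Int:
 {code:title=overflow-sketch.scala|borderStyle=solid}
 // 64-bit node counts truncated to 32 bits yield exactly the reported numVertices values.
 val inputs = Seq(6101995593L, 2157586441L)
 inputs.foreach { n =>
   val truncated = n.toInt   // keeps only the low 32 bits
   println(s"$n -> $truncated")
 }
 // 6101995593 -> 1807028297    (matches experiment 1)
 // 2157586441 -> -2137380855   (matches the negative value in experiment 2)
 {code}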
  






[jira] [Commented] (SPARK-2805) update akka to version 2.3

2014-08-25 Thread Anand Avati (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14108800#comment-14108800
 ] 

Anand Avati commented on SPARK-2805:


[~pwendell] ping

 update akka to version 2.3
 --

 Key: SPARK-2805
 URL: https://issues.apache.org/jira/browse/SPARK-2805
 Project: Spark
  Issue Type: Sub-task
  Components: Build, Spark Core
Reporter: Anand Avati

 akka-2.3 is the lowest version available for Scala 2.11.
 akka-2.3 depends on protobuf 2.5, while Hadoop-1 requires protobuf 2.4.1. In 
 order to reconcile the conflicting dependencies, we need to release an 
 akka-2.3.x-shaded-protobuf artifact which has protobuf 2.5 shaded within.






[jira] [Created] (SPARK-3197) Reduce the expression tree object creation from the aggregation functions (min/max)

2014-08-25 Thread Cheng Hao (JIRA)
Cheng Hao created SPARK-3197:


 Summary: Reduce the expression tree object creation from the 
aggregation functions (min/max)
 Key: SPARK-3197
 URL: https://issues.apache.org/jira/browse/SPARK-3197
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Cheng Hao
Priority: Minor









[jira] [Created] (SPARK-3196) Expression Evaluation Performance Improvement

2014-08-25 Thread Cheng Hao (JIRA)
Cheng Hao created SPARK-3196:


 Summary: Expression Evaluation Performance Improvement
 Key: SPARK-3196
 URL: https://issues.apache.org/jira/browse/SPARK-3196
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Cheng Hao


Expression id generation depends on an atomic long object internally, which 
causes performance to drop dramatically under multi-threaded execution.

I'd like to create 2 sub-tasks (maybe more) for the improvements:

1) Reduce the expression tree object creation from the aggregation functions 
(min/max), as they create expression trees for each single row.
2) Improve the expression id generation algorithm by not using the AtomicLong.

And remove as much of the expression object creation as possible wherever we 
do expression evaluation. (I will create a couple of sub-tasks soon.)
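
For context, here is a minimal sketch of the eager, AtomicLong-based id-generation 
pattern being described (hypothetical names, not the actual Catalyst source): every 
expression instance pays for a compare-and-swap on a single shared counter, which is 
where the multi-threaded contention comes from.

{code:title=expr-id-sketch.scala|borderStyle=solid}
import java.util.concurrent.atomic.AtomicLong

// Hypothetical stand-ins for Catalyst's expression id machinery, for illustration only.
case class ExprId(id: Long)

object ExprId {
  private val curId = new AtomicLong(0L)
  // Every new expression bumps the shared counter eagerly, so building
  // expression trees per row (e.g. for min/max) hammers this single AtomicLong.
  def newExprId: ExprId = ExprId(curId.getAndIncrement())
}

case class AttributeSketch(name: String, exprId: ExprId = ExprId.newExprId)
{code}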








[jira] [Created] (SPARK-3198) Improve the expression id generation algorithm

2014-08-25 Thread Cheng Hao (JIRA)
Cheng Hao created SPARK-3198:


 Summary: Improve the expression id generation algorithm
 Key: SPARK-3198
 URL: https://issues.apache.org/jira/browse/SPARK-3198
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Cheng Hao


Currently, Catalyst relies on an AtomicLong for the expression id generation 
algorithm, which reduces performance dramatically in a multithreaded environment.






[jira] [Resolved] (SPARK-3193) output error info when Process exit code is not zero

2014-08-25 Thread Prashant Sharma (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prashant Sharma resolved SPARK-3193.


Resolution: Invalid

[Verbatim from my comment on PR  ]

Hey! Thanks for raising this concern.

The convention in Spark is that we look in 
[sub-project]/target/unit-tests.log, and this is applicable to all test suites. 
So when you see a particular test fail on Jenkins, you can rerun that test 
locally and then check that unit-tests.log file for that sub-project.

I hope this helps. You can close this PR if you are convinced.

P.S.: Maybe we can expand our wiki page with this information.


 output error info when Process exit code is not zero
 

 Key: SPARK-3193
 URL: https://issues.apache.org/jira/browse/SPARK-3193
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.0.2
Reporter: wangfei

 I noticed that sometimes PR tests fail due to the Process exit code != 0:
 DriverSuite: 
 Spark assembly has been built with Hive, including Datanucleus jars on 
 classpath 
 - driver should exit after finishing *** FAILED *** 
SparkException was thrown during property evaluation. 
 (DriverSuite.scala:40) 
  Message: Process List(./bin/spark-class, 
 org.apache.spark.DriverWithoutCleanup, local) exited with code 1 
  Occurred at table row 0 (zero based, not counting headings), which had 
 values ( 
master = local 
  ) 
  
 [info] SparkSubmitSuite:
 [info] - prints usage on empty input
 [info] - prints usage with only --help
 [info] - prints error with unrecognized options
 [info] - handle binary specified but not class
 [info] - handles arguments with --key=val
 [info] - handles arguments to user program
 [info] - handles arguments to user program with name collision
 [info] - handles YARN cluster mode
 [info] - handles YARN client mode
 [info] - handles standalone cluster mode
 [info] - handles standalone client mode
 [info] - handles mesos client mode
 [info] - handles confs with flag equivalents
 [info] - launch simple application with spark-submit *** FAILED ***
 [info]   org.apache.spark.SparkException: Process List(./bin/spark-submit, 
 --class, org.apache.spark.deploy.SimpleApplicationTest, --name, testApp, 
 --master, local, file:/tmp/1408854098404-0/testJar-1408854098404.jar) exited 
 with code 1
 [info]   at org.apache.spark.util.Utils$.executeAndGetOutput(Utils.scala:872)
 [info]   at 
 org.apache.spark.deploy.SparkSubmitSuite.runSparkSubmit(SparkSubmitSuite.scala:311)
 [info]   at 
 org.apache.spark.deploy.SparkSubmitSuite$$anonfun$14.apply$mcV$sp(SparkSubmitSuite.scala:291)
 [info]   at 
 org.apache.spark.deploy.SparkSubmitSuite$$anonfun$14.apply(SparkSubmitSuite.scala:284)
 [info]   at org.apac
 Spark assembly has been built with Hive, including Datanucleus jars on classpath
 refer to 
 https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18688/consoleFull
 https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19118/consoleFull
 We should output the process error info when it fails; this can be helpful for 
 diagnosis.
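 For reference, a small sketch of the kind of change being suggested, assuming the 
 child process is launched via scala.sys.process (not necessarily how 
 Utils.executeAndGetOutput is implemented):
 {code:title=capture-stderr-sketch.scala|borderStyle=solid}
 import scala.sys.process._
 import scala.collection.mutable.ArrayBuffer

 // Run a command, capture both streams, and include stderr in the failure message.
 def runAndReport(command: Seq[String]): String = {
   val out = new ArrayBuffer[String]()
   val err = new ArrayBuffer[String]()
   val exitCode = Process(command).!(ProcessLogger(out += _, err += _))
   if (exitCode != 0) {
     throw new RuntimeException(
       s"Process ${command.mkString(" ")} exited with code $exitCode\nstderr:\n${err.mkString("\n")}")
   }
   out.mkString("\n")
 }
 {code}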






[jira] [Commented] (SPARK-3197) Reduce the expression tree object creation from the aggregation functions (min/max)

2014-08-25 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14108869#comment-14108869
 ] 

Apache Spark commented on SPARK-3197:
-

User 'chenghao-intel' has created a pull request for this issue:
https://github.com/apache/spark/pull/2113

 Reduce the expression tree object creation from the aggregation functions 
 (min/max)
 ---

 Key: SPARK-3197
 URL: https://issues.apache.org/jira/browse/SPARK-3197
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Cheng Hao
Priority: Minor








[jira] [Created] (SPARK-3199) native Java spark listener API support

2014-08-25 Thread Chengxiang Li (JIRA)
Chengxiang Li created SPARK-3199:


 Summary: native Java spark listener API support
 Key: SPARK-3199
 URL: https://issues.apache.org/jira/browse/SPARK-3199
 Project: Spark
  Issue Type: New Feature
  Components: Java API
Reporter: Chengxiang Li


The current Spark listener API is totally Scala-style, full of case classes and 
Scala collections; a native Java Spark listener API would be much friendlier for 
Java users.
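
For context, the Scala-facing registration path looks roughly like this today (a 
sketch based on the public SparkListener trait; exact event fields vary by version):

{code:title=scala-listener-sketch.scala|borderStyle=solid}
import org.apache.spark.scheduler.{SparkListener, SparkListenerJobEnd, SparkListenerTaskEnd}

// Scala-style listener: overrides take case-class events carrying Scala collections.
class LoggingListener extends SparkListener {
  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    println(s"Task finished in stage ${taskEnd.stageId}")
  }
  override def onJobEnd(jobEnd: SparkListenerJobEnd): Unit = {
    println(s"Job ${jobEnd.jobId} ended: ${jobEnd.jobResult}")
  }
}

// Registration from a driver program or the shell:
// sc.addSparkListener(new LoggingListener)   // sc: SparkContext
{code}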






[jira] [Updated] (SPARK-2633) enhance spark listener API to gather more spark job information

2014-08-25 Thread Chengxiang Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chengxiang Li updated SPARK-2633:
-

Summary: enhance spark listener API to gather more spark job information  
(was: support register spark listener to listener bus with Java API)

 enhance spark listener API to gather more spark job information
 ---

 Key: SPARK-2633
 URL: https://issues.apache.org/jira/browse/SPARK-2633
 Project: Spark
  Issue Type: New Feature
  Components: Java API
Reporter: Chengxiang Li
Priority: Critical
  Labels: hive
 Attachments: Spark listener enhancement for Hive on Spark job monitor 
 and statistic.docx


 Currently users can only register a Spark listener with the Scala API; we 
 should add this feature to the Java API as well.






[jira] [Updated] (SPARK-2633) enhance spark listener API to gather more spark job information

2014-08-25 Thread Chengxiang Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chengxiang Li updated SPARK-2633:
-

Description: Based on the Hive on Spark job status monitoring and statistics 
collection requirements, try to enhance the Spark listener API to gather more 
Spark job information.  (was: Currently users can only register a Spark listener 
with the Scala API; we should add this feature to the Java API as well.)

 enhance spark listener API to gather more spark job information
 ---

 Key: SPARK-2633
 URL: https://issues.apache.org/jira/browse/SPARK-2633
 Project: Spark
  Issue Type: New Feature
  Components: Java API
Reporter: Chengxiang Li
Priority: Critical
  Labels: hive
 Attachments: Spark listener enhancement for Hive on Spark job monitor 
 and statistic.docx


 Based on the Hive on Spark job status monitoring and statistics collection 
 requirements, try to enhance the Spark listener API to gather more Spark job 
 information.






[jira] [Updated] (SPARK-3198) Generates the expression id while necessary

2014-08-25 Thread Cheng Hao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Hao updated SPARK-3198:
-

Summary: Generates the expression id while necessary  (was: Improve the 
expression id generation algorithm)

 Generates the expression id while necessary
 ---

 Key: SPARK-3198
 URL: https://issues.apache.org/jira/browse/SPARK-3198
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Cheng Hao

 Currently, Catalyst relies on an AtomicLong for the expression id generation 
 algorithm, which reduces performance dramatically in a multithreaded environment.






[jira] [Commented] (SPARK-2633) enhance spark listener API to gather more spark job information

2014-08-25 Thread Chengxiang Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14108887#comment-14108887
 ] 

Chengxiang Li commented on SPARK-2633:
--

I will start to work on this issue. For better isolation, this JIRA will 
focus on the Spark listener API enhancement, and I've created SPARK-3199 to 
track the native Java Spark listener API implementation.

 enhance spark listener API to gather more spark job information
 ---

 Key: SPARK-2633
 URL: https://issues.apache.org/jira/browse/SPARK-2633
 Project: Spark
  Issue Type: New Feature
  Components: Java API
Reporter: Chengxiang Li
Priority: Critical
  Labels: hive
 Attachments: Spark listener enhancement for Hive on Spark job monitor 
 and statistic.docx


 Based on the Hive on Spark job status monitoring and statistics collection 
 requirements, try to enhance the Spark listener API to gather more Spark job 
 information.






[jira] [Updated] (SPARK-3196) Expression Evaluation Performance Improvement

2014-08-25 Thread Cheng Hao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Hao updated SPARK-3196:
-

Description: 
Expression id generation depends on an atomic long object internally, which 
causes performance to drop dramatically under multi-threaded execution.

I'd like to create 2 sub-tasks (maybe more) for the improvements:

1) Reduce the expression tree object creation from the aggregation functions 
(min/max), as they create expression trees for each single row.
2) Improve the expression id generation algorithm, by not using the AtomicLong, 
or by generating the expression id only when necessary.

And remove as much of the expression object creation as possible wherever we 
do expression evaluation. (I will create a couple of sub-tasks soon.)



  was:
The expression id generations depend on a atomic long object internally, which 
will cause the performance drop dramatically in a multi-threading execution.

I'd like to create 2 sub tasks(maybe more) for the improvements:

1) Reduce the expression tree object creation from the aggregation functions 
(min/max), as they will create expression trees for each single row.
2) Improve the expression id generation algorithm, by not using the AtomicLong.

And remove the expression object creation as many as possible, where we have 
the expression evaluation. (I will create couple of subtask soon).




 Expression Evaluation Performance Improvement
 -

 Key: SPARK-3196
 URL: https://issues.apache.org/jira/browse/SPARK-3196
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Cheng Hao

 Expression id generation depends on an atomic long object internally, 
 which causes performance to drop dramatically under multi-threaded 
 execution.
 I'd like to create 2 sub-tasks (maybe more) for the improvements:
 1) Reduce the expression tree object creation from the aggregation functions 
 (min/max), as they create expression trees for each single row.
 2) Improve the expression id generation algorithm, by not using the 
 AtomicLong, or by generating the expression id only when necessary.
 And remove as much of the expression object creation as possible wherever we 
 do expression evaluation. (I will create a couple of sub-tasks soon.)






[jira] [Commented] (SPARK-3198) Generates the expression id while necessary

2014-08-25 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14108891#comment-14108891
 ] 

Apache Spark commented on SPARK-3198:
-

User 'chenghao-intel' has created a pull request for this issue:
https://github.com/apache/spark/pull/2114

 Generates the expression id while necessary
 ---

 Key: SPARK-3198
 URL: https://issues.apache.org/jira/browse/SPARK-3198
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Cheng Hao

 Currently, Catalyst relies on an AtomicLong for the expression id generation 
 algorithm, which reduces performance dramatically in a multithreaded environment.






[jira] [Commented] (SPARK-3198) Generates the expression id while necessary

2014-08-25 Thread Cheng Hao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14108894#comment-14108894
 ] 

Cheng Hao commented on SPARK-3198:
--

Usually, we need the expression id during logical plan analysis, not during 
evaluation, hence we can get a significant improvement by doing this.
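
Read that way, the change amounts to making the id lazy. A rough sketch of the 
idea (illustrative only, not the actual pull request):

{code:title=lazy-expr-id-sketch.scala|borderStyle=solid}
import java.util.concurrent.atomic.AtomicLong

case class ExprId(id: Long)

object ExprId {
  private val curId = new AtomicLong(0L)
  def next(): ExprId = ExprId(curId.getAndIncrement())
}

// Hypothetical attribute: the id is only materialized the first time the
// analyzer asks for it, so expressions built purely for evaluation
// (e.g. per-row min/max trees) never touch the shared counter.
class LazyAttributeSketch(val name: String) {
  lazy val exprId: ExprId = ExprId.next()
}
{code}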

 Generates the expression id while necessary
 ---

 Key: SPARK-3198
 URL: https://issues.apache.org/jira/browse/SPARK-3198
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Cheng Hao

 Currently, Catalyst relies on an AtomicLong for the expression id generation 
 algorithm, which reduces performance dramatically in a multithreaded environment.






[jira] [Commented] (SPARK-3173) Timestamp support in the parser

2014-08-25 Thread Teng Qiu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14108936#comment-14108936
 ] 

Teng Qiu commented on SPARK-3173:
-

Voting for this ticket; it seems it should be linked to this PR: 
https://github.com/apache/spark/pull/2084

 Timestamp support in the parser
 ---

 Key: SPARK-3173
 URL: https://issues.apache.org/jira/browse/SPARK-3173
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.0.2, 1.1.0
Reporter: Zdenek Farana

 If you have a table with a TIMESTAMP column, that column can't be used in a 
 WHERE clause properly - it is not evaluated properly.
 For example, SELECT * FROM a WHERE timestamp='2014-08-21 00:00:00.0' would return 
 nothing even if there were a row with such a timestamp. The literal is 
 not interpreted as a timestamp.
 The workaround SELECT * FROM a WHERE timestamp=CAST('2014-08-21 00:00:00.0' 
 AS TIMESTAMP) fails, because the parser does not allow anything but STRING in 
 the CAST dataType expression.






[jira] [Commented] (SPARK-2620) case class cannot be used as key for reduce

2014-08-25 Thread Prashant Sharma (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14108958#comment-14108958
 ] 

Prashant Sharma commented on SPARK-2620:


I just tried these test code snippets in the Spark REPL (built from master), and 
they pass with the expected results.

 case class cannot be used as key for reduce
 ---

 Key: SPARK-2620
 URL: https://issues.apache.org/jira/browse/SPARK-2620
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.0
 Environment: reproduced on spark-shell local[4]
Reporter: Gerard Maas
Priority: Critical
  Labels: case-class, core

 Using a case class as a key doesn't seem to work properly on Spark 1.0.0
 A minimal example:
 case class P(name: String)
 val ps = Array(P("alice"), P("bob"), P("charly"), P("bob"))
 sc.parallelize(ps).map(x => (x,1)).reduceByKey((x,y) => x+y).collect
 [Spark shell local mode] res : Array[(P, Int)] = Array((P(bob),1), 
 (P(bob),1), (P(abe),1), (P(charly),1))
 In contrast to the expected behavior, that should be equivalent to:
 sc.parallelize(ps).map(x => (x.name,1)).reduceByKey((x,y) => x+y).collect
 Array[(String, Int)] = Array((charly,1), (abe,1), (bob,2))
 groupByKey and distinct also present the same behavior.






[jira] [Created] (SPARK-3200) Class defined with reference to external variables crashes in REPL.

2014-08-25 Thread Prashant Sharma (JIRA)
Prashant Sharma created SPARK-3200:
--

 Summary: Class defined with reference to external variables 
crashes in REPL.
 Key: SPARK-3200
 URL: https://issues.apache.org/jira/browse/SPARK-3200
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.1.0
Reporter: Prashant Sharma


Reproducer:
{noformat}
val a = sc.textFile("README.md").count
case class A(i: Int) { val j = a }
sc.parallelize(1 to 10).map(A(_)).collect()
{noformat}
This happens when one refers to something that refers to sc, and not otherwise. 
There are many ways to work around this, like directly assigning a constant value 
instead of referring to the variable.
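For example, the constant-value workaround would look roughly like this (a sketch 
only; 42L is an arbitrary placeholder for the value you would otherwise compute):
{noformat}
val a = sc.textFile("README.md").count
// Bake a literal into the class body instead of referring to `a`, so the class
// does not drag in the REPL line object that transitively references sc.
case class A(i: Int) { val j = 42L }
sc.parallelize(1 to 10).map(A(_)).collect()
{noformat}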






[jira] [Created] (SPARK-3201) Yarn Client do not support the -X java opts

2014-08-25 Thread hzw (JIRA)
hzw created SPARK-3201:
--

 Summary: Yarn Client do not support the -X java opts
 Key: SPARK-3201
 URL: https://issues.apache.org/jira/browse/SPARK-3201
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.0.2
Reporter: hzw


In yarn-client mode, it's not allowed to set 
spark.driver.extraJavaOptions.
I think it's very inconvenient if we want to set -X java opts for the 
ExecutorLauncher process.







[jira] [Created] (SPARK-3203) ClassNotFound Exception

2014-08-25 Thread Rohit Kumar (JIRA)
Rohit Kumar created SPARK-3203:
--

 Summary: ClassNotFound Exception
 Key: SPARK-3203
 URL: https://issues.apache.org/jira/browse/SPARK-3203
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.2
 Environment: Ubuntu 12.04, openjdk 64 bit 7u65
Reporter: Rohit Kumar


I am using Spark as a processing engine over Cassandra. I have only one 
master and one worker node.

I am executing the following code in spark-shell:

sc.stop
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
import com.datastax.spark.connector._

val conf = new SparkConf(true).set("spark.cassandra.connection.host", "127.0.0.1")
val sc = new SparkContext("spark://L-BXP44Z1:7077", "Cassandra Connector Test", conf)
val rdd = sc.cassandraTable("test", "kv")
println(rdd.map(_.getInt("value")).sum)

I am getting the following error:


14/08/25 18:47:17 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another 
address
14/08/25 18:49:39 INFO CoarseGrainedExecutorBackend: Got assigned task 0
14/08/25 18:49:39 INFO Executor: Running task ID 0
14/08/25 18:49:39 ERROR Executor: Exception in task ID 0
java.lang.ClassNotFoundException: 
$line29.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:270)
at 
org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:60)
at 
java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1612)
at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1517)
at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1771)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at 
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at 
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
at scala.collection.immutable.$colon$colon.readObject(List.scala:362)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at 
java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at 
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
at 
org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:63)
at 
org.apache.spark.scheduler.ResultTask$.deserializeInfo(ResultTask.scala:61)
at 
org.apache.spark.scheduler.ResultTask.readExternal(ResultTask.scala:141)
at 
java.io.ObjectInputStream.readExternalData(ObjectInputStream.java:1837)
at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1796)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
at 
org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:63)
at 
org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:85)
at 

[jira] [Created] (SPARK-3202) Manipulating columns in CSV file or Transpose of Array[Array[String]] RDD

2014-08-25 Thread Hingorani, Vineet (JIRA)
Hingorani, Vineet created SPARK-3202:


 Summary: Manipulating columns in CSV file or Transpose of 
Array[Array[String]] RDD
 Key: SPARK-3202
 URL: https://issues.apache.org/jira/browse/SPARK-3202
 Project: Spark
  Issue Type: Documentation
  Components: Documentation
Reporter: Hingorani, Vineet


Hello all,

Could someone help me with the manipulation of CSV file data? I have 
semicolon-separated CSV data including doubles and strings. I want to 
calculate the maximum/average of a column. When I read the file using 
sc.textFile("test.csv").map(_.split(";")), each field is read as a string. Could 
someone help me with the above manipulation and how to do that?

Or maybe there is some way to take the transpose of the data and then 
manipulate the rows in some way?

Thank you in advance; I have been struggling with this for quite some time.

Regards,
Vineet
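
For what it's worth, a rough sketch of one way to do the column aggregation with 
the plain RDD API (the column index 2 is only an example; adjust it and the casts 
to the real schema):

{code:title=csv-column-sketch.scala|borderStyle=solid}
// Parse semicolon-separated lines and aggregate one numeric column.
val rows = sc.textFile("test.csv").map(_.split(";"))
val col = rows.map(fields => fields(2).toDouble)        // hypothetical numeric column

val maxValue = col.reduce((a, b) => math.max(a, b))     // maximum of the column
val average  = col.reduce(_ + _) / col.count()          // average of the column
{code}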






[jira] [Closed] (SPARK-3202) Manipulating columns in CSV file or Transpose of Array[Array[String]] RDD

2014-08-25 Thread Hingorani, Vineet (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hingorani, Vineet closed SPARK-3202.


Resolution: Invalid

 Manipulating columns in CSV file or Transpose of Array[Array[String]] RDD
 -

 Key: SPARK-3202
 URL: https://issues.apache.org/jira/browse/SPARK-3202
 Project: Spark
  Issue Type: Documentation
  Components: Documentation
Reporter: Hingorani, Vineet

 Hello all,
 Could someone help me with the manipulation of csv file data. I have 
 'semicolon' separated csv data including doubles and strings. I want to 
 calculate the maximum/average of a column. When I read the file using 
 sc.textFile(test.csv).map(_.split(;), each field is read as string. Could 
 someone help me with the above manipulation and how to do that.
 Or may be if there is some way to take the transpose of the data and then 
 manipulating the rows in some way?
 Thank you in advance, I am struggling with this thing for quite sometime
 Regards,
 Vineet






[jira] [Commented] (SPARK-3202) Manipulating columns in CSV file or Transpose of Array[Array[String]] RDD

2014-08-25 Thread Hingorani, Vineet (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14109129#comment-14109129
 ] 

Hingorani, Vineet commented on SPARK-3202:
--

Thank you, Sean, for the help regarding the platform. :)

 Manipulating columns in CSV file or Transpose of Array[Array[String]] RDD
 -

 Key: SPARK-3202
 URL: https://issues.apache.org/jira/browse/SPARK-3202
 Project: Spark
  Issue Type: Documentation
  Components: Documentation
Reporter: Hingorani, Vineet

 Hello all,
 Could someone help me with the manipulation of csv file data. I have 
 'semicolon' separated csv data including doubles and strings. I want to 
 calculate the maximum/average of a column. When I read the file using 
 sc.textFile(test.csv).map(_.split(;), each field is read as string. Could 
 someone help me with the above manipulation and how to do that.
 Or may be if there is some way to take the transpose of the data and then 
 manipulating the rows in some way?
 Thank you in advance, I am struggling with this thing for quite sometime
 Regards,
 Vineet






[jira] [Commented] (SPARK-3202) Manipulating columns in CSV file or Transpose of Array[Array[String]] RDD

2014-08-25 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14109128#comment-14109128
 ] 

Sean Owen commented on SPARK-3202:
--

JIRA is not a good place to ask questions -- please use u...@spark.apache.org. 
This is for reporting issues, so I'd recommend closing this.

 Manipulating columns in CSV file or Transpose of Array[Array[String]] RDD
 -

 Key: SPARK-3202
 URL: https://issues.apache.org/jira/browse/SPARK-3202
 Project: Spark
  Issue Type: Documentation
  Components: Documentation
Reporter: Hingorani, Vineet

 Hello all,
 Could someone help me with the manipulation of csv file data. I have 
 'semicolon' separated csv data including doubles and strings. I want to 
 calculate the maximum/average of a column. When I read the file using 
 sc.textFile(test.csv).map(_.split(;), each field is read as string. Could 
 someone help me with the above manipulation and how to do that.
 Or may be if there is some way to take the transpose of the data and then 
 manipulating the rows in some way?
 Thank you in advance, I am struggling with this thing for quite sometime
 Regards,
 Vineet






[jira] [Commented] (SPARK-3201) Yarn Client do not support the -X java opts

2014-08-25 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14109131#comment-14109131
 ] 

Apache Spark commented on SPARK-3201:
-

User 'hzw19900416' has created a pull request for this issue:
https://github.com/apache/spark/pull/2115

 Yarn Client do not support the -X java opts
 -

 Key: SPARK-3201
 URL: https://issues.apache.org/jira/browse/SPARK-3201
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.0.2
Reporter: hzw

 In yarn-client mode, it's not allowed to set the 
 spark.driver.extraJavaOptions .
 I think it's very inconvenient if we want to set the -X java opts in the 
 process of ExecutorLauncher.






[jira] [Created] (SPARK-3204) MaxOf would be foldable if both left and right are foldable.

2014-08-25 Thread Takuya Ueshin (JIRA)
Takuya Ueshin created SPARK-3204:


 Summary: MaxOf would be foldable if both left and right are 
foldable.
 Key: SPARK-3204
 URL: https://issues.apache.org/jira/browse/SPARK-3204
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Takuya Ueshin
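
The ticket body is empty, but the summary suggests simply propagating foldability 
from the children. A toy sketch of the idea (hypothetical simplified types, not 
the Catalyst source):

{code:title=foldable-sketch.scala|borderStyle=solid}
sealed trait Expr { def foldable: Boolean }
case class Literal(value: Any) extends Expr { val foldable = true }

case class MaxOf(left: Expr, right: Expr) extends Expr {
  // Proposed behaviour per the summary: MaxOf is foldable
  // whenever both of its children are foldable.
  val foldable: Boolean = left.foldable && right.foldable
}
{code}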









[jira] [Updated] (SPARK-3206) Error in PageRank values

2014-08-25 Thread Peter Fontana (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Fontana updated SPARK-3206:
-

Description: 
I have found a small example where the PageRank values using run and 
runUntilConvergence differ quite a bit.

I am running the Pagerank module on the following graph:

Edge Table:

| Node1  | Node2  |
|1 | 2 |
|1 |3|
|3 |2|
|3 |4|
|5 |3|
|6 |7|
|7 |8|
|8 |9|
|9 |7|

Node Table (note the extra node):

| NodeID  | NodeName  |
|a |1|
|b |2|
|c |3|
|d |4|
|e |5|
|f |6|
|g |7|
|h |8|
|i |9|
|j.longaddress.com |10|

with a default resetProb of 0.15.
When I compute the pageRank with runUntilConvergence, running 

```
 val ranks = PageRank.runUntilConvergence(graph,0.0001).vertices
```
I get the ranks
(4,0.295031247)
(1,0.15)
(6,0.15)
(3,0.341249994)
(7,1.3299054047985106)
(9,1.2381240056453071)
(8,1.2803346052504254)
(10,0.15)
(5,0.15)
(2,0.358781244)

However, when I run page Rank with the run() method, running  val ranksI = 
PageRank.run(graph,100).vertices I get the page ranks

(4,0.295031247)
(1,0.15)
(6,0.15)
(3,0.341249994)
(7,0.999387662847)
(9,0.999256447741)
(8,0.999256447741)
(10,0.15)
(5,0.15)
(2,0.295031247)

These are quite different, leading me to suspect that one of the PageRank 
methods is incorrect. I have examined the source, but I do not know what the 
correct fix is, or which set of values is correct.

  was:
I have found a small example where the PageRank values using run and 
runUntilConvergence differ quite a bit.

I am running the Pagerank module on the following graph:

Edge Table:

| Node1  | Node2  |
|1 | 2 |
|1 |3|
3 | 2
3 | 4
5 | 3
6 | 7
7 | 8
8 | 9
9 | 7

Node Table (note the extra node):

| NodeID  | NodeName  |
| - | - |
a | 1
b | 2
c | 3
d | 4
e | 5
f | 6
g | 7
h | 8
i | 9
j.longaddress.com | 10

with a default resetProb of 0.15.
When I compute the pageRank with runUntilConvergence, running  val ranks = 
PageRank.runUntilConvergence(graph,0.0001).vertices

I get the ranks
(4,0.295031247)
(1,0.15)
(6,0.15)
(3,0.341249994)
(7,1.3299054047985106)
(9,1.2381240056453071)
(8,1.2803346052504254)
(10,0.15)
(5,0.15)
(2,0.358781244)

However, when I run page Rank with the run() method, running  val ranksI = 
PageRank.run(graph,100).vertices I get the page ranks

(4,0.295031247)
(1,0.15)
(6,0.15)
(3,0.341249994)
(7,0.999387662847)
(9,0.999256447741)
(8,0.999256447741)
(10,0.15)
(5,0.15)
(2,0.295031247)

These are quite different, leading me to suspect that one of the PageRank 
methods is incorrect. I have examined the source, but I do not know what the 
correct fix is, or which set of values is correct.


 Error in PageRank values
 

 Key: SPARK-3206
 URL: https://issues.apache.org/jira/browse/SPARK-3206
 Project: Spark
  Issue Type: Bug
  Components: GraphX
Affects Versions: 1.0.2
 Environment: UNIX with Hadoop
Reporter: Peter Fontana

 I have found a small example where the PageRank values using run and 
 runUntilConvergence differ quite a bit.
 I am running the Pagerank module on the following graph:
 Edge Table:
 | Node1  | Node2  |
 |1 | 2 |
 |1 |  3|
 |3 |  2|
 |3 |  4|
 |5 |  3|
 |6 |  7|
 |7 |  8|
 |8 |  9|
 |9 |  7|
 Node Table (note the extra node):
 | NodeID  | NodeName  |
 |a |  1|
 |b |  2|
 |c |  3|
 |d |  4|
 |e |  5|
 |f |  6|
 |g |  7|
 |h |  8|
 |i |  9|
 |j.longaddress.com |  10|
 with a default resetProb of 0.15.
 When I compute the pageRank with runUntilConvergence, running 
 ```
  val ranks = PageRank.runUntilConvergence(graph,0.0001).vertices
 ```
 I get the ranks
 (4,0.295031247)
 (1,0.15)
 (6,0.15)
 (3,0.341249994)
 (7,1.3299054047985106)
 (9,1.2381240056453071)
 (8,1.2803346052504254)
 (10,0.15)
 (5,0.15)
 (2,0.358781244)
 However, when I run page Rank with the run() method, running  val ranksI = 
 PageRank.run(graph,100).vertices I get the page ranks
 (4,0.295031247)
 (1,0.15)
 (6,0.15)
 (3,0.341249994)
 (7,0.999387662847)
 (9,0.999256447741)
 (8,0.999256447741)
 (10,0.15)
 (5,0.15)
 (2,0.295031247)
 These are quite different, leading me to suspect that one of the PageRank 
 methods is incorrect. I have examined the source, but I do not know what the 
 correct fix is, or which set of values is correct.






[jira] [Updated] (SPARK-3206) Error in PageRank values

2014-08-25 Thread Peter Fontana (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Fontana updated SPARK-3206:
-

Description: 
I have found a small example where the PageRank values using run and 
runUntilConvergence differ quite a bit.

I am running the Pagerank module on the following graph:

Edge Table:

| |Node1|||Node2 | |
|1 | 2 |
|1 |3|
|3 |2|
|3 |4|
|5 |3|
|6 |7|
|7 |8|
|8 |9|
|9 |7|

Node Table (note the extra node):

|| NodeID  || NodeName  ||
|a |1|
|b |2|
|c |3|
|d |4|
|e |5|
|f |6|
|g |7|
|h |8|
|i |9|
|j.longaddress.com |10|

with a default resetProb of 0.15.
When I compute the pageRank with runUntilConvergence, running 

{{val ranks = PageRank.runUntilConvergence(graph,0.0001).vertices}}

I get the ranks
(4,0.295031247)
(1,0.15)
(6,0.15)
(3,0.341249994)
(7,1.3299054047985106)
(9,1.2381240056453071)
(8,1.2803346052504254)
(10,0.15)
(5,0.15)
(2,0.358781244)

However, when I run page Rank with the run() method, running  

{{val ranksI = PageRank.run(graph,100).vertices}} 

I get the page ranks

(4,0.295031247)
(1,0.15)
(6,0.15)
(3,0.341249994)
(7,0.999387662847)
(9,0.999256447741)
(8,0.999256447741)
(10,0.15)
(5,0.15)
(2,0.295031247)

These are quite different, leading me to suspect that one of the PageRank 
methods is incorrect. I have examined the source, but I do not know what the 
correct fix is, or which set of values is correct.

  was:
I have found a small example where the PageRank values using run and 
runUntilConvergence differ quite a bit.

I am running the Pagerank module on the following graph:

Edge Table:

| |Node1|||Node2 | |
|1 | 2 |
|1 |3|
|3 |2|
|3 |4|
|5 |3|
|6 |7|
|7 |8|
|8 |9|
|9 |7|

Node Table (note the extra node):

|| NodeID  || NodeName  ||
|a |1|
|b |2|
|c |3|
|d |4|
|e |5|
|f |6|
|g |7|
|h |8|
|i |9|
|j.longaddress.com |10|

with a default resetProb of 0.15.
When I compute the pageRank with runUntilConvergence, running 

{{ val ranks = PageRank.runUntilConvergence(graph,0.0001).vertices}}

I get the ranks
(4,0.295031247)
(1,0.15)
(6,0.15)
(3,0.341249994)
(7,1.3299054047985106)
(9,1.2381240056453071)
(8,1.2803346052504254)
(10,0.15)
(5,0.15)
(2,0.358781244)

However, when I run page Rank with the run() method, running  

{{val ranksI = PageRank.run(graph,100).vertices}} 

I get the page ranks

(4,0.295031247)
(1,0.15)
(6,0.15)
(3,0.341249994)
(7,0.999387662847)
(9,0.999256447741)
(8,0.999256447741)
(10,0.15)
(5,0.15)
(2,0.295031247)

These are quite different, leading me to suspect that one of the PageRank 
methods is incorrect. I have examined the source, but I do not know what the 
correct fix is, or which set of values is correct.


 Error in PageRank values
 

 Key: SPARK-3206
 URL: https://issues.apache.org/jira/browse/SPARK-3206
 Project: Spark
  Issue Type: Bug
  Components: GraphX
Affects Versions: 1.0.2
 Environment: UNIX with Hadoop
Reporter: Peter Fontana

 I have found a small example where the PageRank values using run and 
 runUntilConvergence differ quite a bit.
 I am running the Pagerank module on the following graph:
 Edge Table:
 | |Node1|||Node2 | |
 |1 | 2 |
 |1 |  3|
 |3 |  2|
 |3 |  4|
 |5 |  3|
 |6 |  7|
 |7 |  8|
 |8 |  9|
 |9 |  7|
 Node Table (note the extra node):
 || NodeID  || NodeName  ||
 |a |  1|
 |b |  2|
 |c |  3|
 |d |  4|
 |e |  5|
 |f |  6|
 |g |  7|
 |h |  8|
 |i |  9|
 |j.longaddress.com |  10|
 with a default resetProb of 0.15.
 When I compute the pageRank with runUntilConvergence, running 
 {{val ranks = PageRank.runUntilConvergence(graph,0.0001).vertices}}
 I get the ranks
 (4,0.295031247)
 (1,0.15)
 (6,0.15)
 (3,0.341249994)
 (7,1.3299054047985106)
 (9,1.2381240056453071)
 (8,1.2803346052504254)
 (10,0.15)
 (5,0.15)
 (2,0.358781244)
 However, when I run page Rank with the run() method, running  
 {{val ranksI = PageRank.run(graph,100).vertices}} 
 I get the page ranks
 (4,0.295031247)
 (1,0.15)
 (6,0.15)
 (3,0.341249994)
 (7,0.999387662847)
 (9,0.999256447741)
 (8,0.999256447741)
 (10,0.15)
 (5,0.15)
 (2,0.295031247)
 These are quite different, leading me to suspect that one of the PageRank 
 methods is incorrect. I have examined the source, but I do not know what the 
 correct fix is, or which set of values is correct.






[jira] [Updated] (SPARK-3206) Error in PageRank values

2014-08-25 Thread Peter Fontana (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Fontana updated SPARK-3206:
-

Description: 
I have found a small example where the PageRank values using run and 
runUntilConvergence differ quite a bit.

I am running the Pagerank module on the following graph:

Edge Table:

| |Node1|||Node2 | |
|1 | 2 |
|1 |3|
|3 |2|
|3 |4|
|5 |3|
|6 |7|
|7 |8|
|8 |9|
|9 |7|

Node Table (note the extra node):

|| NodeID  || NodeName  ||
|a |1|
|b |2|
|c |3|
|d |4|
|e |5|
|f |6|
|g |7|
|h |8|
|i |9|
|j.longaddress.com |10|

with a default resetProb of 0.15.
When I compute the pageRank with runUntilConvergence, running 

{{ val ranks = PageRank.runUntilConvergence(graph,0.0001).vertices}}

I get the ranks
(4,0.295031247)
(1,0.15)
(6,0.15)
(3,0.341249994)
(7,1.3299054047985106)
(9,1.2381240056453071)
(8,1.2803346052504254)
(10,0.15)
(5,0.15)
(2,0.358781244)

However, when I run page Rank with the run() method, running  

{{val ranksI = PageRank.run(graph,100).vertices}} 

I get the page ranks

(4,0.295031247)
(1,0.15)
(6,0.15)
(3,0.341249994)
(7,0.999387662847)
(9,0.999256447741)
(8,0.999256447741)
(10,0.15)
(5,0.15)
(2,0.295031247)

These are quite different, leading me to suspect that one of the PageRank 
methods is incorrect. I have examined the source, but I do not know what the 
correct fix is, or which set of values is correct.

  was:
I have found a small example where the PageRank values using run and 
runUntilConvergence differ quite a bit.

I am running the Pagerank module on the following graph:

Edge Table:

| Node1  | Node2  |
|1 | 2 |
|1 |3|
|3 |2|
|3 |4|
|5 |3|
|6 |7|
|7 |8|
|8 |9|
|9 |7|

Node Table (note the extra node):

| NodeID  | NodeName  |
|a |1|
|b |2|
|c |3|
|d |4|
|e |5|
|f |6|
|g |7|
|h |8|
|i |9|
|j.longaddress.com |10|

with a default resetProb of 0.15.
When I compute the pageRank with runUntilConvergence, running 

```
 val ranks = PageRank.runUntilConvergence(graph,0.0001).vertices
```
I get the ranks
(4,0.295031247)
(1,0.15)
(6,0.15)
(3,0.341249994)
(7,1.3299054047985106)
(9,1.2381240056453071)
(8,1.2803346052504254)
(10,0.15)
(5,0.15)
(2,0.358781244)

However, when I run page Rank with the run() method, running  val ranksI = 
PageRank.run(graph,100).vertices I get the page ranks

(4,0.295031247)
(1,0.15)
(6,0.15)
(3,0.341249994)
(7,0.999387662847)
(9,0.999256447741)
(8,0.999256447741)
(10,0.15)
(5,0.15)
(2,0.295031247)

These are quite different, leading me to suspect that one of the PageRank 
methods is incorrect. I have examined the source, but I do not know what the 
correct fix is, or which set of values is correct.


 Error in PageRank values
 

 Key: SPARK-3206
 URL: https://issues.apache.org/jira/browse/SPARK-3206
 Project: Spark
  Issue Type: Bug
  Components: GraphX
Affects Versions: 1.0.2
 Environment: UNIX with Hadoop
Reporter: Peter Fontana

 I have found a small example where the PageRank values using run and 
 runUntilConvergence differ quite a bit.
 I am running the Pagerank module on the following graph:
 Edge Table:
 | |Node1|||Node2 | |
 |1 | 2 |
 |1 |  3|
 |3 |  2|
 |3 |  4|
 |5 |  3|
 |6 |  7|
 |7 |  8|
 |8 |  9|
 |9 |  7|
 Node Table (note the extra node):
 || NodeID  || NodeName  ||
 |a |  1|
 |b |  2|
 |c |  3|
 |d |  4|
 |e |  5|
 |f |  6|
 |g |  7|
 |h |  8|
 |i |  9|
 |j.longaddress.com |  10|
 with a default resetProb of 0.15.
 When I compute the pageRank with runUntilConvergence, running 
 {{ val ranks = PageRank.runUntilConvergence(graph,0.0001).vertices}}
 I get the ranks
 (4,0.295031247)
 (1,0.15)
 (6,0.15)
 (3,0.341249994)
 (7,1.3299054047985106)
 (9,1.2381240056453071)
 (8,1.2803346052504254)
 (10,0.15)
 (5,0.15)
 (2,0.358781244)
 However, when I run page Rank with the run() method, running  
 {{val ranksI = PageRank.run(graph,100).vertices}} 
 I get the page ranks
 (4,0.295031247)
 (1,0.15)
 (6,0.15)
 (3,0.341249994)
 (7,0.999387662847)
 (9,0.999256447741)
 (8,0.999256447741)
 (10,0.15)
 (5,0.15)
 (2,0.295031247)
 These are quite different, leading me to suspect that one of the PageRank 
 methods is incorrect. I have examined the source, but I do not know what the 
 correct fix is, or which set of values is correct.






[jira] [Updated] (SPARK-3206) Error in PageRank values

2014-08-25 Thread Peter Fontana (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Fontana updated SPARK-3206:
-

Description: 
I have found a small example where the PageRank values using run and 
runUntilConvergence differ quite a bit.

I am running the Pagerank module on the following graph:

Edge Table:

|| Node1 || Node2 ||
|1 | 2 |
|1 |3|
|3 |2|
|3 |4|
|5 |3|
|6 |7|
|7 |8|
|8 |9|
|9 |7|

Node Table (note the extra node):

|| NodeID  || NodeName  ||
|a |1|
|b |2|
|c |3|
|d |4|
|e |5|
|f |6|
|g |7|
|h |8|
|i |9|
|j.longaddress.com |10|

with a default resetProb of 0.15.
When I compute the pageRank with runUntilConvergence, running 

{{val ranks = PageRank.runUntilConvergence(graph,0.0001).vertices}}

I get the ranks
(4,0.295031247)
(1,0.15)
(6,0.15)
(3,0.341249994)
(7,1.3299054047985106)
(9,1.2381240056453071)
(8,1.2803346052504254)
(10,0.15)
(5,0.15)
(2,0.358781244)

However, when I run page Rank with the run() method, running  

{{val ranksI = PageRank.run(graph,100).vertices}} 

I get the page ranks

(4,0.295031247)
(1,0.15)
(6,0.15)
(3,0.341249994)
(7,0.999387662847)
(9,0.999256447741)
(8,0.999256447741)
(10,0.15)
(5,0.15)
(2,0.295031247)

These are quite different, leading me to suspect that one of the PageRank 
methods is incorrect. I have examined the source, but I do not know what the 
correct fix is, or which set of values is correct.

  was:
I have found a small example where the PageRank values using run and 
runUntilConvergence differ quite a bit.

I am running the Pagerank module on the following graph:

Edge Table:

| |Node1|||Node2 | |
|1 | 2 |
|1 |3|
|3 |2|
|3 |4|
|5 |3|
|6 |7|
|7 |8|
|8 |9|
|9 |7|

Node Table (note the extra node):

|| NodeID  || NodeName  ||
|a |1|
|b |2|
|c |3|
|d |4|
|e |5|
|f |6|
|g |7|
|h |8|
|i |9|
|j.longaddress.com |10|

with a default resetProb of 0.15.
When I compute the pageRank with runUntilConvergence, running 

{{val ranks = PageRank.runUntilConvergence(graph,0.0001).vertices}}

I get the ranks
(4,0.295031247)
(1,0.15)
(6,0.15)
(3,0.341249994)
(7,1.3299054047985106)
(9,1.2381240056453071)
(8,1.2803346052504254)
(10,0.15)
(5,0.15)
(2,0.358781244)

However, when I run page Rank with the run() method, running  

{{val ranksI = PageRank.run(graph,100).vertices}} 

I get the page ranks

(4,0.295031247)
(1,0.15)
(6,0.15)
(3,0.341249994)
(7,0.999387662847)
(9,0.999256447741)
(8,0.999256447741)
(10,0.15)
(5,0.15)
(2,0.295031247)

These are quite different, leading me to suspect that one of the PageRank 
methods is incorrect. I have examined the source, but I do not know what the 
correct fix is, or which set of values is correct.


 Error in PageRank values
 

 Key: SPARK-3206
 URL: https://issues.apache.org/jira/browse/SPARK-3206
 Project: Spark
  Issue Type: Bug
  Components: GraphX
Affects Versions: 1.0.2
 Environment: UNIX with Hadoop
Reporter: Peter Fontana

 I have found a small example where the PageRank values using run and 
 runUntilConvergence differ quite a bit.
 I am running the PageRank module on the following graph:
 Edge Table:
 || Node1 || Node2 ||
 |1 | 2 |
 |1 |  3|
 |3 |  2|
 |3 |  4|
 |5 |  3|
 |6 |  7|
 |7 |  8|
 |8 |  9|
 |9 |  7|
 Node Table (note the extra node):
 || NodeID  || NodeName  ||
 |a |  1|
 |b |  2|
 |c |  3|
 |d |  4|
 |e |  5|
 |f |  6|
 |g |  7|
 |h |  8|
 |i |  9|
 |j.longaddress.com |  10|
 with a default resetProb of 0.15.
 When I compute PageRank with runUntilConvergence, running 
 {{val ranks = PageRank.runUntilConvergence(graph,0.0001).vertices}}
 I get the ranks
 (4,0.295031247)
 (1,0.15)
 (6,0.15)
 (3,0.341249994)
 (7,1.3299054047985106)
 (9,1.2381240056453071)
 (8,1.2803346052504254)
 (10,0.15)
 (5,0.15)
 (2,0.358781244)
 However, when I run PageRank with the run() method, running  
 {{val ranksI = PageRank.run(graph,100).vertices}} 
 I get the page ranks
 (4,0.295031247)
 (1,0.15)
 (6,0.15)
 (3,0.341249994)
 (7,0.999387662847)
 (9,0.999256447741)
 (8,0.999256447741)
 (10,0.15)
 (5,0.15)
 (2,0.295031247)
 These are quite different, leading me to suspect that one of the PageRank 
 methods is incorrect. I have examined the source, but I do not know what the 
 correct fix is, or which set of values is correct.






[jira] [Commented] (SPARK-2189) Method for removing temp tables created by registerAsTable

2014-08-25 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14109378#comment-14109378
 ] 

Michael Armbrust commented on SPARK-2189:
-

Thanks for offering to work on this.  Can you briefly describe what you plan to 
do here?  I think there are some subtle interface questions at the moment due 
to the way we handle cached tables vs. temporary tables.  Specifically, what 
should happen when you cache a table and then call 
unregisterTempTable(cachedTableName)?
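
To make the interface question concrete, here is a minimal sketch of the sequence in question; {{unregisterTempTable}} is the method being proposed here and does not exist yet, so its name and placement are assumptions:

{code}
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.createSchemaRDD

case class Record(key: Int, value: String)
val records = sc.parallelize(1 to 100).map(i => Record(i, s"val_$i"))

records.registerAsTable("records")   // temporary table
sqlContext.cacheTable("records")     // the same name is now also a cached table

// The open question: should this drop only the temp registration, also
// uncache the data, or fail because the table is currently cached?
// sqlContext.unregisterTempTable("records")   // proposed API, not implemented
{code}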

 Method for removing temp tables created by registerAsTable
 --

 Key: SPARK-2189
 URL: https://issues.apache.org/jira/browse/SPARK-2189
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.0.0
Reporter: Michael Armbrust








[jira] [Created] (SPARK-3207) Choose splits for continuous features in DecisionTree more adaptively

2014-08-25 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-3207:


 Summary: Choose splits for continuous features in DecisionTree 
more adaptively
 Key: SPARK-3207
 URL: https://issues.apache.org/jira/browse/SPARK-3207
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Joseph K. Bradley
Priority: Minor


DecisionTree splits on continuous features by choosing an array of values from 
a subsample of the data.

Currently, it does not check for identical values in the subsample, so it could 
end up having multiple copies of the same split.  This is not an error, but it 
could be improved to be more adaptive to the data.

Proposal: In findSplitsBins, check for identical values, and do some searching 
in order to find a set of unique splits.  Reduce the number of splits if there 
are not enough unique candidates.

This would require modifying findSplitsBins and making sure that the number of 
splits/bins (chosen adaptively) is set correctly elsewhere in the code (such as 
in DecisionTreeMetadata).
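
A rough sketch of the proposed behavior (this is not the actual findSplitsBins code; the method and values below are made up for illustration): deduplicate the sampled values and shrink the number of splits when there are not enough distinct candidates.

{code}
// Candidate thresholds for one continuous feature, chosen from a subsample.
def chooseSplits(subsample: Array[Double], maxSplits: Int): Array[Double] = {
  val distinct = subsample.sorted.distinct            // drop identical values
  if (distinct.length <= 1) return Array.empty[Double]
  val candidates = distinct.init                      // the max value would split nothing off
  val numSplits = math.min(maxSplits, candidates.length)   // reduce splits if needed
  Array.tabulate(numSplits)(i => candidates(i * candidates.length / numSplits))
}

chooseSplits(Array(1.0, 1.0, 1.0, 2.0, 2.0, 3.0), maxSplits = 5)   // Array(1.0, 2.0)
{code}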






[jira] [Commented] (SPARK-3147) Implement A/B testing

2014-08-25 Thread Michael Yannakopoulos (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14109402#comment-14109402
 ] 

Michael Yannakopoulos commented on SPARK-3147:
--

Hi Xiangrui,

It would be my pleasure to help with the implementation of this task. Not only 
would it enhance my coding skills, it would also help me better understand the 
theory behind the statistical tests involved. If you have time and would like 
to work together, I would be glad.

Thanks,
Michael

 Implement A/B testing
 -

 Key: SPARK-3147
 URL: https://issues.apache.org/jira/browse/SPARK-3147
 Project: Spark
  Issue Type: New Feature
  Components: MLlib, Streaming
Reporter: Xiangrui Meng

 A/B testing is widely used to compare online models. We can implement A/B 
 testing in MLlib and integrate it with Spark Streaming. For example, we have 
 a PairDStream[String, Double], whose keys are model ids and values are 
 observations (click or not, or revenue associated with the event). With A/B 
 testing, we can tell whether one model is significantly better than another 
 at a certain time. There are some caveats. For example, we should avoid 
 multiple testing and support A/A testing as a sanity check.  
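
As a rough illustration only (not a design), the core of one such comparison could be a Welch t statistic computed from per-model running aggregates; all names below are placeholders:

{code}
// Per-model running aggregates: count, sum, and sum of squares of the observations.
case class Stats(n: Long, sum: Double, sumSq: Double) {
  def mean: Double = sum / n
  def variance: Double = (sumSq - sum * sum / n) / (n - 1)
}

// Welch's t statistic for comparing model A against model B.
def welchT(a: Stats, b: Stats): Double =
  (a.mean - b.mean) / math.sqrt(a.variance / a.n + b.variance / b.n)

// The aggregates would be maintained from the PairDStream[String, Double],
// e.g. with updateStateByKey; here they are just hard-coded.
val modelA = Stats(n = 10000, sum = 1234.0, sumSq = 1500.0)
val modelB = Stats(n = 10000, sum = 1180.0, sumSq = 1420.0)
println(welchT(modelA, modelB))
{code}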






[jira] [Commented] (SPARK-3184) Allow user to specify num tasks to use for a table

2014-08-25 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14109439#comment-14109439
 ] 

Michael Armbrust commented on SPARK-3184:
-

Yeah, it looks like this is actually implemented.  Though it would be nice for 
us to have a real way to do it (instead of hijacking Hive's way) and to also 
print a deprecation warning when the Hive way is used.  For those reasons I 
think we can leave this open but decrease the priority.

 Allow user to specify num tasks to use for a table
 --

 Key: SPARK-3184
 URL: https://issues.apache.org/jira/browse/SPARK-3184
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Andy Konwinski








[jira] [Updated] (SPARK-3184) Allow user to specify num tasks to use for a table

2014-08-25 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-3184:


Priority: Minor  (was: Major)

 Allow user to specify num tasks to use for a table
 --

 Key: SPARK-3184
 URL: https://issues.apache.org/jira/browse/SPARK-3184
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Andy Konwinski
Priority: Minor








[jira] [Created] (SPARK-3208) Hive Parquet SerDe returns null columns

2014-08-25 Thread Michael Armbrust (JIRA)
Michael Armbrust created SPARK-3208:
---

 Summary: Hive Parquet SerDe returns null columns
 Key: SPARK-3208
 URL: https://issues.apache.org/jira/browse/SPARK-3208
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.1.0
Reporter: Michael Armbrust
Priority: Minor


There is a workaround, which is to set 
'spark.sql.hive.convertMetastoreParquet=true'.  However, it would still be good 
to figure out what is going on here.
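
A sketch of applying the workaround, assuming a {{HiveContext}} named {{hiveContext}} (either form should work):

{code}
// Programmatically:
hiveContext.setConf("spark.sql.hive.convertMetastoreParquet", "true")

// Or through SQL:
hiveContext.sql("SET spark.sql.hive.convertMetastoreParquet=true")
{code}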






[jira] [Created] (SPARK-3209) bump the version in banner

2014-08-25 Thread Davies Liu (JIRA)
Davies Liu created SPARK-3209:
-

 Summary: bump the version in banner
 Key: SPARK-3209
 URL: https://issues.apache.org/jira/browse/SPARK-3209
 Project: Spark
  Issue Type: Bug
  Components: PySpark, Spark Core
Affects Versions: 1.1.0
Reporter: Davies Liu
Priority: Blocker


daviesliu@dm:~/work/spark$ ../spark/bin/spark-shell
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.0.0-SNAPSHOT
      /_/


daviesliu@dm:~/work/spark$ ./bin/pyspark
Python 2.7.5 (default, Mar  9 2014, 22:15:05)
[GCC 4.2.1 Compatible Apple LLVM 5.0 (clang-500.0.68)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.0.0-SNAPSHOT
      /_/
Using Python version 2.7.5 (default, Mar  9 2014 22:15:05)
SparkContext available as sc.







[jira] [Resolved] (SPARK-3140) PySpark start-up throws confusing exception

2014-08-25 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-3140.
---

Resolution: Fixed
  Assignee: Andrew Or

 PySpark start-up throws confusing exception
 ---

 Key: SPARK-3140
 URL: https://issues.apache.org/jira/browse/SPARK-3140
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.0.2
Reporter: Andrew Or
Assignee: Andrew Or
Priority: Critical
 Fix For: 1.1.0


 Currently we read the pyspark port through stdout of the spark-submit 
 subprocess. However, if there is stdout interference, e.g. spark-submit 
 echoes something unexpected to stdout, we print the following:
 {code}
 Exception: Launching GatewayServer failed! (Warning: unexpected output 
 detected.)
 {code}
 This condition is fine. However, we actually throw the same exception if 
 there is *no* output from the subprocess as well. This is very confusing 
 because it implies that the subprocess is outputting something (possibly 
 whitespace, which is not visible) when it's actually not.






[jira] [Commented] (SPARK-1764) EOF reached before Python server acknowledged

2014-08-25 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14109512#comment-14109512
 ] 

Davies Liu commented on SPARK-1764:
---

This issue should be fixed by SPARK-2282 [1]; I ran the jobs above against 
mesos-0.19.1 for more than an hour without problems.

[~therealnb] Could you also verify this?

[1] 
https://github.com/apache/spark/commit/ef4ff00f87a4e8d38866f163f01741c2673e41da

 EOF reached before Python server acknowledged
 -

 Key: SPARK-1764
 URL: https://issues.apache.org/jira/browse/SPARK-1764
 Project: Spark
  Issue Type: Bug
  Components: Mesos, PySpark
Affects Versions: 1.0.0
Reporter: Bouke van der Bijl
Priority: Blocker
  Labels: mesos, pyspark

 I'm getting EOF reached before Python server acknowledged while using 
 PySpark on Mesos. The error manifests itself in multiple ways. One is:
 14/05/08 18:10:40 ERROR DAGSchedulerActorSupervisor: eventProcesserActor 
 failed due to the error EOF reached before Python server acknowledged; 
 shutting down SparkContext
 And the other has a full stacktrace:
 14/05/08 18:03:06 ERROR OneForOneStrategy: EOF reached before Python server 
 acknowledged
 org.apache.spark.SparkException: EOF reached before Python server acknowledged
   at 
 org.apache.spark.api.python.PythonAccumulatorParam.addInPlace(PythonRDD.scala:416)
   at 
 org.apache.spark.api.python.PythonAccumulatorParam.addInPlace(PythonRDD.scala:387)
   at org.apache.spark.Accumulable.$plus$plus$eq(Accumulators.scala:71)
   at 
 org.apache.spark.Accumulators$$anonfun$add$2.apply(Accumulators.scala:279)
   at 
 org.apache.spark.Accumulators$$anonfun$add$2.apply(Accumulators.scala:277)
   at 
 scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
   at 
 scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
   at 
 scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
   at 
 scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:226)
   at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39)
   at scala.collection.mutable.HashMap.foreach(HashMap.scala:98)
   at 
 scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
   at org.apache.spark.Accumulators$.add(Accumulators.scala:277)
   at 
 org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:818)
   at 
 org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1204)
   at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
   at akka.actor.ActorCell.invoke(ActorCell.scala:456)
   at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
   at akka.dispatch.Mailbox.run(Mailbox.scala:219)
   at 
 akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
   at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
   at 
 scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
   at 
 scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
   at 
 scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
 This error causes the SparkContext to shut down. I have not been able to 
 reliably reproduce this bug; it seems to happen randomly, but if you run 
 enough tasks on a SparkContext it'll happen eventually.






[jira] [Created] (SPARK-3210) Flume Polling Receiver must be more tolerant to connection failures.

2014-08-25 Thread Hari Shreedharan (JIRA)
Hari Shreedharan created SPARK-3210:
---

 Summary: Flume Polling Receiver must be more tolerant to 
connection failures.
 Key: SPARK-3210
 URL: https://issues.apache.org/jira/browse/SPARK-3210
 Project: Spark
  Issue Type: Bug
Reporter: Hari Shreedharan









[jira] [Closed] (SPARK-3209) bump the version in banner

2014-08-25 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu closed SPARK-3209.
-

Resolution: Invalid

The version number is correct in branch-1.1. 

 bump the version in banner
 --

 Key: SPARK-3209
 URL: https://issues.apache.org/jira/browse/SPARK-3209
 Project: Spark
  Issue Type: Bug
  Components: PySpark, Spark Core
Affects Versions: 1.1.0
Reporter: Davies Liu
Priority: Blocker

 daviesliu@dm:~/work/spark$ ../spark/bin/spark-shell
 Welcome to
       ____              __
      / __/__  ___ _____/ /__
     _\ \/ _ \/ _ `/ __/  '_/
    /___/ .__/\_,_/_/ /_/\_\   version 1.0.0-SNAPSHOT
       /_/
 daviesliu@dm:~/work/spark$ ./bin/pyspark
 Python 2.7.5 (default, Mar  9 2014, 22:15:05)
 [GCC 4.2.1 Compatible Apple LLVM 5.0 (clang-500.0.68)] on darwin
 Type "help", "copyright", "credits" or "license" for more information.
 Welcome to
       ____              __
      / __/__  ___ _____/ /__
     _\ \/ _ \/ _ `/ __/  '_/
    /___/ .__/\_,_/_/ /_/\_\   version 1.0.0-SNAPSHOT
       /_/
 Using Python version 2.7.5 (default, Mar  9 2014 22:15:05)
 SparkContext available as sc.






[jira] [Resolved] (SPARK-2840) Improve documentation for decision tree

2014-08-25 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-2840.
---

   Resolution: Fixed
Fix Version/s: 1.1.0

 Improve documentation for decision tree
 ---

 Key: SPARK-2840
 URL: https://issues.apache.org/jira/browse/SPARK-2840
 Project: Spark
  Issue Type: Sub-task
  Components: Documentation, MLlib
Reporter: Xiangrui Meng
Assignee: Joseph K. Bradley
 Fix For: 1.1.0


 1. add code examples for Python/Java
 2. add documentation for multiclass classification






[jira] [Commented] (SPARK-3044) Create RSS feed for Spark News

2014-08-25 Thread Michael Yannakopoulos (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14109571#comment-14109571
 ] 

Michael Yannakopoulos commented on SPARK-3044:
--

Hi Nicholas,

I am really interested in working on this issue. Do you know where I can find the
source code of the official [Apache Spark site|http://spark.apache.org]?

Thanks,
Michael

 Create RSS feed for Spark News
 --

 Key: SPARK-3044
 URL: https://issues.apache.org/jira/browse/SPARK-3044
 Project: Spark
  Issue Type: Documentation
Reporter: Nicholas Chammas
Priority: Minor

 Project updates are often posted here: http://spark.apache.org/news/
 Currently, there is no way to subscribe to a feed of these updates. It would 
 be nice if there were a way for people to be notified of new posts there without 
 having to check manually.






[jira] [Created] (SPARK-3211) .take() is OOM-prone when there are empty partitions

2014-08-25 Thread Andrew Ash (JIRA)
Andrew Ash created SPARK-3211:
-

 Summary: .take() is OOM-prone when there are empty partitions
 Key: SPARK-3211
 URL: https://issues.apache.org/jira/browse/SPARK-3211
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.2
Reporter: Andrew Ash


Filed on dev@ on 22 August by [~pnepywoda]:

{quote}
On line 777
https://github.com/apache/spark/commit/42571d30d0d518e69eecf468075e4c5a823a2ae8#diff-1d55e54678eff2076263f2fe36150c17R771
the logic for take() reads ALL partitions if the first one (or first k) are
empty. This has actually led to OOMs when we had many partitions
(thousands) and unfortunately the first one was empty.

Wouldn't a better implementation strategy be

numPartsToTry = partsScanned * 2

instead of

numPartsToTry = totalParts - 1

(this doubling is similar to most memory allocation strategies)

Thanks!
- Paul
{quote}
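
A self-contained sketch of the suggested doubling strategy (the toy data and variable names are mine; this is not the actual {{RDD.take}} code):

{code}
// Five empty "partitions" followed by two non-empty ones.
val partitions: Array[Array[Int]] =
  Array.fill(5)(Array.empty[Int]) ++ Array(Array(1, 2, 3), Array(4, 5))
val num = 4                                    // how many elements take() wants
val buf = scala.collection.mutable.ArrayBuffer[Int]()

var partsScanned = 0
var numPartsToTry = 1
while (buf.size < num && partsScanned < partitions.length) {
  val upTo = math.min(partsScanned + numPartsToTry, partitions.length)
  (partsScanned until upTo).foreach(i => buf ++= partitions(i).take(num - buf.size))
  partsScanned = upTo
  numPartsToTry = partsScanned * 2             // grow geometrically, not totalParts - 1
}
println(buf)                                   // ArrayBuffer(1, 2, 3, 4)
{code}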






[jira] [Commented] (SPARK-1764) EOF reached before Python server acknowledged

2014-08-25 Thread nigel (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14109583#comment-14109583
 ] 

nigel commented on SPARK-1764:
--

Hi;
Sadly I moved jobs and I don't have a working Spark environment at the moment 
(I will be doing some Spark work soon :-). I'll pass this on to the guys that 
are still there and get them to confirm. 
Cheers

 EOF reached before Python server acknowledged
 -

 Key: SPARK-1764
 URL: https://issues.apache.org/jira/browse/SPARK-1764
 Project: Spark
  Issue Type: Bug
  Components: Mesos, PySpark
Affects Versions: 1.0.0
Reporter: Bouke van der Bijl
Assignee: Davies Liu
Priority: Blocker
  Labels: mesos, pyspark

 I'm getting EOF reached before Python server acknowledged while using 
 PySpark on Mesos. The error manifests itself in multiple ways. One is:
 14/05/08 18:10:40 ERROR DAGSchedulerActorSupervisor: eventProcesserActor 
 failed due to the error EOF reached before Python server acknowledged; 
 shutting down SparkContext
 And the other has a full stacktrace:
 14/05/08 18:03:06 ERROR OneForOneStrategy: EOF reached before Python server 
 acknowledged
 org.apache.spark.SparkException: EOF reached before Python server acknowledged
   at 
 org.apache.spark.api.python.PythonAccumulatorParam.addInPlace(PythonRDD.scala:416)
   at 
 org.apache.spark.api.python.PythonAccumulatorParam.addInPlace(PythonRDD.scala:387)
   at org.apache.spark.Accumulable.$plus$plus$eq(Accumulators.scala:71)
   at 
 org.apache.spark.Accumulators$$anonfun$add$2.apply(Accumulators.scala:279)
   at 
 org.apache.spark.Accumulators$$anonfun$add$2.apply(Accumulators.scala:277)
   at 
 scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
   at 
 scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
   at 
 scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
   at 
 scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:226)
   at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39)
   at scala.collection.mutable.HashMap.foreach(HashMap.scala:98)
   at 
 scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
   at org.apache.spark.Accumulators$.add(Accumulators.scala:277)
   at 
 org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:818)
   at 
 org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1204)
   at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
   at akka.actor.ActorCell.invoke(ActorCell.scala:456)
   at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
   at akka.dispatch.Mailbox.run(Mailbox.scala:219)
   at 
 akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
   at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
   at 
 scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
   at 
 scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
   at 
 scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
 This error causes the SparkContext to shut down. I have not been able to 
 reliably reproduce this bug; it seems to happen randomly, but if you run 
 enough tasks on a SparkContext it'll happen eventually.






[jira] [Commented] (SPARK-2087) Clean Multi-user semantics for thrift JDBC/ODBC server.

2014-08-25 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14109587#comment-14109587
 ] 

Michael Armbrust commented on SPARK-2087:
-

You can't make temporary tables yet, but you will be able to when we add the 
CACHE TABLE ... AS SELECT... syntax 
https://issues.apache.org/jira/browse/SPARK-2594.
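
For context, a sketch of what that might look like once SPARK-2594 lands (the syntax below is the proposal, not something that works today):

{code}
// Proposed: create a cached temporary table from a query (per SPARK-2594).
hiveContext.sql("CACHE TABLE recent_events AS SELECT * FROM events WHERE ts > '2014-08-01'")
{code}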

 Clean Multi-user semantics for thrift JDBC/ODBC server.
 ---

 Key: SPARK-2087
 URL: https://issues.apache.org/jira/browse/SPARK-2087
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Michael Armbrust
Assignee: Zongheng Yang
Priority: Minor

 Configuration and temporary tables should exist per-user.  Cached tables 
 should be shared across users.






[jira] [Commented] (SPARK-3211) .take() is OOM-prone when there are empty partitions

2014-08-25 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14109591#comment-14109591
 ] 

Apache Spark commented on SPARK-3211:
-

User 'ash211' has created a pull request for this issue:
https://github.com/apache/spark/pull/2117

 .take() is OOM-prone when there are empty partitions
 

 Key: SPARK-3211
 URL: https://issues.apache.org/jira/browse/SPARK-3211
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.2
Reporter: Andrew Ash

 Filed on dev@ on 22 August by [~pnepywoda]:
 {quote}
 On line 777
 https://github.com/apache/spark/commit/42571d30d0d518e69eecf468075e4c5a823a2ae8#diff-1d55e54678eff2076263f2fe36150c17R771
 the logic for take() reads ALL partitions if the first one (or first k) are
 empty. This has actually led to OOMs when we had many partitions
 (thousands) and unfortunately the first one was empty.
 Wouldn't a better implementation strategy be
 numPartsToTry = partsScanned * 2
 instead of
 numPartsToTry = totalParts - 1
 (this doubling is similar to most memory allocation strategies)
 Thanks!
 - Paul
 {quote}






[jira] [Commented] (SPARK-3205) input format for text records saved with in-record delimiter and newline characters escaped

2014-08-25 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14109592#comment-14109592
 ] 

Apache Spark commented on SPARK-3205:
-

User 'mengxr' has created a pull request for this issue:
https://github.com/apache/spark/pull/2118

 input format for text records saved with in-record delimiter and newline 
 characters escaped
 ---

 Key: SPARK-3205
 URL: https://issues.apache.org/jira/browse/SPARK-3205
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core, SQL
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng

 Text records may contain in-record delimiter or newline characters. In such 
 cases, we can either encode them or escape them. The latter is simpler and 
 used by Redshift's UNLOAD with the ESCAPE option. The problem is that a 
 record will span multiple lines. We need an input format for it.
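
To illustrate the escaping convention being targeted, a small sketch (the delimiter and escape characters are assumptions, not part of this proposal):

{code}
// Redshift-style ESCAPE: prefix the delimiter, newline, and the escape
// character itself with a backslash, so one logical record may span lines.
def escape(field: String, delimiter: Char = '|'): String =
  field.flatMap {
    case '\\'                => "\\\\"
    case '\n'                => "\\\n"
    case c if c == delimiter => "\\" + c
    case c                   => c.toString
  }

escape("a|b\nc")   // the delimiter and the embedded newline are both escaped
{code}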






[jira] [Resolved] (SPARK-2495) Ability to re-create ML models

2014-08-25 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-2495.
--

   Resolution: Fixed
Fix Version/s: 1.1.0

Issue resolved by pull request 2112
[https://github.com/apache/spark/pull/2112]

 Ability to re-create ML models
 --

 Key: SPARK-2495
 URL: https://issues.apache.org/jira/browse/SPARK-2495
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.0.1
Reporter: Alexander Albul
Assignee: Alexander Albul
 Fix For: 1.1.0


 Hi everyone.
 Previously (prior to Spark 1.0) we were working with MLlib like this:
 1) Calculate the model (a costly operation)
 2) Take the model and collect its fields, like weights, intercept, etc.
 3) Store the model somewhere in our own format
 4) Do predictions by loading the model attributes, creating a new model, and 
 predicting with it.
 Now I see that the models' constructors have a *private* modifier and cannot be 
 called from outside.
 If you want to hide implementation details and keep these constructors as 
 developer API, why not at least create a method that takes the weights and 
 intercept (for example) and materializes the model?
 A good example of the kind of model I am talking about is *LinearRegressionModel*.
 I know that the *LinearRegressionWithSGD* class has a *createModel* method, but the 
 problem is that it has a *protected* modifier as well.
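
For reference, this is roughly what the reporter wants to be able to write (a sketch only; whether the constructor is accessible is exactly what this ticket is about):

{code}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LinearRegressionModel

// Rebuild a model from previously stored attributes instead of retraining.
val weights = Vectors.dense(0.5, -1.2)
val intercept = 0.1
val model = new LinearRegressionModel(weights, intercept)

model.predict(Vectors.dense(1.0, 2.0))   // 0.5 * 1.0 + (-1.2) * 2.0 + 0.1 = -1.8
{code}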






[jira] [Resolved] (SPARK-2798) Correct several small errors in Flume module pom.xml files

2014-08-25 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das resolved SPARK-2798.
--

   Resolution: Fixed
Fix Version/s: 1.1.0
 Assignee: Sean Owen

 Correct several small errors in Flume module pom.xml files
 --

 Key: SPARK-2798
 URL: https://issues.apache.org/jira/browse/SPARK-2798
 Project: Spark
  Issue Type: Bug
  Components: Build
Reporter: Sean Owen
Assignee: Sean Owen
Priority: Minor
 Fix For: 1.1.0


 (EDIT) Since the scalatest issue was resolved, this is now about a few 
 small problems in the Flume Sink pom.xml:
 - scalatest is not declared as a test-scope dependency
 - Its Avro version doesn't match the rest of the build
 - Its Flume version is not synced with the other Flume module
 - The other Flume module declares its dependency on Flume Sink slightly 
 incorrectly, hard-coding the Scala 2.10 version
 - It depends on Scala Lang directly, which it shouldn't






[jira] [Updated] (SPARK-2798) Correct several small errors in Flume module pom.xml files

2014-08-25 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das updated SPARK-2798:
-

Affects Version/s: (was: 1.0.1)

 Correct several small errors in Flume module pom.xml files
 --

 Key: SPARK-2798
 URL: https://issues.apache.org/jira/browse/SPARK-2798
 Project: Spark
  Issue Type: Bug
  Components: Build
Reporter: Sean Owen
Assignee: Sean Owen
Priority: Minor
 Fix For: 1.1.0


 (EDIT) Since the scalatest issue was resolved, this is now about a few 
 small problems in the Flume Sink pom.xml:
 - scalatest is not declared as a test-scope dependency
 - Its Avro version doesn't match the rest of the build
 - Its Flume version is not synced with the other Flume module
 - The other Flume module declares its dependency on Flume Sink slightly 
 incorrectly, hard-coding the Scala 2.10 version
 - It depends on Scala Lang directly, which it shouldn't






[jira] [Commented] (SPARK-2798) Correct several small errors in Flume module pom.xml files

2014-08-25 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14109675#comment-14109675
 ] 

Sean Owen commented on SPARK-2798:
--

[~tdas] Cool, I think this closes SPARK-3169 too, if I understand correctly.

 Correct several small errors in Flume module pom.xml files
 --

 Key: SPARK-2798
 URL: https://issues.apache.org/jira/browse/SPARK-2798
 Project: Spark
  Issue Type: Bug
  Components: Build
Reporter: Sean Owen
Assignee: Sean Owen
Priority: Minor
 Fix For: 1.1.0


 (EDIT) Since the scalatest issue was resolved, this is now about a few 
 small problems in the Flume Sink pom.xml:
 - scalatest is not declared as a test-scope dependency
 - Its Avro version doesn't match the rest of the build
 - Its Flume version is not synced with the other Flume module
 - The other Flume module declares its dependency on Flume Sink slightly 
 incorrectly, hard-coding the Scala 2.10 version
 - It depends on Scala Lang directly, which it shouldn't






[jira] [Commented] (SPARK-3044) Create RSS feed for Spark News

2014-08-25 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14109677#comment-14109677
 ] 

Nicholas Chammas commented on SPARK-3044:
-

Hi Michael,

I don't know if the site itself is open-source. We might need someone from 
Databricks to update it.

[~pwendell], [~rxin] - Is it possible for contributors to contribute to the 
[main Spark site|http://spark.apache.org/]?

 Create RSS feed for Spark News
 --

 Key: SPARK-3044
 URL: https://issues.apache.org/jira/browse/SPARK-3044
 Project: Spark
  Issue Type: Documentation
Reporter: Nicholas Chammas
Priority: Minor

 Project updates are often posted here: http://spark.apache.org/news/
 Currently, there is no way to subscribe to a feed of these updates. It would 
 be nice if there were a way for people to be notified of new posts there without 
 having to check manually.






[jira] [Created] (SPARK-3213) spark_ec2.py cannot find slave instances

2014-08-25 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-3213:


 Summary: spark_ec2.py cannot find slave instances
 Key: SPARK-3213
 URL: https://issues.apache.org/jira/browse/SPARK-3213
 Project: Spark
  Issue Type: Bug
  Components: EC2
Affects Versions: 1.1.0
Reporter: Joseph K. Bradley
Priority: Blocker


spark_ec2.py cannot find all slave instances.  In particular:
* I created a master & slave and configured them.
* I created new slave instances from the original slave ("Launch More Like 
This").
* I tried to relaunch the cluster, and it could only find the original slave.

Old versions of the script worked.  The latest working commit that edited that 
.py script is: a0bcbc159e89be868ccc96175dbf1439461557e1

There may be a problem with this PR: 
[https://github.com/apache/spark/pull/1899].






[jira] [Commented] (SPARK-3213) spark_ec2.py cannot find slave instances

2014-08-25 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14109697#comment-14109697
 ] 

Joseph K. Bradley commented on SPARK-3213:
--

[~vidaha]  Please take a look.  Thanks!

 spark_ec2.py cannot find slave instances
 

 Key: SPARK-3213
 URL: https://issues.apache.org/jira/browse/SPARK-3213
 Project: Spark
  Issue Type: Bug
  Components: EC2
Affects Versions: 1.1.0
Reporter: Joseph K. Bradley
Priority: Blocker

 spark_ec2.py cannot find all slave instances.  In particular:
 * I created a master & slave and configured them.
 * I created new slave instances from the original slave (Launch More Like 
 This).
 * I tried to relaunch the cluster, and it could only find the original slave.
 Old versions of the script worked.  The latest working commit which edited 
 that .py script is: a0bcbc159e89be868ccc96175dbf1439461557e1
 There may be a problem with this PR: 
 [https://github.com/apache/spark/pull/1899].






[jira] [Commented] (SPARK-3213) spark_ec2.py cannot find slave instances

2014-08-25 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14109700#comment-14109700
 ] 

Joseph K. Bradley commented on SPARK-3213:
--

The security group name I was using was "joseph-r3.2xlarge-slaves".  It may be a 
regex/matching issue.

 spark_ec2.py cannot find slave instances
 

 Key: SPARK-3213
 URL: https://issues.apache.org/jira/browse/SPARK-3213
 Project: Spark
  Issue Type: Bug
  Components: EC2
Affects Versions: 1.1.0
Reporter: Joseph K. Bradley
Priority: Blocker

 spark_ec2.py cannot find all slave instances.  In particular:
 * I created a master & slave and configured them.
 * I created new slave instances from the original slave (Launch More Like 
 This).
 * I tried to relaunch the cluster, and it could only find the original slave.
 Old versions of the script worked.  The latest working commit which edited 
 that .py script is: a0bcbc159e89be868ccc96175dbf1439461557e1
 There may be a problem with this PR: 
 [https://github.com/apache/spark/pull/1899].






[jira] [Updated] (SPARK-3156) DecisionTree: Order categorical features adaptively

2014-08-25 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3156?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-3156:
-

Assignee: Joseph K. Bradley

 DecisionTree: Order categorical features adaptively
 ---

 Key: SPARK-3156
 URL: https://issues.apache.org/jira/browse/SPARK-3156
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Joseph K. Bradley
Assignee: Joseph K. Bradley

 Improvement: accuracy
 Currently, ordered categorical features use a fixed bin ordering chosen 
 before training based on a subsample of the data.  (See the code using 
 centroids in findSplitsBins().)
 Proposal: Choose the ordering adaptively for every split.  This would require 
 a bit more computation on the master, but could improve results by splitting 
 more intelligently.
 Required changes: The result of aggregation is used in 
 findAggForOrderedFeatureClassification() to compute running totals over the 
 pre-set ordering of categorical feature values.  The stats should instead be 
 used to choose a new ordering of categories, before computing running totals.
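
A toy sketch of the adaptive ordering idea for a binary-label problem (the names and numbers are made up; this is not the MLlib aggregation code):

{code}
// For one categorical feature at one node: per-category label statistics.
val statsPerCategory: Map[Int, (Double, Long)] =   // category -> (label sum, count)
  Map(0 -> (3.0, 10L), 1 -> (9.0, 12L), 2 -> (1.0, 4L))

// Order categories by their label mean at this node (instead of a fixed,
// precomputed order); running totals over this order then give the splits.
val ordered = statsPerCategory.toSeq
  .sortBy { case (_, (sum, n)) => sum / n }
  .map(_._1)                                       // Seq(2, 0, 1) for these numbers

// Candidate binary splits are the prefixes of the ordering: {2}, {2, 0}, ...
val candidateSplits = (1 until ordered.size).map(k => ordered.take(k).toSet)
{code}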






[jira] [Resolved] (SPARK-3180) Better control of security groups

2014-08-25 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-3180.
---

   Resolution: Fixed
Fix Version/s: 1.3.0

Issue resolved by pull request 2088
[https://github.com/apache/spark/pull/2088]

 Better control of security groups
 -

 Key: SPARK-3180
 URL: https://issues.apache.org/jira/browse/SPARK-3180
 Project: Spark
  Issue Type: Improvement
Reporter: Allan Douglas R. de Oliveira
 Fix For: 1.3.0


 Two features can be combined to provide better control of security 
 group policies:
 - The ability to specify the address authorized to access the default 
 security group (instead of allowing everyone: 0.0.0.0/0)
 - The ability to place the created machines in a custom security group
 One can combine the two flags to restrict external access to 
 the provided security group (e.g. by setting the authorized address to 
 127.0.0.1/32) while maintaining compatibility with the current behavior.






[jira] [Created] (SPARK-3214) Argument parsing loop in make-distribution.sh ends prematurely

2014-08-25 Thread Cheng Lian (JIRA)
Cheng Lian created SPARK-3214:
-

 Summary: Argument parsing loop in make-distribution.sh ends 
prematurely
 Key: SPARK-3214
 URL: https://issues.apache.org/jira/browse/SPARK-3214
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 1.0.2
Reporter: Cheng Lian
Priority: Minor


Running {{make-distribution.sh}} in this way:
{code}
./make-distribution.sh --hadoop -Pyarn
{code}
results in a proper error message:
{code}
Error: '--hadoop' is no longer supported:
Error: use Maven options -Phadoop.version and -Pyarn.version
{code}
But if you run it with the options in reverse order, it just passes:
{code}
./make-distribution.sh -Pyarn --hadoop
{code}
The reason is that the {{while}} loop ends prematurely before checking all 
potentially deprecated command line options.






[jira] [Commented] (SPARK-2798) Correct several small errors in Flume module pom.xml files

2014-08-25 Thread Tathagata Das (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14109735#comment-14109735
 ] 

Tathagata Das commented on SPARK-2798:
--

Naah, that was already closed by the fix I did on Friday 
(https://github.com/apache/spark/pull/2101). Maven and therefore 
make-distribution should work fine with that fix. 

 Correct several small errors in Flume module pom.xml files
 --

 Key: SPARK-2798
 URL: https://issues.apache.org/jira/browse/SPARK-2798
 Project: Spark
  Issue Type: Bug
  Components: Build
Reporter: Sean Owen
Assignee: Sean Owen
Priority: Minor
 Fix For: 1.1.0


 (EDIT) Since the scalatest issue was resolved, this is now about a few 
 small problems in the Flume Sink pom.xml:
 - scalatest is not declared as a test-scope dependency
 - Its Avro version doesn't match the rest of the build
 - Its Flume version is not synced with the other Flume module
 - The other Flume module declares its dependency on Flume Sink slightly 
 incorrectly, hard-coding the Scala 2.10 version
 - It depends on Scala Lang directly, which it shouldn't






[jira] [Commented] (SPARK-3214) Argument parsing loop in make-distribution.sh ends prematurely

2014-08-25 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14109765#comment-14109765
 ] 

Cheng Lian commented on SPARK-3214:
---

Didn't realize all Maven options must go after other {{make-distribution.sh}} 
options. Closing this.

 Argument parsing loop in make-distribution.sh ends prematurely
 --

 Key: SPARK-3214
 URL: https://issues.apache.org/jira/browse/SPARK-3214
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 1.0.2
Reporter: Cheng Lian
Priority: Minor

 Running {{make-distribution.sh}} in this way:
 {code}
 ./make-distribution.sh --hadoop -Pyarn
 {code}
 results in a proper error message:
 {code}
 Error: '--hadoop' is no longer supported:
 Error: use Maven options -Phadoop.version and -Pyarn.version
 {code}
 But if you run it with the options in reverse order, it just passes:
 {code}
 ./make-distribution.sh -Pyarn --hadoop
 {code}
 The reason is that the {{while}} loop ends prematurely before checking all 
 potentially deprecated command line options.






[jira] [Closed] (SPARK-3214) Argument parsing loop in make-distribution.sh ends prematurely

2014-08-25 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian closed SPARK-3214.
-

Resolution: Not a Problem

 Argument parsing loop in make-distribution.sh ends prematurely
 --

 Key: SPARK-3214
 URL: https://issues.apache.org/jira/browse/SPARK-3214
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 1.0.2
Reporter: Cheng Lian
Priority: Minor

 Running {{make-distribution.sh}} in this way:
 {code}
 ./make-distribution.sh --hadoop -Pyarn
 {code}
 results in a proper error message:
 {code}
 Error: '--hadoop' is no longer supported:
 Error: use Maven options -Phadoop.version and -Pyarn.version
 {code}
 But if you run it with the options in reverse order, it just passes:
 {code}
 ./make-distribution.sh -Pyarn --hadoop
 {code}
 The reason is that the {{while}} loop ends prematurely before checking all 
 potentially deprecated command line options.






[jira] [Commented] (SPARK-3213) spark_ec2.py cannot find slave instances

2014-08-25 Thread Vida Ha (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14109818#comment-14109818
 ] 

Vida Ha commented on SPARK-3213:


Joseph, Josh, and I discussed this in person. 

There is a quick workaround:

1) Use an old version of the spark_ec2 scripts that uses security groups to 
identify the slaves, if using "Launch More Like This"

But now I need to investigate:

When using "Launch More Like This", it does seem like Amazon tries to reuse the 
tags, but I'm wondering if it doesn't like having multiple machines with the 
same "Name" tag.  I will try using a different tag, like "spark-ec2-cluster-id" 
or something like that, to identify the machines.  If that tag does copy over, 
then we can properly support "Launch More Like This".

 spark_ec2.py cannot find slave instances
 

 Key: SPARK-3213
 URL: https://issues.apache.org/jira/browse/SPARK-3213
 Project: Spark
  Issue Type: Bug
  Components: EC2
Affects Versions: 1.1.0
Reporter: Joseph K. Bradley
Priority: Blocker

 spark_ec2.py cannot find all slave instances.  In particular:
 * I created a master & slave and configured them.
 * I created new slave instances from the original slave (Launch More Like 
 This).
 * I tried to relaunch the cluster, and it could only find the original slave.
 Old versions of the script worked.  The latest working commit which edited 
 that .py script is: a0bcbc159e89be868ccc96175dbf1439461557e1
 There may be a problem with this PR: 
 [https://github.com/apache/spark/pull/1899].






[jira] [Comment Edited] (SPARK-3213) spark_ec2.py cannot find slave instances

2014-08-25 Thread Vida Ha (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14109818#comment-14109818
 ] 

Vida Ha edited comment on SPARK-3213 at 8/25/14 9:57 PM:
-

Joseph, Josh, and I discussed this in person. 

There are two quick workarounds:

1) Use an old version of the spark_ec2 scripts that uses security groups to 
identify the slaves, if using "Launch More Like This"

2) Avoid using "Launch More Like This"

But now I need to investigate:

When using "Launch More Like This", it does seem like Amazon tries to reuse the 
tags, but I'm wondering if it doesn't like having multiple machines with the 
same "Name" tag.  I will try using a different tag, like "spark-ec2-cluster-id" 
or something like that, to identify the machines.  If that tag does copy over, 
then we can properly support "Launch More Like This".


was (Author: vidaha):
Joseph, Josh, and I discussed this in person. 

There is a quick workaround:

1) Use an old version of the spark_ec2 scripts that uses security groups to 
identify the slaves, if using "Launch More Like This"

But now I need to investigate:

When using "Launch More Like This", it does seem like Amazon tries to reuse the 
tags, but I'm wondering if it doesn't like having multiple machines with the 
same "Name" tag.  I will try using a different tag, like "spark-ec2-cluster-id" 
or something like that, to identify the machines.  If that tag does copy over, 
then we can properly support "Launch More Like This".

 spark_ec2.py cannot find slave instances
 

 Key: SPARK-3213
 URL: https://issues.apache.org/jira/browse/SPARK-3213
 Project: Spark
  Issue Type: Bug
  Components: EC2
Affects Versions: 1.1.0
Reporter: Joseph K. Bradley
Priority: Blocker

 spark_ec2.py cannot find all slave instances.  In particular:
 * I created a master & slave and configured them.
 * I created new slave instances from the original slave (Launch More Like 
 This).
 * I tried to relaunch the cluster, and it could only find the original slave.
 Old versions of the script worked.  The latest working commit which edited 
 that .py script is: a0bcbc159e89be868ccc96175dbf1439461557e1
 There may be a problem with this PR: 
 [https://github.com/apache/spark/pull/1899].






[jira] [Commented] (SPARK-3213) spark_ec2.py cannot find slave instances

2014-08-25 Thread Vida Ha (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14109828#comment-14109828
 ] 

Vida Ha commented on SPARK-3213:


Can someone rename this issue to:

spark_ec2.py cannot find slave instances launched with Launch More Like This

I think that's more indicative of the issue - it's not wider than that.

 spark_ec2.py cannot find slave instances
 

 Key: SPARK-3213
 URL: https://issues.apache.org/jira/browse/SPARK-3213
 Project: Spark
  Issue Type: Bug
  Components: EC2
Affects Versions: 1.1.0
Reporter: Joseph K. Bradley
Priority: Blocker

 spark_ec2.py cannot find all slave instances.  In particular:
 * I created a master & slave and configured them.
 * I created new slave instances from the original slave (Launch More Like 
 This).
 * I tried to relaunch the cluster, and it could only find the original slave.
 Old versions of the script worked.  The latest working commit which edited 
 that .py script is: a0bcbc159e89be868ccc96175dbf1439461557e1
 There may be a problem with this PR: 
 [https://github.com/apache/spark/pull/1899].






[jira] [Updated] (SPARK-3213) spark_ec2.py cannot find slave instances launched with Launch More Like This

2014-08-25 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-3213:
-

Summary: spark_ec2.py cannot find slave instances launched with Launch 
More Like This  (was: spark_ec2.py cannot find slave instances)

 spark_ec2.py cannot find slave instances launched with Launch More Like This
 --

 Key: SPARK-3213
 URL: https://issues.apache.org/jira/browse/SPARK-3213
 Project: Spark
  Issue Type: Bug
  Components: EC2
Affects Versions: 1.1.0
Reporter: Joseph K. Bradley
Priority: Blocker

 spark_ec2.py cannot find all slave instances.  In particular:
 * I created a master & slave and configured them.
 * I created new slave instances from the original slave (Launch More Like 
 This).
 * I tried to relaunch the cluster, and it could only find the original slave.
 Old versions of the script worked.  The latest working commit which edited 
 that .py script is: a0bcbc159e89be868ccc96175dbf1439461557e1
 There may be a problem with this PR: 
 [https://github.com/apache/spark/pull/1899].






[jira] [Created] (SPARK-3215) Add remote interface for SparkContext

2014-08-25 Thread Marcelo Vanzin (JIRA)
Marcelo Vanzin created SPARK-3215:
-

 Summary: Add remote interface for SparkContext
 Key: SPARK-3215
 URL: https://issues.apache.org/jira/browse/SPARK-3215
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Reporter: Marcelo Vanzin


A quick description of the issue: as part of running Hive jobs on top of Spark, 
it's desirable to have a SparkContext that is running in the background and 
listening for job requests for a particular user session.

Running multiple contexts in the same JVM is not a very good solution. Not only 
does SparkContext currently have issues sharing the same JVM among multiple 
instances, but it also turns the JVM running the contexts into a huge bottleneck 
in the system.

So I'm proposing a solution where we have a SparkContext that is running in a 
separate process, and listening for requests from the client application via 
some RPC interface (most probably Akka).

I'll attach a document shortly with the current proposal. Let's use this bug to 
discuss the proposal and any other suggestions.
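
Purely to make the shape of the idea concrete (this is not the design in the attached document; the message type and actor below are invented for illustration):

{code}
import akka.actor.{Actor, ActorSystem, Props}
import org.apache.spark.{SparkConf, SparkContext}

// An example request type a client session might send.
case class CountWords(path: String)

// A long-lived process owns the SparkContext and serves requests over Akka.
class RemoteContextActor extends Actor {
  val sc = new SparkContext(new SparkConf().setAppName("remote-context"))
  def receive = {
    case CountWords(path) =>
      val count = sc.textFile(path).flatMap(_.split("\\s+")).count()
      sender ! (path -> count)
  }
}

// val system = ActorSystem("spark-remote")
// val contextRef = system.actorOf(Props[RemoteContextActor], "context")
{code}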






[jira] [Updated] (SPARK-3215) Add remote interface for SparkContext

2014-08-25 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin updated SPARK-3215:
--

Attachment: RemoteSparkContext.pdf

Initial proposal for a remote context interface.

Note that this is not a formal design document, just a high-level proposal, so 
it doesn't go deeply into what APIs would be exposed or anything like that.

 Add remote interface for SparkContext
 -

 Key: SPARK-3215
 URL: https://issues.apache.org/jira/browse/SPARK-3215
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Reporter: Marcelo Vanzin
  Labels: hive
 Attachments: RemoteSparkContext.pdf


 A quick description of the issue: as part of running Hive jobs on top of 
 Spark, it's desirable to have a SparkContext that is running in the 
 background and listening for job requests for a particular user session.
 Running multiple contexts in the same JVM is not a very good solution. Not 
 only does SparkContext currently have issues sharing the same JVM among multiple 
 instances, but it also turns the JVM running the contexts into a huge bottleneck 
 in the system.
 So I'm proposing a solution where we have a SparkContext that is running in a 
 separate process, and listening for requests from the client application via 
 some RPC interface (most probably Akka).
 I'll attach a document shortly with the current proposal. Let's use this bug 
 to discuss the proposal and any other suggestions.






[jira] [Created] (SPARK-3216) Spark-shell is broken for branch-1.0

2014-08-25 Thread Andrew Or (JIRA)
Andrew Or created SPARK-3216:


 Summary: Spark-shell is broken for branch-1.0
 Key: SPARK-3216
 URL: https://issues.apache.org/jira/browse/SPARK-3216
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: Andrew Or
Priority: Blocker


This fails when EC2 tries to clone the most recent version of Spark from 
branch-1.0. I marked this a blocker because this is completely broken, but it 
is technically not blocking anything.






[jira] [Created] (SPARK-3217) Shaded Guava jar doesn't play well with Maven build

2014-08-25 Thread Cheng Lian (JIRA)
Cheng Lian created SPARK-3217:
-

 Summary: Shaded Guava jar doesn't play well with Maven build
 Key: SPARK-3217
 URL: https://issues.apache.org/jira/browse/SPARK-3217
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 1.0.2
Reporter: Cheng Lian
Priority: Blocker


PR [#1813|https://github.com/apache/spark/pull/1813] shaded Guava jar file and 
moved Guava classes to package {{org.spark-project.guava}} when Spark is built 
by Maven. But code in {{org.apache.spark.util.Utils}} still refers to classes 
(e.g. {{ThreadFactoryBuilder}}) in package {{com.google.common}}.

The result is that, when Spark is built with Maven (or 
{{make-distribution.sh}}), commands like {{bin/spark-shell}} throw 
{{ClassNotFoundException}}:
{code}
# Build Spark with Maven
$ mvn clean package -Phive,hadoop-2.3 -Dhadoop.version=2.3.0 -DskipTests
...

# Then spark-shell complains
$ ./bin/spark-shell
Spark assembly has been built with Hive, including Datanucleus jars on classpath
Exception in thread main java.lang.NoClassDefFoundError: 
com/google/common/util/concurrent/ThreadFactoryBuilder
at org.apache.spark.util.Utils$.init(Utils.scala:636)
at org.apache.spark.util.Utils$.clinit(Utils.scala)
at org.apache.spark.repl.SparkILoop.init(SparkILoop.scala:134)
at org.apache.spark.repl.SparkILoop.init(SparkILoop.scala:65)
at org.apache.spark.repl.Main$.main(Main.scala:30)
at org.apache.spark.repl.Main.main(Main.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:317)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:73)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.ClassNotFoundException: 
com.google.common.util.concurrent.ThreadFactoryBuilder
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
... 13 more

# Check the assembly jar file
$ jar tf 
assembly/target/scala-2.10/spark-assembly-1.1.0-SNAPSHOT-hadoop2.3.0.jar | grep 
-i ThreadFactoryBuilder
org/spark-project/guava/common/util/concurrent/ThreadFactoryBuilder$1.class
org/spark-project/guava/common/util/concurrent/ThreadFactoryBuilder.class
{code}
SBT build is fine since we don't shade Guava with SBT right now (and that's why 
Jenkins didn't complain about this).

Possible solutions:
# revert PR #1813 to be safe, or
# also shade Guava in the SBT build and only use {{org.spark-project.guava}} in 
Spark



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3216) Spark-shell is broken for branch-1.0

2014-08-25 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14109947#comment-14109947
 ] 

Apache Spark commented on SPARK-3216:
-

User 'andrewor14' has created a pull request for this issue:
https://github.com/apache/spark/pull/2122

 Spark-shell is broken for branch-1.0
 

 Key: SPARK-3216
 URL: https://issues.apache.org/jira/browse/SPARK-3216
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: Andrew Or
Priority: Blocker

 This fails when EC2 tries to clone the most recent version of Spark from 
 branch-1.0. I marked this a blocker because this is completely broken, but it 
 is technically not blocking anything.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3189) Add Robust Regression Algorithm with Turkey bisquare weight function (Biweight Estimates)

2014-08-25 Thread Fan Jiang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fan Jiang updated SPARK-3189:
-

Issue Type: Sub-task  (was: New Feature)
Parent: SPARK-3188

 Add Robust Regression Algorithm with Turkey bisquare weight  function 
 (Biweight Estimates) 
 ---

 Key: SPARK-3189
 URL: https://issues.apache.org/jira/browse/SPARK-3189
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib
Affects Versions: 1.0.2
Reporter: Fan Jiang
Priority: Critical
  Labels: features
 Fix For: 1.1.1, 1.2.0

   Original Estimate: 0h
  Remaining Estimate: 0h

 Linear least squares estimates assume the errors have a normal distribution and 
 can behave badly when the errors are heavy-tailed. In practice we encounter 
 various types of data. We need to include Robust Regression to employ a 
 fitting criterion that is not as vulnerable as least squares.
 The Tukey bisquare weight function, also referred to as the biweight 
 function, produces an M-estimator that is more resistant to regression 
 outliers than the Huber M-estimator (Andersen 2008: 19).



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3216) Spark-shell is broken for branch-1.0

2014-08-25 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-3216:
-

Description: 
This fails when EC2 tries to clone the most recent version of Spark from 
branch-1.0. This does not actually affect any released distributions, and so I 
did not set the affected/fix/target versions. I marked this a blocker because 
this is completely broken, but it is technically not blocking anything.

This was caused by https://github.com/apache/spark/pull/1831, which broke 
spark-shell. The follow-up fix in https://github.com/apache/spark/pull/1825 was 
only merged into branch-1.1 and master, but not branch-1.0.

  was:
This fails when EC2 tries to clone the most recent version of Spark from 
branch-1.0. I marked this a blocker because this is completely broken, but it 
is technically not blocking anything.

This was caused by https://github.com/apache/spark/pull/1831, which broke 
spark-shell. The follow-up fix in https://github.com/apache/spark/pull/1825 was 
only merged into branch-1.1 and master, but not branch-1.0


 Spark-shell is broken for branch-1.0
 

 Key: SPARK-3216
 URL: https://issues.apache.org/jira/browse/SPARK-3216
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: Andrew Or
Priority: Blocker

 This fails when EC2 tries to clone the most recent version of Spark from 
 branch-1.0. This does not actually affect any released distributions, and so 
 I did not set the affected/fix/target versions. I marked this a blocker 
 because this is completely broken, but it is technically not blocking 
 anything.
 This was caused by https://github.com/apache/spark/pull/1831, which broke 
 spark-shell. The follow-up fix in https://github.com/apache/spark/pull/1825 
 was only merged into branch-1.1 and master, but not branch-1.0.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3216) Spark-shell is broken for branch-1.0

2014-08-25 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-3216:
-

Description: 
This fails when EC2 tries to clone the most recent version of Spark from 
branch-1.0. I marked this a blocker because this is completely broken, but it 
is technically not blocking anything.

This was caused by https://github.com/apache/spark/pull/1831, which broke 
spark-shell. The follow-up fix in https://github.com/apache/spark/pull/1825 was 
only merged into branch-1.1 and master, but not branch-1.0

  was:This fails when EC2 tries to clone the most recent version of Spark from 
branch-1.0. I marked this a blocker because this is completely broken, but it 
is technically not blocking anything.


 Spark-shell is broken for branch-1.0
 

 Key: SPARK-3216
 URL: https://issues.apache.org/jira/browse/SPARK-3216
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: Andrew Or
Priority: Blocker

 This fails when EC2 tries to clone the most recent version of Spark from 
 branch-1.0. I marked this a blocker because this is completely broken, but it 
 is technically not blocking anything.
 This was caused by https://github.com/apache/spark/pull/1831, which broke 
 spark-shell. The follow-up fix in https://github.com/apache/spark/pull/1825 
 was only merged into branch-1.1 and master, but not branch-1.0



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3188) Add Robust Regression Algorithm with Tukey bisquare weight function (Biweight Estimates)

2014-08-25 Thread Fan Jiang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fan Jiang updated SPARK-3188:
-

Summary: Add Robust Regression Algorithm with Tukey bisquare weight  
function (Biweight Estimates)   (was: Add Robust Regression Algorithm with 
Turkey bisquare weight  function (Biweight Estimates) )

 Add Robust Regression Algorithm with Tukey bisquare weight  function 
 (Biweight Estimates) 
 --

 Key: SPARK-3188
 URL: https://issues.apache.org/jira/browse/SPARK-3188
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Affects Versions: 1.0.2
Reporter: Fan Jiang
Priority: Critical
  Labels: features
 Fix For: 1.1.1, 1.2.0

   Original Estimate: 0h
  Remaining Estimate: 0h

 Linear least square estimates assume the error has normal distribution and 
 can behave badly when the errors are heavy-tailed. In practice we get 
 various types of data. We need to include Robust Regression to employ a 
 fitting criterion that is not as vulnerable as least square.
 The Turkey bisquare weight function, also referred to as the biweight 
 function, produces an M-estimator that is more resistant to regression 
 outliers than the Huber M-estimator (Andersen 2008: 19).



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3188) Add Robust Regression Algorithm with Tukey bisquare weight function (Biweight Estimates)

2014-08-25 Thread Fan Jiang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fan Jiang updated SPARK-3188:
-

Description: 
Linear least square estimates assume the error has normal distribution and can 
behave badly when the errors are heavy-tailed. In practice we get various 
types of data. We need to include Robust Regression to employ a fitting 
criterion that is not as vulnerable as least square.

The Tukey bisquare weight function, also referred to as the biweight function, 
produces an M-estimator that is more resistant to regression outliers than the 
Huber M-estimator (Andersen 2008: 19).



  was:
Linear least square estimates assume the error has normal distribution and can 
behave badly when the errors are heavy-tailed. In practical we get various 
types of data. We need to include Robust Regression to employ a fitting 
criterion that is not as vulnerable as least square.

The Turkey bisquare weight function, also referred to as the biweight function, 
produces an M-estimator that is more resistant to regression outliers than the 
Huber M-estimator (Andersen 2008: 19).




 Add Robust Regression Algorithm with Tukey bisquare weight  function 
 (Biweight Estimates) 
 --

 Key: SPARK-3188
 URL: https://issues.apache.org/jira/browse/SPARK-3188
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Affects Versions: 1.0.2
Reporter: Fan Jiang
Priority: Critical
  Labels: features
 Fix For: 1.1.1, 1.2.0

   Original Estimate: 0h
  Remaining Estimate: 0h

 Linear least square estimates assume the error has normal distribution and 
 can behave badly when the errors are heavy-tailed. In practice we get 
 various types of data. We need to include Robust Regression to employ a 
 fitting criterion that is not as vulnerable as least square.
 The Tukey bisquare weight function, also referred to as the biweight 
 function, produces an M-estimator that is more resistant to regression 
 outliers than the Huber M-estimator (Andersen 2008: 19).
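
For reference, a minimal Scala sketch of the Tukey bisquare (biweight) weight function described above; the tuning constant c = 4.685 is the conventional choice for ~95% efficiency under Gaussian errors and is an assumption here, not a value taken from this ticket.

{code}
// Tukey bisquare (biweight) weight: w(u) = (1 - (u/c)^2)^2 for |u/c| < 1, else 0.
def tukeyBisquareWeight(residual: Double, c: Double = 4.685): Double = {
  val u = residual / c
  if (math.abs(u) < 1.0) {
    val t = 1.0 - u * u
    t * t          // smooth down-weighting inside the window
  } else {
    0.0            // residuals beyond c get zero weight (resistant to outliers)
  }
}
{code}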



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-3204) MaxOf would be foldable if both left and right are foldable.

2014-08-25 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-3204.
-

   Resolution: Fixed
Fix Version/s: 1.1.0
 Assignee: Takuya Ueshin

 MaxOf would be foldable if both left and right are foldable.
 

 Key: SPARK-3204
 URL: https://issues.apache.org/jira/browse/SPARK-3204
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Takuya Ueshin
Assignee: Takuya Ueshin
 Fix For: 1.1.0






--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-2929) Rewrite HiveThriftServer2Suite and CliSuite

2014-08-25 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-2929.
-

   Resolution: Fixed
Fix Version/s: 1.1.0

 Rewrite HiveThriftServer2Suite and CliSuite
 ---

 Key: SPARK-2929
 URL: https://issues.apache.org/jira/browse/SPARK-2929
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.0.1, 1.0.2
Reporter: Cheng Lian
Assignee: Cheng Lian
 Fix For: 1.1.0


 {{HiveThriftServer2Suite}} and {{CliSuite}} were inherited from Shark and 
 contain too may hard coded timeouts and timing assumptions when doing IPC. 
 This makes these tests both flaky and slow.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3179) Add task OutputMetrics

2014-08-25 Thread Michael Yannakopoulos (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14109997#comment-14109997
 ] 

Michael Yannakopoulos commented on SPARK-3179:
--

Hi Sandy,

I am willing to help with this issue. I am new to Apache Spark and have made a 
few contributions so far. Under your supervision I can work on this issue.

Thanks,
Michael

 Add task OutputMetrics
 --

 Key: SPARK-3179
 URL: https://issues.apache.org/jira/browse/SPARK-3179
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Sandy Ryza

 Track the bytes that tasks write to HDFS or other output destinations.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3061) Maven build fails in Windows OS

2014-08-25 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3061?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-3061:
--

Affects Version/s: 1.1.0

Maybe we can use a Maven plugin to unzip?  
http://stackoverflow.com/questions/3264064/unpack-zip-in-zip-with-maven

 Maven build fails in Windows OS
 ---

 Key: SPARK-3061
 URL: https://issues.apache.org/jira/browse/SPARK-3061
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 1.0.2, 1.1.0
 Environment: Windows
Reporter: Masayoshi TSUZUKI
Priority: Minor

 Maven build fails in Windows OS with this error message.
 {noformat}
 [ERROR] Failed to execute goal org.codehaus.mojo:exec-maven-plugin:1.2.1:exec 
 (default) on project spark-core_2.10: Command execution failed. Cannot run 
 program unzip (in directory C:\path\to\gitofspark\python): CreateProcess 
 error=2, 指定されたファイルが見つかりません (The system cannot find the file specified) - [Help 1]
 {noformat}



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2087) Clean Multi-user semantics for thrift JDBC/ODBC server.

2014-08-25 Thread Yi Tian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14110026#comment-14110026
 ] 

Yi Tian commented on SPARK-2087:


You mean the CACHE TABLE ... AS SELECT ... syntax will create a temporary table 
that cannot be found by other sessions? 
I'm still confused about the difference between temporary tables and cached 
tables.

 Clean Multi-user semantics for thrift JDBC/ODBC server.
 ---

 Key: SPARK-2087
 URL: https://issues.apache.org/jira/browse/SPARK-2087
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Michael Armbrust
Assignee: Zongheng Yang
Priority: Minor

 Configuration and temporary tables should exist per-user.  Cached tables 
 should be shared across users.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3217) Shaded Guava jar doesn't play well with Maven build

2014-08-25 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14110035#comment-14110035
 ] 

Marcelo Vanzin commented on SPARK-3217:
---

Just did a git clean -dfx on master and rebuilt using maven. This works fine 
for me.

Did you by any chance do one of the following:
- forget to clean after pulling that change
- mix sbt and mvn built artifacts in the same build
- set SPARK_PREPEND_CLASSES

I can see any of those causing this issue. I think only the last one is 
something we need to worry about; we now need to figure out a way to add the 
guava jar to the classpath when using that option.

 Shaded Guava jar doesn't play well with Maven build
 ---

 Key: SPARK-3217
 URL: https://issues.apache.org/jira/browse/SPARK-3217
 Project: Spark
  Issue Type: Bug
  Components: Build
Reporter: Cheng Lian
Priority: Blocker

 PR [#1813|https://github.com/apache/spark/pull/1813] shaded Guava jar file 
 and moved Guava classes to package {{org.spark-project.guava}} when Spark is 
 built by Maven. But code in {{org.apache.spark.util.Utils}} still refers to 
 classes (e.g. {{ThreadFactoryBuilder}}) in package {{com.google.common}}.
 The result is that, when Spark is built with Maven (or 
 {{make-distribution.sh}}), commands like {{bin/spark-shell}} throws 
 {{ClassNotFoundException}}:
 {code}
 # Build Spark with Maven
 $ mvn clean package -Phive,hadoop-2.3 -Dhadoop.version=2.3.0 -DskipTests
 ...
 # Then spark-shell complains
 $ ./bin/spark-shell
 Spark assembly has been built with Hive, including Datanucleus jars on 
 classpath
 Exception in thread main java.lang.NoClassDefFoundError: 
 com/google/common/util/concurrent/ThreadFactoryBuilder
 at org.apache.spark.util.Utils$.init(Utils.scala:636)
 at org.apache.spark.util.Utils$.clinit(Utils.scala)
 at org.apache.spark.repl.SparkILoop.init(SparkILoop.scala:134)
 at org.apache.spark.repl.SparkILoop.init(SparkILoop.scala:65)
 at org.apache.spark.repl.Main$.main(Main.scala:30)
 at org.apache.spark.repl.Main.main(Main.scala)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
 at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:606)
 at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:317)
 at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:73)
 at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
 Caused by: java.lang.ClassNotFoundException: 
 com.google.common.util.concurrent.ThreadFactoryBuilder
 at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
 at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
 at java.security.AccessController.doPrivileged(Native Method)
 at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
 at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
 ... 13 more
 # Check the assembly jar file
 $ jar tf 
 assembly/target/scala-2.10/spark-assembly-1.1.0-SNAPSHOT-hadoop2.3.0.jar | 
 grep -i ThreadFactoryBuilder
 org/spark-project/guava/common/util/concurrent/ThreadFactoryBuilder$1.class
 org/spark-project/guava/common/util/concurrent/ThreadFactoryBuilder.class
 {code}
 SBT build is fine since we don't shade Guava with SBT right now (and that's 
 why Jenkins didn't complain about this).
 Possible solutions can be:
 # revert PR #1813 for safe, or
 # also shade Guava in SBT build and only use {{org.spark-project.guava}} in 
 Spark



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3217) Shaded Guava jar doesn't play well with Maven build

2014-08-25 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-3217:
--

Affects Version/s: (was: 1.0.2)

 Shaded Guava jar doesn't play well with Maven build
 ---

 Key: SPARK-3217
 URL: https://issues.apache.org/jira/browse/SPARK-3217
 Project: Spark
  Issue Type: Bug
  Components: Build
Reporter: Cheng Lian
Priority: Blocker

 PR [#1813|https://github.com/apache/spark/pull/1813] shaded Guava jar file 
 and moved Guava classes to package {{org.spark-project.guava}} when Spark is 
 built by Maven. But code in {{org.apache.spark.util.Utils}} still refers to 
 classes (e.g. {{ThreadFactoryBuilder}}) in package {{com.google.common}}.
 The result is that, when Spark is built with Maven (or 
 {{make-distribution.sh}}), commands like {{bin/spark-shell}} throws 
 {{ClassNotFoundException}}:
 {code}
 # Build Spark with Maven
 $ mvn clean package -Phive,hadoop-2.3 -Dhadoop.version=2.3.0 -DskipTests
 ...
 # Then spark-shell complains
 $ ./bin/spark-shell
 Spark assembly has been built with Hive, including Datanucleus jars on 
 classpath
 Exception in thread main java.lang.NoClassDefFoundError: 
 com/google/common/util/concurrent/ThreadFactoryBuilder
 at org.apache.spark.util.Utils$.init(Utils.scala:636)
 at org.apache.spark.util.Utils$.clinit(Utils.scala)
 at org.apache.spark.repl.SparkILoop.init(SparkILoop.scala:134)
 at org.apache.spark.repl.SparkILoop.init(SparkILoop.scala:65)
 at org.apache.spark.repl.Main$.main(Main.scala:30)
 at org.apache.spark.repl.Main.main(Main.scala)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
 at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:606)
 at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:317)
 at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:73)
 at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
 Caused by: java.lang.ClassNotFoundException: 
 com.google.common.util.concurrent.ThreadFactoryBuilder
 at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
 at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
 at java.security.AccessController.doPrivileged(Native Method)
 at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
 at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
 ... 13 more
 # Check the assembly jar file
 $ jar tf 
 assembly/target/scala-2.10/spark-assembly-1.1.0-SNAPSHOT-hadoop2.3.0.jar | 
 grep -i ThreadFactoryBuilder
 org/spark-project/guava/common/util/concurrent/ThreadFactoryBuilder$1.class
 org/spark-project/guava/common/util/concurrent/ThreadFactoryBuilder.class
 {code}
 SBT build is fine since we don't shade Guava with SBT right now (and that's 
 why Jenkins didn't complain about this).
 Possible solutions can be:
 # revert PR #1813 for safe, or
 # also shade Guava in SBT build and only use {{org.spark-project.guava}} in 
 Spark



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3217) Shaded Guava jar doesn't play well with Maven build

2014-08-25 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-3217:
--

Labels: 1.2.0  (was: )

 Shaded Guava jar doesn't play well with Maven build
 ---

 Key: SPARK-3217
 URL: https://issues.apache.org/jira/browse/SPARK-3217
 Project: Spark
  Issue Type: Bug
  Components: Build
Reporter: Cheng Lian
Priority: Blocker
  Labels: 1.2.0

 PR [#1813|https://github.com/apache/spark/pull/1813] shaded Guava jar file 
 and moved Guava classes to package {{org.spark-project.guava}} when Spark is 
 built by Maven. But code in {{org.apache.spark.util.Utils}} still refers to 
 classes (e.g. {{ThreadFactoryBuilder}}) in package {{com.google.common}}.
 The result is that, when Spark is built with Maven (or 
 {{make-distribution.sh}}), commands like {{bin/spark-shell}} throws 
 {{ClassNotFoundException}}:
 {code}
 # Build Spark with Maven
 $ mvn clean package -Phive,hadoop-2.3 -Dhadoop.version=2.3.0 -DskipTests
 ...
 # Then spark-shell complains
 $ ./bin/spark-shell
 Spark assembly has been built with Hive, including Datanucleus jars on 
 classpath
 Exception in thread main java.lang.NoClassDefFoundError: 
 com/google/common/util/concurrent/ThreadFactoryBuilder
 at org.apache.spark.util.Utils$.init(Utils.scala:636)
 at org.apache.spark.util.Utils$.clinit(Utils.scala)
 at org.apache.spark.repl.SparkILoop.init(SparkILoop.scala:134)
 at org.apache.spark.repl.SparkILoop.init(SparkILoop.scala:65)
 at org.apache.spark.repl.Main$.main(Main.scala:30)
 at org.apache.spark.repl.Main.main(Main.scala)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
 at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:606)
 at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:317)
 at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:73)
 at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
 Caused by: java.lang.ClassNotFoundException: 
 com.google.common.util.concurrent.ThreadFactoryBuilder
 at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
 at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
 at java.security.AccessController.doPrivileged(Native Method)
 at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
 at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
 ... 13 more
 # Check the assembly jar file
 $ jar tf 
 assembly/target/scala-2.10/spark-assembly-1.1.0-SNAPSHOT-hadoop2.3.0.jar | 
 grep -i ThreadFactoryBuilder
 org/spark-project/guava/common/util/concurrent/ThreadFactoryBuilder$1.class
 org/spark-project/guava/common/util/concurrent/ThreadFactoryBuilder.class
 {code}
 SBT build is fine since we don't shade Guava with SBT right now (and that's 
 why Jenkins didn't complain about this).
 Possible solutions can be:
 # revert PR #1813 for safe, or
 # also shade Guava in SBT build and only use {{org.spark-project.guava}} in 
 Spark



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3217) Shaded Guava jar doesn't play well with Maven build

2014-08-25 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-3217:
--

Target Version/s: 1.2.0  (was: 1.1.0)

 Shaded Guava jar doesn't play well with Maven build
 ---

 Key: SPARK-3217
 URL: https://issues.apache.org/jira/browse/SPARK-3217
 Project: Spark
  Issue Type: Bug
  Components: Build
Reporter: Cheng Lian
Priority: Blocker
  Labels: 1.2.0

 PR [#1813|https://github.com/apache/spark/pull/1813] shaded Guava jar file 
 and moved Guava classes to package {{org.spark-project.guava}} when Spark is 
 built by Maven. But code in {{org.apache.spark.util.Utils}} still refers to 
 classes (e.g. {{ThreadFactoryBuilder}}) in package {{com.google.common}}.
 The result is that, when Spark is built with Maven (or 
 {{make-distribution.sh}}), commands like {{bin/spark-shell}} throws 
 {{ClassNotFoundException}}:
 {code}
 # Build Spark with Maven
 $ mvn clean package -Phive,hadoop-2.3 -Dhadoop.version=2.3.0 -DskipTests
 ...
 # Then spark-shell complains
 $ ./bin/spark-shell
 Spark assembly has been built with Hive, including Datanucleus jars on 
 classpath
 Exception in thread main java.lang.NoClassDefFoundError: 
 com/google/common/util/concurrent/ThreadFactoryBuilder
 at org.apache.spark.util.Utils$.init(Utils.scala:636)
 at org.apache.spark.util.Utils$.clinit(Utils.scala)
 at org.apache.spark.repl.SparkILoop.init(SparkILoop.scala:134)
 at org.apache.spark.repl.SparkILoop.init(SparkILoop.scala:65)
 at org.apache.spark.repl.Main$.main(Main.scala:30)
 at org.apache.spark.repl.Main.main(Main.scala)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
 at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:606)
 at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:317)
 at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:73)
 at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
 Caused by: java.lang.ClassNotFoundException: 
 com.google.common.util.concurrent.ThreadFactoryBuilder
 at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
 at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
 at java.security.AccessController.doPrivileged(Native Method)
 at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
 at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
 ... 13 more
 # Check the assembly jar file
 $ jar tf 
 assembly/target/scala-2.10/spark-assembly-1.1.0-SNAPSHOT-hadoop2.3.0.jar | 
 grep -i ThreadFactoryBuilder
 org/spark-project/guava/common/util/concurrent/ThreadFactoryBuilder$1.class
 org/spark-project/guava/common/util/concurrent/ThreadFactoryBuilder.class
 {code}
 SBT build is fine since we don't shade Guava with SBT right now (and that's 
 why Jenkins didn't complain about this).
 Possible solutions can be:
 # revert PR #1813 for safe, or
 # also shade Guava in SBT build and only use {{org.spark-project.guava}} in 
 Spark



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-3058) Support EXTENDED for EXPLAIN command

2014-08-25 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-3058.
-

   Resolution: Fixed
Fix Version/s: 1.1.0

 Support EXTENDED for EXPLAIN command
 

 Key: SPARK-3058
 URL: https://issues.apache.org/jira/browse/SPARK-3058
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Cheng Hao
Assignee: Cheng Hao
Priority: Minor
 Fix For: 1.1.0


 Currently there is no difference when running the EXPLAIN command with or 
 without the EXTENDED keyword; this patch will show more details of the query 
 plan when the EXTENDED keyword is provided.
 {panel:title=EXPLAIN with EXTENDED}
 explain extended select key as a1, value as a2 from src where key=1;
 == Parsed Logical Plan ==
 Project ['key AS a1#3,'value AS a2#4]
  Filter ('key = 1)
   UnresolvedRelation None, src, None
 == Analyzed Logical Plan ==
 Project [key#8 AS a1#3,value#9 AS a2#4]
  Filter (CAST(key#8, DoubleType) = CAST(1, DoubleType))
   MetastoreRelation default, src, None
 == Optimized Logical Plan ==
 Project [key#8 AS a1#3,value#9 AS a2#4]
  Filter (CAST(key#8, DoubleType) = 1.0)
   MetastoreRelation default, src, None
 == Physical Plan ==
 Project [key#8 AS a1#3,value#9 AS a2#4]
  Filter (CAST(key#8, DoubleType) = 1.0)
   HiveTableScan [key#8,value#9], (MetastoreRelation default, src, None), None
 Code Generation: false
 == RDD ==
 (2) MappedRDD[14] at map at HiveContext.scala:350
   MapPartitionsRDD[13] at mapPartitions at basicOperators.scala:42
   MapPartitionsRDD[12] at mapPartitions at basicOperators.scala:57
   MapPartitionsRDD[11] at mapPartitions at TableReader.scala:112
   MappedRDD[10] at map at TableReader.scala:240
   HadoopRDD[9] at HadoopRDD at TableReader.scala:230
 {panel}



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3217) Shaded Guava jar doesn't play well with Maven build

2014-08-25 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-3217:
---

Labels:   (was: 1.2.0)

 Shaded Guava jar doesn't play well with Maven build
 ---

 Key: SPARK-3217
 URL: https://issues.apache.org/jira/browse/SPARK-3217
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 1.2.0
Reporter: Cheng Lian
Priority: Blocker

 PR [#1813|https://github.com/apache/spark/pull/1813] shaded Guava jar file 
 and moved Guava classes to package {{org.spark-project.guava}} when Spark is 
 built by Maven. But code in {{org.apache.spark.util.Utils}} still refers to 
 classes (e.g. {{ThreadFactoryBuilder}}) in package {{com.google.common}}.
 The result is that, when Spark is built with Maven (or 
 {{make-distribution.sh}}), commands like {{bin/spark-shell}} throws 
 {{ClassNotFoundException}}:
 {code}
 # Build Spark with Maven
 $ mvn clean package -Phive,hadoop-2.3 -Dhadoop.version=2.3.0 -DskipTests
 ...
 # Then spark-shell complains
 $ ./bin/spark-shell
 Spark assembly has been built with Hive, including Datanucleus jars on 
 classpath
 Exception in thread main java.lang.NoClassDefFoundError: 
 com/google/common/util/concurrent/ThreadFactoryBuilder
 at org.apache.spark.util.Utils$.init(Utils.scala:636)
 at org.apache.spark.util.Utils$.clinit(Utils.scala)
 at org.apache.spark.repl.SparkILoop.init(SparkILoop.scala:134)
 at org.apache.spark.repl.SparkILoop.init(SparkILoop.scala:65)
 at org.apache.spark.repl.Main$.main(Main.scala:30)
 at org.apache.spark.repl.Main.main(Main.scala)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
 at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:606)
 at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:317)
 at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:73)
 at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
 Caused by: java.lang.ClassNotFoundException: 
 com.google.common.util.concurrent.ThreadFactoryBuilder
 at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
 at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
 at java.security.AccessController.doPrivileged(Native Method)
 at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
 at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
 ... 13 more
 # Check the assembly jar file
 $ jar tf 
 assembly/target/scala-2.10/spark-assembly-1.1.0-SNAPSHOT-hadoop2.3.0.jar | 
 grep -i ThreadFactoryBuilder
 org/spark-project/guava/common/util/concurrent/ThreadFactoryBuilder$1.class
 org/spark-project/guava/common/util/concurrent/ThreadFactoryBuilder.class
 {code}
 SBT build is fine since we don't shade Guava with SBT right now (and that's 
 why Jenkins didn't complain about this).
 Possible solutions can be:
 # revert PR #1813 for safe, or
 # also shade Guava in SBT build and only use {{org.spark-project.guava}} in 
 Spark



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3217) Shaded Guava jar doesn't play well with Maven build

2014-08-25 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-3217:
---

Affects Version/s: 1.2.0

 Shaded Guava jar doesn't play well with Maven build
 ---

 Key: SPARK-3217
 URL: https://issues.apache.org/jira/browse/SPARK-3217
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 1.2.0
Reporter: Cheng Lian
Priority: Blocker

 PR [#1813|https://github.com/apache/spark/pull/1813] shaded Guava jar file 
 and moved Guava classes to package {{org.spark-project.guava}} when Spark is 
 built by Maven. But code in {{org.apache.spark.util.Utils}} still refers to 
 classes (e.g. {{ThreadFactoryBuilder}}) in package {{com.google.common}}.
 The result is that, when Spark is built with Maven (or 
 {{make-distribution.sh}}), commands like {{bin/spark-shell}} throws 
 {{ClassNotFoundException}}:
 {code}
 # Build Spark with Maven
 $ mvn clean package -Phive,hadoop-2.3 -Dhadoop.version=2.3.0 -DskipTests
 ...
 # Then spark-shell complains
 $ ./bin/spark-shell
 Spark assembly has been built with Hive, including Datanucleus jars on 
 classpath
 Exception in thread main java.lang.NoClassDefFoundError: 
 com/google/common/util/concurrent/ThreadFactoryBuilder
 at org.apache.spark.util.Utils$.init(Utils.scala:636)
 at org.apache.spark.util.Utils$.clinit(Utils.scala)
 at org.apache.spark.repl.SparkILoop.init(SparkILoop.scala:134)
 at org.apache.spark.repl.SparkILoop.init(SparkILoop.scala:65)
 at org.apache.spark.repl.Main$.main(Main.scala:30)
 at org.apache.spark.repl.Main.main(Main.scala)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
 at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:606)
 at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:317)
 at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:73)
 at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
 Caused by: java.lang.ClassNotFoundException: 
 com.google.common.util.concurrent.ThreadFactoryBuilder
 at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
 at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
 at java.security.AccessController.doPrivileged(Native Method)
 at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
 at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
 ... 13 more
 # Check the assembly jar file
 $ jar tf 
 assembly/target/scala-2.10/spark-assembly-1.1.0-SNAPSHOT-hadoop2.3.0.jar | 
 grep -i ThreadFactoryBuilder
 org/spark-project/guava/common/util/concurrent/ThreadFactoryBuilder$1.class
 org/spark-project/guava/common/util/concurrent/ThreadFactoryBuilder.class
 {code}
 SBT build is fine since we don't shade Guava with SBT right now (and that's 
 why Jenkins didn't complain about this).
 Possible solutions can be:
 # revert PR #1813 for safe, or
 # also shade Guava in SBT build and only use {{org.spark-project.guava}} in 
 Spark



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3178) setting SPARK_WORKER_MEMORY to a value without a label (m or g) sets the worker memory limit to zero

2014-08-25 Thread Helena Edelson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14110116#comment-14110116
 ] 

Helena Edelson commented on SPARK-3178:
---

+1, it doesn't look like the input value is validated to fail fast when the m/g 
suffix is missing.

 setting SPARK_WORKER_MEMORY to a value without a label (m or g) sets the 
 worker memory limit to zero
 

 Key: SPARK-3178
 URL: https://issues.apache.org/jira/browse/SPARK-3178
 Project: Spark
  Issue Type: Bug
 Environment: osx
Reporter: Jon Haddad

 This should either default to m or just completely fail.  Starting a worker 
 with zero memory isn't very helpful.
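
A minimal sketch of the kind of fail-fast validation being asked for (illustrative only; this is not the actual Spark code path, and the helper name is made up):

{code}
// Parse a worker memory setting, insisting on an explicit m/g suffix.
def parseWorkerMemoryMb(setting: String): Int = setting.trim.toLowerCase match {
  case s if s.endsWith("g") => s.dropRight(1).toInt * 1024
  case s if s.endsWith("m") => s.dropRight(1).toInt
  case other =>
    throw new IllegalArgumentException(
      s"SPARK_WORKER_MEMORY must end in 'm' or 'g' (got: '$other')")
}
{code}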



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3218) K-Means clusterer can fail on degenerate data

2014-08-25 Thread Derrick Burns (JIRA)
Derrick Burns created SPARK-3218:


 Summary: K-Means clusterer can fail on degenerate data
 Key: SPARK-3218
 URL: https://issues.apache.org/jira/browse/SPARK-3218
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 1.0.2
Reporter: Derrick Burns


The KMeans parallel implementation selects points to be cluster centers with 
probability weighted by their distance to cluster centers.  However, if there 
are fewer than k DISTINCT points in the data set, this approach will fail.  

Further, the recent check-in to work around this problem results in the same 
point being selected repeatedly as a cluster center. 

The fix is to allow fewer than k cluster centers to be selected.  This requires 
several changes to the code, as the number of cluster centers is woven into the 
implementation.

I have a version of the code that addresses this problem, AND generalizes the 
distance metric.  However, I see that there are literally hundreds of 
outstanding pull requests.  If someone will commit to working with me to 
sponsor the pull request, I will create it.
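
To illustrate the degenerate case in isolation: any scheme that insists on exactly k distinct centers must fail or repeat points when the data has fewer than k distinct values, so one natural guard is to cap the number of centers. This is only a sketch of the idea, not the actual MLlib change.

{code}
// Cap the number of cluster centers at the number of distinct points.
def chooseNumCenters(points: Seq[Seq[Double]], k: Int): Int = {
  val distinctPoints = points.distinct.size
  math.min(k, distinctPoints)   // allow fewer than k centers for degenerate data
}
{code}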




--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-3193) output errer info when Process exitcode not zero

2014-08-25 Thread wangfei (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wangfei reopened SPARK-3193:



 output errer info when Process exitcode not zero
 

 Key: SPARK-3193
 URL: https://issues.apache.org/jira/browse/SPARK-3193
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.0.2
Reporter: wangfei

 I noticed that sometimes PR tests fail because the Process exit code != 0:
 DriverSuite: 
 Spark assembly has been built with Hive, including Datanucleus jars on 
 classpath 
 - driver should exit after finishing *** FAILED *** 
SparkException was thrown during property evaluation. 
 (DriverSuite.scala:40) 
  Message: Process List(./bin/spark-class, 
 org.apache.spark.DriverWithoutCleanup, local) exited with code 1 
  Occurred at table row 0 (zero based, not counting headings), which had 
 values ( 
master = local 
  ) 
  
 [info] SparkSubmitSuite:
 [info] - prints usage on empty input
 [info] - prints usage with only --help
 [info] - prints error with unrecognized options
 [info] - handle binary specified but not class
 [info] - handles arguments with --key=val
 [info] - handles arguments to user program
 [info] - handles arguments to user program with name collision
 [info] - handles YARN cluster mode
 [info] - handles YARN client mode
 [info] - handles standalone cluster mode
 [info] - handles standalone client mode
 [info] - handles mesos client mode
 [info] - handles confs with flag equivalents
 [info] - launch simple application with spark-submit *** FAILED ***
 [info]   org.apache.spark.SparkException: Process List(./bin/spark-submit, 
 --class, org.apache.spark.deploy.SimpleApplicationTest, --name, testApp, 
 --master, local, file:/tmp/1408854098404-0/testJar-1408854098404.jar) exited 
 with code 1
 [info]   at org.apache.spark.util.Utils$.executeAndGetOutput(Utils.scala:872)
 [info]   at 
 org.apache.spark.deploy.SparkSubmitSuite.runSparkSubmit(SparkSubmitSuite.scala:311)
 [info]   at 
 org.apache.spark.deploy.SparkSubmitSuite$$anonfun$14.apply$mcV$sp(SparkSubmitSuite.scala:291)
 [info]   at 
 org.apache.spark.deploy.SparkSubmitSuite$$anonfun$14.apply(SparkSubmitSuite.scala:284)
 [info]   at org.apacSpark assembly has been built with Hive, including 
 Datanucleus jars on classpath
 refer to 
 https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18688/consoleFull
 https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19118/consoleFull
 we should output the process error info when it fails; this can be helpful for 
 diagnosis.
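
A minimal sketch of the suggested improvement (illustrative only, using plain scala.sys.process rather than the actual Spark helper): capture stderr and include it in the exception when the exit code is non-zero.

{code}
import scala.sys.process._

def runAndCheck(cmd: Seq[String]): String = {
  val out = new StringBuilder
  val err = new StringBuilder
  val logger = ProcessLogger(
    line => { out.append(line).append('\n') },   // collect stdout
    line => { err.append(line).append('\n') })   // collect stderr
  val code = cmd ! logger
  if (code != 0) {
    // Surface the captured error output instead of just the exit code.
    throw new RuntimeException(
      s"Process ${cmd.mkString(" ")} exited with code $code:\n$err")
  }
  out.toString
}
{code}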



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2921) Mesos doesn't handle spark.executor.extraJavaOptions correctly (among other things)

2014-08-25 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14110127#comment-14110127
 ] 

Cheng Lian commented on SPARK-2921:
---

[~andrewor14] {{spark.executor.extraLibraryPath}} is affected. But 
{{spark.executor.extraClassPath}} should be OK since it's finally added to the 
environment variable {{SPARK_CLASSPATH}}. 

 Mesos doesn't handle spark.executor.extraJavaOptions correctly (among other 
 things)
 ---

 Key: SPARK-2921
 URL: https://issues.apache.org/jira/browse/SPARK-2921
 Project: Spark
  Issue Type: Bug
  Components: Mesos
Affects Versions: 1.0.2
Reporter: Andrew Or
Priority: Blocker
 Fix For: 1.1.0


 The code path to handle this exists only for the coarse grained mode, and 
 even in this mode the java options aren't passed to the executors properly. 
 We currently pass the entire value of spark.executor.extraJavaOptions to the 
 executors as a string without splitting it. We need to use 
 Utils.splitCommandString as in standalone mode.
 I have not confirmed this, but I would assume spark.executor.extraClassPath 
 and spark.executor.extraLibraryPath are also not propagated correctly in 
 either mode.
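
To make the difference concrete, a small sketch (the option value is made up; a plain whitespace split is shown, whereas the real Utils.splitCommandString also honors quoting):

{code}
val extraJavaOptions = "-XX:+UseG1GC -Dspark.foo=bar"   // hypothetical value

// What reportedly happens today in coarse-grained Mesos mode: one opaque argument.
val unsplit: Seq[String] = Seq(extraJavaOptions)

// What standalone mode effectively does: pass each JVM flag as its own argument.
val split: Seq[String] = extraJavaOptions.split("\\s+").toSeq
{code}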



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3219) K-Means clusterer should support Bregman distance metrics

2014-08-25 Thread Derrick Burns (JIRA)
Derrick Burns created SPARK-3219:


 Summary: K-Means clusterer should support Bregman distance metrics
 Key: SPARK-3219
 URL: https://issues.apache.org/jira/browse/SPARK-3219
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Derrick Burns


The K-Means clusterer supports the Euclidean distance metric.  However, it is 
rather straightforward to support Bregman 
(http://machinelearning.wustl.edu/mlpapers/paper_files/BanerjeeMDG05.pdf) 
distance functions which would increase the utility of the clusterer 
tremendously.

I have modified the clusterer to support pluggable distance functions.  
However, I notice that there are hundreds of outstanding pull requests.  If 
someone is willing to work with me to sponsor the work through the process, I 
will create a pull request.  Otherwise, I will just keep my own fork.
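
A hypothetical sketch of what a pluggable distance function could look like; the trait and object names are made up for illustration and are not the API proposed here.

{code}
// A pluggable point-to-point divergence for the clusterer.
trait PointDivergence {
  def divergence(a: Array[Double], b: Array[Double]): Double
}

// Squared Euclidean distance, which is the Bregman divergence generated by ||x||^2.
object SquaredEuclidean extends PointDivergence {
  def divergence(a: Array[Double], b: Array[Double]): Double = {
    var sum = 0.0
    var i = 0
    while (i < a.length) {
      val d = a(i) - b(i)
      sum += d * d
      i += 1
    }
    sum
  }
}
{code}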



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3220) K-Means clusterer should perform K-Means initialization in parallel

2014-08-25 Thread Derrick Burns (JIRA)
Derrick Burns created SPARK-3220:


 Summary: K-Means clusterer should perform K-Means initialization 
in parallel
 Key: SPARK-3220
 URL: https://issues.apache.org/jira/browse/SPARK-3220
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Derrick Burns


The LocalKMeans method should be replaced with a parallel implementation.  As 
it stands now, it becomes a bottleneck for large data sets. 

I have implemented this functionality in my version of the clusterer.  However, 
I see that there are hundreds of outstanding pull requests.  If someone on the 
team wants to sponsor the pull request, I will create one.  Otherwise, I will 
just maintain my own private fork of the clusterer.
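
As a rough sketch of the idea (not the MLlib code), one Lloyd-style refinement step over the sampled candidate centers can be expressed as a distributed job instead of running serially on the driver:

{code}
import org.apache.spark.SparkContext._   // pair-RDD operations in Spark 1.x
import org.apache.spark.rdd.RDD

def parallelStep(candidates: RDD[Array[Double]],
                 centers: Array[Array[Double]]): Array[Array[Double]] = {
  candidates
    .map { p =>
      // Assign each candidate point to its closest current center.
      val best = centers.indices.minBy { i =>
        p.zip(centers(i)).map { case (x, y) => (x - y) * (x - y) }.sum
      }
      (best, (p, 1L))
    }
    .reduceByKey { case ((s1, n1), (s2, n2)) =>
      (s1.zip(s2).map { case (x, y) => x + y }, n1 + n2)   // component-wise sum
    }
    .mapValues { case (sum, n) => sum.map(_ / n) }          // new center = mean
    .values
    .collect()
}
{code}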



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3217) Shaded Guava jar doesn't play well with Maven build

2014-08-25 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14110142#comment-14110142
 ] 

Cheng Lian commented on SPARK-3217:
---

[~vanzin] Thanks, I did set {{SPARK_PREPEND_CLASSES}}. Will change the title 
and description of this issue after verifying it.

 Shaded Guava jar doesn't play well with Maven build
 ---

 Key: SPARK-3217
 URL: https://issues.apache.org/jira/browse/SPARK-3217
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 1.2.0
Reporter: Cheng Lian
Priority: Blocker

 PR [#1813|https://github.com/apache/spark/pull/1813] shaded Guava jar file 
 and moved Guava classes to package {{org.spark-project.guava}} when Spark is 
 built by Maven. But code in {{org.apache.spark.util.Utils}} still refers to 
 classes (e.g. {{ThreadFactoryBuilder}}) in package {{com.google.common}}.
 The result is that, when Spark is built with Maven (or 
 {{make-distribution.sh}}), commands like {{bin/spark-shell}} throws 
 {{ClassNotFoundException}}:
 {code}
 # Build Spark with Maven
 $ mvn clean package -Phive,hadoop-2.3 -Dhadoop.version=2.3.0 -DskipTests
 ...
 # Then spark-shell complains
 $ ./bin/spark-shell
 Spark assembly has been built with Hive, including Datanucleus jars on 
 classpath
 Exception in thread main java.lang.NoClassDefFoundError: 
 com/google/common/util/concurrent/ThreadFactoryBuilder
 at org.apache.spark.util.Utils$.init(Utils.scala:636)
 at org.apache.spark.util.Utils$.clinit(Utils.scala)
 at org.apache.spark.repl.SparkILoop.init(SparkILoop.scala:134)
 at org.apache.spark.repl.SparkILoop.init(SparkILoop.scala:65)
 at org.apache.spark.repl.Main$.main(Main.scala:30)
 at org.apache.spark.repl.Main.main(Main.scala)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
 at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:606)
 at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:317)
 at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:73)
 at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
 Caused by: java.lang.ClassNotFoundException: 
 com.google.common.util.concurrent.ThreadFactoryBuilder
 at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
 at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
 at java.security.AccessController.doPrivileged(Native Method)
 at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
 at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
 ... 13 more
 # Check the assembly jar file
 $ jar tf 
 assembly/target/scala-2.10/spark-assembly-1.1.0-SNAPSHOT-hadoop2.3.0.jar | 
 grep -i ThreadFactoryBuilder
 org/spark-project/guava/common/util/concurrent/ThreadFactoryBuilder$1.class
 org/spark-project/guava/common/util/concurrent/ThreadFactoryBuilder.class
 {code}
 SBT build is fine since we don't shade Guava with SBT right now (and that's 
 why Jenkins didn't complain about this).
 Possible solutions can be:
 # revert PR #1813 for safe, or
 # also shade Guava in SBT build and only use {{org.spark-project.guava}} in 
 Spark



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-3213) spark_ec2.py cannot find slave instances launched with Launch More Like This

2014-08-25 Thread Vida Ha (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14110149#comment-14110149
 ] 

Vida Ha edited comment on SPARK-3213 at 8/26/14 1:49 AM:
-

Hi Joseph,

Can you tell me more about how you launched these, without copying the tags?  I 
used Launch More Like This, and the name and tags were copied over correctly. 
 I'm wondering if maybe, when you were using EC2, you could have been so unlucky 
as to have triggered a temporary outage in copying tags...

Let's sync up in person tomorrow and figure out if this was a one time problem 
or happens each time Launch More Like This is used.






was (Author: vidaha):
Hi Joseph,

Can you tell me more about how you launched these, without copying the tags?  I 
used Launch More Like This, and the name and tags were copied over correctly. 
 I'm wondering if maybe when you were using EC2, if perhaps you could have been 
so unlucky as to have trigger a temporary outage in copying tags...

Let's sync up in person tomorrow and figure out if this was a one time problem 
or happens each time Launch 





 spark_ec2.py cannot find slave instances launched with Launch More Like This
 --

 Key: SPARK-3213
 URL: https://issues.apache.org/jira/browse/SPARK-3213
 Project: Spark
  Issue Type: Bug
  Components: EC2
Affects Versions: 1.1.0
Reporter: Joseph K. Bradley
Priority: Blocker

 spark_ec2.py cannot find all slave instances.  In particular:
 * I created a master  slave and configured them.
 * I created new slave instances from the original slave (Launch More Like 
 This).
 * I tried to relaunch the cluster, and it could only find the original slave.
 Old versions of the script worked.  The latest working commit which edited 
 that .py script is: a0bcbc159e89be868ccc96175dbf1439461557e1
 There may be a problem with this PR: 
 [https://github.com/apache/spark/pull/1899].



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3213) spark_ec2.py cannot find slave instances launched with Launch More Like This

2014-08-25 Thread Vida Ha (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14110149#comment-14110149
 ] 

Vida Ha commented on SPARK-3213:


Hi Joseph,

Can you tell me more about how you launched these without copying the tags? I used "Launch More Like This", and the name and tags were copied over correctly. I'm wondering whether, when you were using EC2, you might have been unlucky enough to trigger a temporary outage in copying tags...

Let's sync up in person tomorrow and figure out whether this was a one-time problem or whether it happens each time "Launch More Like This" is used.





 spark_ec2.py cannot find slave instances launched with Launch More Like This
 --

 Key: SPARK-3213
 URL: https://issues.apache.org/jira/browse/SPARK-3213
 Project: Spark
  Issue Type: Bug
  Components: EC2
Affects Versions: 1.1.0
Reporter: Joseph K. Bradley
Priority: Blocker

 spark_ec2.py cannot find all slave instances.  In particular:
 * I created a master & slave and configured them.
 * I created new slave instances from the original slave ("Launch More Like This").
 * I tried to relaunch the cluster, and it could only find the original slave.
 Old versions of the script worked.  The latest working commit which edited 
 that .py script is: a0bcbc159e89be868ccc96175dbf1439461557e1
 There may be a problem with this PR: 
 [https://github.com/apache/spark/pull/1899].






[jira] [Updated] (SPARK-3213) spark_ec2.py cannot find slave instances launched with Launch More Like This

2014-08-25 Thread Vida Ha (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vida Ha updated SPARK-3213:
---

Attachment: Screen Shot 2014-08-25 at 6.45.35 PM.png

 spark_ec2.py cannot find slave instances launched with Launch More Like This
 --

 Key: SPARK-3213
 URL: https://issues.apache.org/jira/browse/SPARK-3213
 Project: Spark
  Issue Type: Bug
  Components: EC2
Affects Versions: 1.1.0
Reporter: Joseph K. Bradley
Priority: Blocker
 Attachments: Screen Shot 2014-08-25 at 6.45.35 PM.png


 spark_ec2.py cannot find all slave instances.  In particular:
 * I created a master & slave and configured them.
 * I created new slave instances from the original slave ("Launch More Like This").
 * I tried to relaunch the cluster, and it could only find the original slave.
 Old versions of the script worked.  The latest working commit which edited 
 that .py script is: a0bcbc159e89be868ccc96175dbf1439461557e1
 There may be a problem with this PR: 
 [https://github.com/apache/spark/pull/1899].






[jira] [Comment Edited] (SPARK-3213) spark_ec2.py cannot find slave instances launched with Launch More Like This

2014-08-25 Thread Vida Ha (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14110149#comment-14110149
 ] 

Vida Ha edited comment on SPARK-3213 at 8/26/14 1:51 AM:
-

Hi Joseph,

Can you tell me more about how you launched these without copying the tags? I used "Launch More Like This", and the name and tags were copied over correctly - see my screenshot above. I'm wondering whether, when you were using EC2, you might have been unlucky enough to trigger a temporary outage in copying tags...

Let's sync up in person tomorrow and figure out whether this was a one-time problem, whether it happens each time "Launch More Like This" is used, or whether we launched the extra slaves in different ways.






was (Author: vidaha):
Hi Joseph,

Can you tell me more about how you launched these without copying the tags? I used "Launch More Like This", and the name and tags were copied over correctly. I'm wondering whether, when you were using EC2, you might have been unlucky enough to trigger a temporary outage in copying tags...

Let's sync up in person tomorrow and figure out whether this was a one-time problem or whether it happens each time "Launch More Like This" is used.





 spark_ec2.py cannot find slave instances launched with Launch More Like This
 --

 Key: SPARK-3213
 URL: https://issues.apache.org/jira/browse/SPARK-3213
 Project: Spark
  Issue Type: Bug
  Components: EC2
Affects Versions: 1.1.0
Reporter: Joseph K. Bradley
Priority: Blocker
 Attachments: Screen Shot 2014-08-25 at 6.45.35 PM.png


 spark_ec2.py cannot find all slave instances.  In particular:
 * I created a master & slave and configured them.
 * I created new slave instances from the original slave ("Launch More Like This").
 * I tried to relaunch the cluster, and it could only find the original slave.
 Old versions of the script worked.  The latest working commit which edited 
 that .py script is: a0bcbc159e89be868ccc96175dbf1439461557e1
 There may be a problem with this PR: 
 [https://github.com/apache/spark/pull/1899].





