[jira] [Resolved] (SPARK-2252) mathjax doesn't work in HTTPS (math formulas not rendered)
[ https://issues.apache.org/jira/browse/SPARK-2252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-2252. Resolution: Fixed Fix Version/s: 1.1.0 1.0.1 mathjax doesn't work in HTTPS (math formulas not rendered) -- Key: SPARK-2252 URL: https://issues.apache.org/jira/browse/SPARK-2252 Project: Spark Issue Type: Bug Components: Documentation Affects Versions: 1.0.0 Reporter: Reynold Xin Assignee: Reynold Xin Fix For: 1.0.1, 1.1.0 -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2249) how to convert rows of schemaRdd into HashMaps
[ https://issues.apache.org/jira/browse/SPARK-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14041766#comment-14041766 ] Sean Owen commented on SPARK-2249: -- This is where issues are reported, rather than where questions are asked. I think this should be closed. Use the user@ mailing list instead. how to convert rows of schemaRdd into HashMaps -- Key: SPARK-2249 URL: https://issues.apache.org/jira/browse/SPARK-2249 Project: Spark Issue Type: Question Reporter: jackielihf spark 1.0.0 how to convert rows of schemaRdd into HashMaps using column names as keys? -- This message was sent by Atlassian JIRA (v6.2#6252)
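For reference, one way to express the conversion the reporter asks about, sketched against the Spark 1.0-era SchemaRDD API. This is a hedged example, not an official recipe: the Person class, table name, and column list below are illustrative, and the sketch assumes a spark-shell style `sc`.
{code}
import org.apache.spark.sql.SQLContext

// Sketch only: zip known column names with each row's values to get a Map per row.
case class Person(name: String, age: Int)                 // illustrative schema
val sqlContext = new SQLContext(sc)
import sqlContext.createSchemaRDD
val people = sc.parallelize(Seq(Person("alice", 30), Person("bob", 25)))
people.registerAsTable("people")                           // 1.0-era API
val schemaRdd = sqlContext.sql("SELECT name, age FROM people")
val columns = Seq("name", "age")                           // column names the caller already knows
val rowMaps = schemaRdd.map(row => columns.zipWithIndex.map { case (c, i) => c -> row(i) }.toMap)
{code}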
[jira] [Commented] (SPARK-1199) Type mismatch in Spark shell when using case class defined in shell
[ https://issues.apache.org/jira/browse/SPARK-1199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14041789#comment-14041789 ] Prashant Sharma commented on SPARK-1199: One workaround is to use the `:paste` command of the REPL for these kinds of scenarios. So if you use :paste and put the whole thing in at once, it works nicely. I am just mentioning it because I found it; we also have a slightly better fix in a GitHub PR. Type mismatch in Spark shell when using case class defined in shell --- Key: SPARK-1199 URL: https://issues.apache.org/jira/browse/SPARK-1199 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 0.9.0 Reporter: Andrew Kerr Assignee: Prashant Sharma Priority: Blocker Define a class in the shell: {code} case class TestClass(a: String) {code} and an RDD {code} val data = sc.parallelize(Seq("a")).map(TestClass(_)) {code} define a function on it and map over the RDD: {code} def itemFunc(a: TestClass): TestClass = a data.map(itemFunc) {code} Error: {code} <console>:19: error: type mismatch; found : TestClass => TestClass required: TestClass => ? data.map(itemFunc) {code} Similarly with mapPartitions: {code} def partitionFunc(a: Iterator[TestClass]): Iterator[TestClass] = a data.mapPartitions(partitionFunc) {code} {code} <console>:19: error: type mismatch; found : Iterator[TestClass] => Iterator[TestClass] required: Iterator[TestClass] => Iterator[?] Error occurred in an application involving default arguments. data.mapPartitions(partitionFunc) {code} The behavior is the same whether in local mode or on a cluster. This isn't specific to RDDs. A Scala collection in the Spark shell has the same problem. {code} scala> Seq(TestClass("foo")).map(itemFunc) <console>:15: error: type mismatch; found : TestClass => TestClass required: TestClass => ? Seq(TestClass("foo")).map(itemFunc) ^ {code} When run in the Scala console (not the Spark shell) there are no type mismatch errors. -- This message was sent by Atlassian JIRA (v6.2#6252)
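To make the workaround concrete, here is a sketch of the same reproduction entered through :paste, so the case class, the function, and the map call are compiled as a single unit:
{code}
scala> :paste
// Entering paste mode (ctrl-D to finish)

case class TestClass(a: String)
def itemFunc(a: TestClass): TestClass = a
val data = sc.parallelize(Seq("a")).map(TestClass(_))
data.map(itemFunc).collect()

// Exiting paste mode, now interpreting.
{code}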
[jira] [Commented] (SPARK-2125) Add sorting flag to ShuffleManager, and implement it in HashShuffleManager
[ https://issues.apache.org/jira/browse/SPARK-2125?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14041809#comment-14041809 ] Saisai Shao commented on SPARK-2125: Hi Matei, I have some basic implementations about this sub-task, would you mind assign this task to me? Thanks a lot. Add sorting flag to ShuffleManager, and implement it in HashShuffleManager -- Key: SPARK-2125 URL: https://issues.apache.org/jira/browse/SPARK-2125 Project: Spark Issue Type: Sub-task Components: Shuffle, Spark Core Reporter: Matei Zaharia -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2253) Disable partial aggregation automatically when reduction factor is low
[ https://issues.apache.org/jira/browse/SPARK-2253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-2253: --- Description: Once we see enough number of rows in partial aggregation and don't observe any reduction, Aggregator should just turn off partial aggregation. This reduces memory usage for high cardinality aggregations. This one is for Spark core. There is another ticket tracking this for SQL. was: Once we see enough number of rows in partial aggregation and don't observe any reduction, Aggregator should just turn off partial aggregation. This one is for Spark core. There is another ticket tracking this for SQL. Disable partial aggregation automatically when reduction factor is low -- Key: SPARK-2253 URL: https://issues.apache.org/jira/browse/SPARK-2253 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Reynold Xin Assignee: Reynold Xin Once we see enough number of rows in partial aggregation and don't observe any reduction, Aggregator should just turn off partial aggregation. This reduces memory usage for high cardinality aggregations. This one is for Spark core. There is another ticket tracking this for SQL. -- This message was sent by Atlassian JIRA (v6.2#6252)
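A minimal sketch of the heuristic being proposed, illustrative only and not the Aggregator code; the thresholds are made-up tuning values:
{code}
// Once enough rows have gone into the map-side combine, compare distinct keys to
// rows seen; if the map is barely shrinking the data, stop partial aggregation
// and pass records straight through to the shuffle.
val sampleThreshold = 100000        // assumed tuning value
val reductionThreshold = 0.9        // assumed tuning value
def shouldSkipPartialAggregation(rowsSeen: Long, distinctKeys: Long): Boolean =
  rowsSeen >= sampleThreshold && distinctKeys.toDouble / rowsSeen > reductionThreshold
{code}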
[jira] [Comment Edited] (SPARK-2125) Add sorting flag to ShuffleManager, and implement it in HashShuffleManager
[ https://issues.apache.org/jira/browse/SPARK-2125?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14041809#comment-14041809 ] Saisai Shao edited comment on SPARK-2125 at 6/24/14 7:26 AM: - Hi Matei, I have some basic implementations about this sub-task, would you mind assigning this task to me? Thanks a lot. was (Author: jerryshao): Hi Matei, I have some basic implementations about this sub-task, would you mind assign this task to me? Thanks a lot. Add sorting flag to ShuffleManager, and implement it in HashShuffleManager -- Key: SPARK-2125 URL: https://issues.apache.org/jira/browse/SPARK-2125 Project: Spark Issue Type: Sub-task Components: Shuffle, Spark Core Reporter: Matei Zaharia -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2255) create a self destructing iterator that release key value pairs from in-memory hash maps
Reynold Xin created SPARK-2255: -- Summary: create a self destructing iterator that release key value pairs from in-memory hash maps Key: SPARK-2255 URL: https://issues.apache.org/jira/browse/SPARK-2255 Project: Spark Issue Type: Improvement Components: SQL Reporter: Reynold Xin This is a small thing to do that can help out with GC pressure. For aggregations (and potentially joins), we don't really need to hold onto the key-value pairs once we have iterated over them. We can create a self-destructing iterator for AppendOnlyMap / ExternalAppendOnlyMap that removes references to each key-value pair as the iterator goes through the records, so that memory can be freed quickly. -- This message was sent by Atlassian JIRA (v6.2#6252)
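A minimal sketch of the idea, independent of the actual AppendOnlyMap/ExternalAppendOnlyMap internals: an iterator that nulls out each slot as it hands the element back, so consumed records become garbage-collectable immediately.
{code}
// Sketch only, not the real implementation.
class SelfDestructingIterator[T <: AnyRef](data: Array[T]) extends Iterator[T] {
  private var pos = 0
  override def hasNext: Boolean = pos < data.length
  override def next(): T = {
    val elem = data(pos)
    data(pos) = null.asInstanceOf[T]  // drop the reference so it can be collected early
    pos += 1
    elem
  }
}
{code}
The caller must not traverse the map again after consuming such an iterator, which is why the description calls it self-destructing.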
[jira] [Created] (SPARK-2254) ScalaRefection should mark primitive types as non-nullable.
Takuya Ueshin created SPARK-2254: Summary: ScalaRefection should mark primitive types as non-nullable. Key: SPARK-2254 URL: https://issues.apache.org/jira/browse/SPARK-2254 Project: Spark Issue Type: Improvement Components: SQL Reporter: Takuya Ueshin -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2255) create a self destructing iterator that releases records from hash maps
[ https://issues.apache.org/jira/browse/SPARK-2255?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-2255: --- Summary: create a self destructing iterator that releases records from hash maps (was: create a self destructing iterator that release key value pairs from in-memory hash maps) create a self destructing iterator that releases records from hash maps --- Key: SPARK-2255 URL: https://issues.apache.org/jira/browse/SPARK-2255 Project: Spark Issue Type: Improvement Components: SQL Reporter: Reynold Xin This is a small thing to do that can help out with GC pressure. For aggregations (and potentially joins), we don't really need to hold onto the key value pairs as soon as we have iterate over them. We can create a self destructing iterator for AppendOnlyMap / ExternalAppendOnlyMap that removes references to the key value pair as the iterator goes through records so those memory can be freed quickly. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2256) pyspark: RDD.take doesn't work ... sometimes ...
Ángel Álvarez created SPARK-2256: Summary: pyspark: RDD.take doesn't work ... sometimes ... Key: SPARK-2256 URL: https://issues.apache.org/jira/browse/SPARK-2256 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.0.0 Environment: local file/remote HDFS Reporter: Ángel Álvarez If I try to take some lines from a file, sometimes it doesn't work. Code: myfile = sc.textFile("A_ko") print myfile.take(10) Stacktrace: 14/06/24 09:29:27 INFO DAGScheduler: Failed to run take at mytest.py:19 Traceback (most recent call last): File "mytest.py", line 19, in <module> print myfile.take(10) File "spark-1.0.0-bin-hadoop2\python\pyspark\rdd.py", line 868, in take iterator = mapped._jrdd.collectPartitions(partitionsToTake)[0].iterator() File "spark-1.0.0-bin-hadoop2\python\lib\py4j-0.8.1-src.zip\py4j\java_gateway.py", line 537, in __call__ File "spark-1.0.0-bin-hadoop2\python\lib\py4j-0.8.1-src.zip\py4j\protocol.py", line 300, in get_return_value Test data: START TEST DATA A A A
[jira] [Commented] (SPARK-2248) spark.default.parallelism does not apply in local mode
[ https://issues.apache.org/jira/browse/SPARK-2248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14041838#comment-14041838 ] Guoqiang Li commented on SPARK-2248: PR: https://github.com/apache/spark/pull/1194 spark.default.parallelism does not apply in local mode -- Key: SPARK-2248 URL: https://issues.apache.org/jira/browse/SPARK-2248 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Matei Zaharia Assignee: Guoqiang Li Priority: Trivial Labels: Starter LocalBackend.defaultParallelism ignores the spark.default.parallelism property, unlike the other SchedulerBackends. We should make it take this in for consistency. -- This message was sent by Atlassian JIRA (v6.2#6252)
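The change being proposed (see the PR above for the actual diff) amounts to making local mode read the property the same way the other scheduler backends do. A hedged sketch of the intended behavior, not the merged code:
{code}
import org.apache.spark.SparkConf

// Respect spark.default.parallelism in local mode, falling back to the
// number of local cores when the property is not set.
def localDefaultParallelism(conf: SparkConf, totalCores: Int): Int =
  conf.getInt("spark.default.parallelism", totalCores)
{code}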
[jira] [Updated] (SPARK-2125) Add sorting flag to ShuffleManager, and implement it in HashShuffleManager
[ https://issues.apache.org/jira/browse/SPARK-2125?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-2125: --- Assignee: Saisai Shao Add sorting flag to ShuffleManager, and implement it in HashShuffleManager -- Key: SPARK-2125 URL: https://issues.apache.org/jira/browse/SPARK-2125 Project: Spark Issue Type: Sub-task Components: Shuffle, Spark Core Reporter: Matei Zaharia Assignee: Saisai Shao -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2254) ScalaRefection should mark primitive types as non-nullable.
[ https://issues.apache.org/jira/browse/SPARK-2254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14041837#comment-14041837 ] Takuya Ueshin commented on SPARK-2254: -- PRed: https://github.com/apache/spark/pull/1193 ScalaRefection should mark primitive types as non-nullable. --- Key: SPARK-2254 URL: https://issues.apache.org/jira/browse/SPARK-2254 Project: Spark Issue Type: Improvement Components: SQL Reporter: Takuya Ueshin -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Closed] (SPARK-2249) how to convert rows of schemaRdd into HashMaps
[ https://issues.apache.org/jira/browse/SPARK-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin closed SPARK-2249. -- Resolution: Not a Problem how to convert rows of schemaRdd into HashMaps -- Key: SPARK-2249 URL: https://issues.apache.org/jira/browse/SPARK-2249 Project: Spark Issue Type: Question Reporter: jackielihf spark 1.0.0 how to convert rows of schemaRdd into HashMaps using column names as keys? -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2244) pyspark - RDD action hangs (after previously succeeding)
[ https://issues.apache.org/jira/browse/SPARK-2244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-2244: --- Description: {code} $ ./dist/bin/pyspark Python 2.7.5 (default, Feb 19 2014, 13:47:28) [GCC 4.8.2 20131212 (Red Hat 4.8.2-7)] on linux2 Type help, copyright, credits or license for more information. Welcome to __ / __/__ ___ _/ /__ _\ \/ _ \/ _ `/ __/ '_/ /__ / .__/\_,_/_/ /_/\_\ version 1.0.0-SNAPSHOT /_/ Using Python version 2.7.5 (default, Feb 19 2014 13:47:28) SparkContext available as sc. hundy = sc.parallelize(range(100)) hundy.count() 100 hundy.count() 100 hundy.count() 100 [repeat until hang, ctrl-C to get] hundy.count() ^CTraceback (most recent call last): File stdin, line 1, in module File /home/matt/Documents/Repositories/spark/dist/python/pyspark/rdd.py, line 774, in count return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum() File /home/matt/Documents/Repositories/spark/dist/python/pyspark/rdd.py, line 765, in sum return self.mapPartitions(lambda x: [sum(x)]).reduce(operator.add) File /home/matt/Documents/Repositories/spark/dist/python/pyspark/rdd.py, line 685, in reduce vals = self.mapPartitions(func).collect() File /home/matt/Documents/Repositories/spark/dist/python/pyspark/rdd.py, line 649, in collect bytesInJava = self._jrdd.collect().iterator() File /home/matt/Documents/Repositories/spark/dist/python/lib/py4j-0.8.1-src.zip/py4j/java_gateway.py, line 535, in __call__ File /home/matt/Documents/Repositories/spark/dist/python/lib/py4j-0.8.1-src.zip/py4j/java_gateway.py, line 363, in send_command File /home/matt/Documents/Repositories/spark/dist/python/lib/py4j-0.8.1-src.zip/py4j/java_gateway.py, line 472, in send_command File /usr/lib64/python2.7/socket.py, line 430, in readline data = recv(1) KeyboardInterrupt {code} was: $ ./dist/bin/pyspark Python 2.7.5 (default, Feb 19 2014, 13:47:28) [GCC 4.8.2 20131212 (Red Hat 4.8.2-7)] on linux2 Type help, copyright, credits or license for more information. Welcome to __ / __/__ ___ _/ /__ _\ \/ _ \/ _ `/ __/ '_/ /__ / .__/\_,_/_/ /_/\_\ version 1.0.0-SNAPSHOT /_/ Using Python version 2.7.5 (default, Feb 19 2014 13:47:28) SparkContext available as sc. 
hundy = sc.parallelize(range(100)) hundy.count() 100 hundy.count() 100 hundy.count() 100 [repeat until hang, ctrl-C to get] hundy.count() ^CTraceback (most recent call last): File stdin, line 1, in module File /home/matt/Documents/Repositories/spark/dist/python/pyspark/rdd.py, line 774, in count return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum() File /home/matt/Documents/Repositories/spark/dist/python/pyspark/rdd.py, line 765, in sum return self.mapPartitions(lambda x: [sum(x)]).reduce(operator.add) File /home/matt/Documents/Repositories/spark/dist/python/pyspark/rdd.py, line 685, in reduce vals = self.mapPartitions(func).collect() File /home/matt/Documents/Repositories/spark/dist/python/pyspark/rdd.py, line 649, in collect bytesInJava = self._jrdd.collect().iterator() File /home/matt/Documents/Repositories/spark/dist/python/lib/py4j-0.8.1-src.zip/py4j/java_gateway.py, line 535, in __call__ File /home/matt/Documents/Repositories/spark/dist/python/lib/py4j-0.8.1-src.zip/py4j/java_gateway.py, line 363, in send_command File /home/matt/Documents/Repositories/spark/dist/python/lib/py4j-0.8.1-src.zip/py4j/java_gateway.py, line 472, in send_command File /usr/lib64/python2.7/socket.py, line 430, in readline data = recv(1) KeyboardInterrupt pyspark - RDD action hangs (after previously succeeding) Key: SPARK-2244 URL: https://issues.apache.org/jira/browse/SPARK-2244 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.1.0 Environment: system: fedora 20 w/ maven 3.1.1 and openjdk 1.7.0_55 code: sha b88238fa (master on 23 june 2014) cluster: make-distribution.sh followed by ./dist/sbin/start-all.sh (running locally) Reporter: Matthew Farrellee Labels: openjdk, pyspark, python, shell, spark {code} $ ./dist/bin/pyspark Python 2.7.5 (default, Feb 19 2014, 13:47:28) [GCC 4.8.2 20131212 (Red Hat 4.8.2-7)] on linux2 Type help, copyright, credits or license for more information. Welcome to __ / __/__ ___ _/ /__ _\ \/ _ \/ _ `/ __/ '_/ /__ / .__/\_,_/_/ /_/\_\ version 1.0.0-SNAPSHOT /_/ Using Python version 2.7.5 (default, Feb 19 2014 13:47:28) SparkContext available as sc. hundy = sc.parallelize(range(100)) hundy.count() 100 hundy.count() 100 hundy.count() 100 [repeat until hang, ctrl-C to get] hundy.count() ^CTraceback (most recent
[jira] [Commented] (SPARK-2247) data frames based on RDD
[ https://issues.apache.org/jira/browse/SPARK-2247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14041841#comment-14041841 ] Reynold Xin commented on SPARK-2247: This would be a great feature to build out on Spark SQL, actually. data frames based on RDD - Key: SPARK-2247 URL: https://issues.apache.org/jira/browse/SPARK-2247 Project: Spark Issue Type: New Feature Components: PySpark, Spark Core, SQL Affects Versions: 1.0.0 Reporter: venu k tangirala Labels: features It would be nice to have R or Python pandas-like data frames on Spark. 1) To be able to access the RDD data frame from Python with pandas 2) To be able to access the RDD data frame from R 3) To be able to access the RDD data frame from Scala's Saddle -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2247) data frames based on RDD
[ https://issues.apache.org/jira/browse/SPARK-2247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-2247: --- Target Version/s: (was: 1.0.0) data frames based on RDD - Key: SPARK-2247 URL: https://issues.apache.org/jira/browse/SPARK-2247 Project: Spark Issue Type: New Feature Components: PySpark, Spark Core, SQL Affects Versions: 1.0.0 Reporter: venu k tangirala Labels: features It would be nice to have R or Python pandas-like data frames on Spark. 1) To be able to access the RDD data frame from Python with pandas 2) To be able to access the RDD data frame from R 3) To be able to access the RDD data frame from Scala's Saddle -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2242) Running sc.parallelize(..).count() hangs pyspark
[ https://issues.apache.org/jira/browse/SPARK-2242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-2242: --- Assignee: Andrew Or Running sc.parallelize(..).count() hangs pyspark Key: SPARK-2242 URL: https://issues.apache.org/jira/browse/SPARK-2242 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.1.0 Reporter: Xiangrui Meng Assignee: Andrew Or Priority: Blocker Running the following code hangs pyspark in a shell: {code} sc.parallelize(range(100), 100).count() {code} It happens in the master branch, but not branch-1.0. And it seems that it only happens in a pyspark shell. [~andrewor14] helped confirm this bug. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2243) Using several Spark Contexts
[ https://issues.apache.org/jira/browse/SPARK-2243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14041844#comment-14041844 ] Reynold Xin commented on SPARK-2243: I don't think we currently support multiple SparkContext objects in the same JVM process. There are numerous assumptions in the code base that uses a a shared cache or thread local variables or some global identifiers which prevent us from using multiple SparkContext's. However, there is nothing fundamental here. We do want to support multiple SparkContext in the long run. Using several Spark Contexts Key: SPARK-2243 URL: https://issues.apache.org/jira/browse/SPARK-2243 Project: Spark Issue Type: Bug Components: Block Manager, Spark Core Affects Versions: 1.0.0 Reporter: Miguel Angel Fernandez Diaz We're developing a platform where we create several Spark contexts for carrying out different calculations. Is there any restriction when using several Spark contexts? We have two contexts, one for Spark calculations and another one for Spark Streaming jobs. The next error arises when we first execute a Spark calculation and, once the execution is finished, a Spark Streaming job is launched: {code} 14/06/23 16:40:08 ERROR executor.Executor: Exception in task ID 0 java.io.FileNotFoundException: http://172.19.0.215:47530/broadcast_0 at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1624) at org.apache.spark.broadcast.HttpBroadcast$.read(HttpBroadcast.scala:156) at org.apache.spark.broadcast.HttpBroadcast.readObject(HttpBroadcast.scala:56) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370) at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:40) at org.apache.spark.scheduler.ResultTask$.deserializeInfo(ResultTask.scala:63) at org.apache.spark.scheduler.ResultTask.readExternal(ResultTask.scala:139) at java.io.ObjectInputStream.readExternalData(ObjectInputStream.java:1837) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1796) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370) at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:40) at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:62) at org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:193) at org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:45) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:176) at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) 14/06/23 16:40:08 WARN scheduler.TaskSetManager: Lost TID 0 (task 0.0:0) 14/06/23 16:40:08 WARN scheduler.TaskSetManager: Loss was due to java.io.FileNotFoundException java.io.FileNotFoundException: http://172.19.0.215:47530/broadcast_0 at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1624) at org.apache.spark.broadcast.HttpBroadcast$.read(HttpBroadcast.scala:156) at org.apache.spark.broadcast.HttpBroadcast.readObject(HttpBroadcast.scala:56) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at
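Until multiple live contexts are supported, the pattern consistent with the comment above is to run the two workloads sequentially in one JVM, stopping the first context before creating the second. A hedged sketch (app names and master are placeholders):
{code}
import org.apache.spark.{SparkConf, SparkContext}

// Only one live SparkContext per JVM: stop the batch context before the next one.
val batchSc = new SparkContext(new SparkConf().setAppName("batch").setMaster("local[2]"))
// ... run the Spark calculation ...
batchSc.stop()
val secondSc = new SparkContext(new SparkConf().setAppName("streaming").setMaster("local[2]"))
// ... set up the Spark Streaming job on top of secondSc ...
{code}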
[jira] [Updated] (SPARK-2244) pyspark - RDD action hangs (after previously succeeding)
[ https://issues.apache.org/jira/browse/SPARK-2244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matthew Farrellee updated SPARK-2244: - Environment: system: fedora 20 w/ maven 3.1.1 and openjdk 1.7.0_55 1.8.0_05 code: sha b88238fa (master on 23 june 2014) cluster: make-distribution.sh followed by ./dist/sbin/start-all.sh (running locally) was: system: fedora 20 w/ maven 3.1.1 and openjdk 1.7.0_55 code: sha b88238fa (master on 23 june 2014) cluster: make-distribution.sh followed by ./dist/sbin/start-all.sh (running locally) pyspark - RDD action hangs (after previously succeeding) Key: SPARK-2244 URL: https://issues.apache.org/jira/browse/SPARK-2244 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.1.0 Environment: system: fedora 20 w/ maven 3.1.1 and openjdk 1.7.0_55 1.8.0_05 code: sha b88238fa (master on 23 june 2014) cluster: make-distribution.sh followed by ./dist/sbin/start-all.sh (running locally) Reporter: Matthew Farrellee Labels: openjdk, pyspark, python, shell, spark {code} $ ./dist/bin/pyspark Python 2.7.5 (default, Feb 19 2014, 13:47:28) [GCC 4.8.2 20131212 (Red Hat 4.8.2-7)] on linux2 Type help, copyright, credits or license for more information. Welcome to __ / __/__ ___ _/ /__ _\ \/ _ \/ _ `/ __/ '_/ /__ / .__/\_,_/_/ /_/\_\ version 1.0.0-SNAPSHOT /_/ Using Python version 2.7.5 (default, Feb 19 2014 13:47:28) SparkContext available as sc. hundy = sc.parallelize(range(100)) hundy.count() 100 hundy.count() 100 hundy.count() 100 [repeat until hang, ctrl-C to get] hundy.count() ^CTraceback (most recent call last): File stdin, line 1, in module File /home/matt/Documents/Repositories/spark/dist/python/pyspark/rdd.py, line 774, in count return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum() File /home/matt/Documents/Repositories/spark/dist/python/pyspark/rdd.py, line 765, in sum return self.mapPartitions(lambda x: [sum(x)]).reduce(operator.add) File /home/matt/Documents/Repositories/spark/dist/python/pyspark/rdd.py, line 685, in reduce vals = self.mapPartitions(func).collect() File /home/matt/Documents/Repositories/spark/dist/python/pyspark/rdd.py, line 649, in collect bytesInJava = self._jrdd.collect().iterator() File /home/matt/Documents/Repositories/spark/dist/python/lib/py4j-0.8.1-src.zip/py4j/java_gateway.py, line 535, in __call__ File /home/matt/Documents/Repositories/spark/dist/python/lib/py4j-0.8.1-src.zip/py4j/java_gateway.py, line 363, in send_command File /home/matt/Documents/Repositories/spark/dist/python/lib/py4j-0.8.1-src.zip/py4j/java_gateway.py, line 472, in send_command File /usr/lib64/python2.7/socket.py, line 430, in readline data = recv(1) KeyboardInterrupt {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2257) The algorithm of ALS in mlib lacks a parameter
zhengbing li created SPARK-2257: --- Summary: The algorithm of ALS in mlib lacks a parameter Key: SPARK-2257 URL: https://issues.apache.org/jira/browse/SPARK-2257 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.0.0 Environment: spark 1.0 Reporter: zhengbing li Fix For: 1.1.0 When I test the ALS algorithm using the Netflix data, I find I cannot get the accurate results declared by the paper. The best MSE value is 0.9066300038109709 (RMSE 0.952), which is worse than the paper's result. If I increase the number of features or the number of iterations, I get a worse result. After studying the paper and source code, I found a bug in the updateBlock function of ALS. The original code is: while (i < rank) { // --- fullXtX.data(i * rank + i) += lambda i += 1 } The code doesn't consider the number of products that one user rates. So this code should be modified: while (i < rank) { // ratingsNum(index) equals the number of products that a user rates fullXtX.data(i * rank + i) += lambda * ratingsNum(index) i += 1 } After I modify the code, the MSE value has been improved; this is one test result. Conditions: val numIterations = 20 val features = 30 val model = ALS.train(trainRatings, features, numIterations, 0.06) Result of the modified version: MSE: Double = 0.8472313396478773 RMSE: 0.92045 Results of version 1.0: MSE: Double = 1.2680743123043832 RMSE: 1.1261 In order to add the vector ratingsNum, I want to change the InLinkBlock structure as follows: private[recommendation] case class InLinkBlock(elementIds: Array[Int], ratingsNum: Array[Int], ratingsForBlock: Array[Array[(Array[Int], Array[Double])]]) So I could calculate the vector ratingsNum in the makeInLinkBlock function. This is the code I add in makeInLinkBlock: ... // added val ratingsNum = new Array[Int](numUsers) ratings.map(r => ratingsNum(userIdToPos(r.user)) += 1) // end of added InLinkBlock(userIds, ratingsNum, ratingsForBlock) Is this solution reasonable?? -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2257) The algorithm of ALS in mlib lacks a parameter
[ https://issues.apache.org/jira/browse/SPARK-2257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengbing li updated SPARK-2257: Description: When I test ALS algorithm using netflix data, I find I cannot get the acurate results declared by the paper. The best MSE value is 0.9066300038109709(RMSE 0.952), which is worse than the paper's result. If I increase the number of features or the number of iterations, I will get a worse result. After I studing the paper and source code, I find a bug in the updateBlock function of ALS. orgin code is: while (i rank) { // --- fullXtX.data(i * rank + i) += lambda i += 1 } The code doesn't consider the number of products that one user rates. So this code should be modified: while (i rank) { //ratingsNum(index) equals the number of products that a user rates fullXtX.data(i * rank + i) += lambda * ratingsNum(index) i += 1 } After I modify code, the MSE value has been decreased, this is one test result conditions: val numIterations =20 val features = 30 val model = ALS.train(trainRatings,features, numIterations, 0.06) result of modified version: MSE: Double = 0.8472313396478773 RMSE: 0.92045 results of version of 1.0 MSE: Double = 1.2680743123043832 RMSE: 1.1261 In order to add the vector ratingsNum, I want to change the InLinkBlock structure as follows: private[recommendation] case class InLinkBlock(elementIds: Array[Int], ratingsNum:Array[Int], ratingsForBlock: Array[Array[(Array[Int], Array[Double])]]) So I could calculte the vector ratingsNum in the function of makeInLinkBlock. This is the code I add in the makeInLinkBlock: ... //added val ratingsNum = new Array[Int](numUsers) ratings.map(r = ratingsNum(userIdToPos(r.user)) += 1) //end of added InLinkBlock(userIds, ratingsNum, ratingsForBlock) Is this solution reasonable?? was: When I test ALS algorithm using netflix data, I find I cannot get the acurate results declared by the paper. The best MSE value is 0.9066300038109709(RMSE 0.952), which is worse than the paper's result. If I increase the number of features or the number of iterations, I will get a worse result. After I studing the paper and source code, I find a bug in the updateBlock function of ALS. orgin code is: while (i rank) { // --- fullXtX.data(i * rank + i) += lambda i += 1 } The code doesn't consider the number of products that one user rates. So this code should be modified: while (i rank) { //ratingsNum(index) equals the number of products that a user rates fullXtX.data(i * rank + i) += lambda * ratingsNum(index) i += 1 } After I modify code, the MSE value has been improved, this is one test result conditions: val numIterations =20 val features = 30 val model = ALS.train(trainRatings,features, numIterations, 0.06) result of modified version: MSE: Double = 0.8472313396478773 RMSE: 0.92045 results of version of 1.0 MSE: Double = 1.2680743123043832 RMSE: 1.1261 In order to add the vector ratingsNum, I want to change the InLinkBlock structure as follows: private[recommendation] case class InLinkBlock(elementIds: Array[Int], ratingsNum:Array[Int], ratingsForBlock: Array[Array[(Array[Int], Array[Double])]]) So I could calculte the vector ratingsNum in the function of makeInLinkBlock. This is the code I add in the makeInLinkBlock: ... //added val ratingsNum = new Array[Int](numUsers) ratings.map(r = ratingsNum(userIdToPos(r.user)) += 1) //end of added InLinkBlock(userIds, ratingsNum, ratingsForBlock) Is this solution reasonable?? 
The algorithm of ALS in mlib lacks a parameter --- Key: SPARK-2257 URL: https://issues.apache.org/jira/browse/SPARK-2257 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.0.0 Environment: spark 1.0 Reporter: zhengbing li Labels: patch Fix For: 1.1.0 Original Estimate: 336h Remaining Estimate: 336h When I test ALS algorithm using netflix data, I find I cannot get the acurate results declared by the paper. The best MSE value is 0.9066300038109709(RMSE 0.952), which is worse than the paper's result. If I increase the number of features or the number of iterations, I will get a worse result. After I studing the paper and source code, I find a bug in the updateBlock function of ALS. orgin code is: while (i rank) { // --- fullXtX.data(i * rank + i) += lambda i += 1 } The code doesn't consider the number of products that one user rates. So this code should be modified: while (i rank) { //ratingsNum(index) equals the number of products that a user rates
[jira] [Commented] (SPARK-2257) The algorithm of ALS in mlib lacks a parameter
[ https://issues.apache.org/jira/browse/SPARK-2257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14042025#comment-14042025 ] Sean Owen commented on SPARK-2257: -- I don't think this is a bug, in the sense that it is just a different formulation of ALS. It's in the ALS-WR paper, but not the more well-known Hu/Koren/Volinsky paper. This is weighted regularization and it does help in some cases. In fact, it's already implemented in MLlib, although went in just after 1.0.0: https://github.com/apache/spark/commit/a6e0afdcf0174425e8a6ff20b2bc2e3a7a374f19#diff-2b593e0b4bd6eddab37f04968baa826c I think this is therefore already implemented. The algorithm of ALS in mlib lacks a parameter --- Key: SPARK-2257 URL: https://issues.apache.org/jira/browse/SPARK-2257 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.0.0 Environment: spark 1.0 Reporter: zhengbing li Labels: patch Fix For: 1.1.0 Original Estimate: 336h Remaining Estimate: 336h When I test ALS algorithm using netflix data, I find I cannot get the acurate results declared by the paper. The best MSE value is 0.9066300038109709(RMSE 0.952), which is worse than the paper's result. If I increase the number of features or the number of iterations, I will get a worse result. After I studing the paper and source code, I find a bug in the updateBlock function of ALS. orgin code is: while (i rank) { // --- fullXtX.data(i * rank + i) += lambda i += 1 } The code doesn't consider the number of products that one user rates. So this code should be modified: while (i rank) { //ratingsNum(index) equals the number of products that a user rates fullXtX.data(i * rank + i) += lambda * ratingsNum(index) i += 1 } After I modify code, the MSE value has been decreased, this is one test result conditions: val numIterations =20 val features = 30 val model = ALS.train(trainRatings,features, numIterations, 0.06) result of modified version: MSE: Double = 0.8472313396478773 RMSE: 0.92045 results of version of 1.0 MSE: Double = 1.2680743123043832 RMSE: 1.1261 In order to add the vector ratingsNum, I want to change the InLinkBlock structure as follows: private[recommendation] case class InLinkBlock(elementIds: Array[Int], ratingsNum:Array[Int], ratingsForBlock: Array[Array[(Array[Int], Array[Double])]]) So I could calculte the vector ratingsNum in the function of makeInLinkBlock. This is the code I add in the makeInLinkBlock: ... //added val ratingsNum = new Array[Int](numUsers) ratings.map(r = ratingsNum(userIdToPos(r.user)) += 1) //end of added InLinkBlock(userIds, ratingsNum, ratingsForBlock) Is this solution reasonable?? -- This message was sent by Atlassian JIRA (v6.2#6252)
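For clarity, the weighted-lambda regularization Sean refers to (ALS-WR) scales the regularization term on the normal-equation diagonal by the number of ratings each user or product has. The sketch below restates the loop from the issue description with the comparison operators that were lost in the issue text restored; it is illustrative only, not the MLlib implementation:
{code}
// ALS-WR style: the diagonal of X^T X gets lambda scaled by the rating count,
// instead of a flat lambda.
def addWeightedRegularization(fullXtX: Array[Double], rank: Int,
                              lambda: Double, numRatings: Int): Unit = {
  var i = 0
  while (i < rank) {
    fullXtX(i * rank + i) += lambda * numRatings
    i += 1
  }
}
{code}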
[jira] [Closed] (SPARK-2257) The algorithm of ALS in mlib lacks a parameter
[ https://issues.apache.org/jira/browse/SPARK-2257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengbing li closed SPARK-2257. --- Resolution: Fixed Fix Version/s: 1.0.1 The algorithm of ALS in mlib lacks a parameter --- Key: SPARK-2257 URL: https://issues.apache.org/jira/browse/SPARK-2257 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.0.0 Environment: spark 1.0 Reporter: zhengbing li Labels: patch Fix For: 1.0.1, 1.1.0 Original Estimate: 336h Remaining Estimate: 336h When I test ALS algorithm using netflix data, I find I cannot get the acurate results declared by the paper. The best MSE value is 0.9066300038109709(RMSE 0.952), which is worse than the paper's result. If I increase the number of features or the number of iterations, I will get a worse result. After I studing the paper and source code, I find a bug in the updateBlock function of ALS. orgin code is: while (i rank) { // --- fullXtX.data(i * rank + i) += lambda i += 1 } The code doesn't consider the number of products that one user rates. So this code should be modified: while (i rank) { //ratingsNum(index) equals the number of products that a user rates fullXtX.data(i * rank + i) += lambda * ratingsNum(index) i += 1 } After I modify code, the MSE value has been decreased, this is one test result conditions: val numIterations =20 val features = 30 val model = ALS.train(trainRatings,features, numIterations, 0.06) result of modified version: MSE: Double = 0.8472313396478773 RMSE: 0.92045 results of version of 1.0 MSE: Double = 1.2680743123043832 RMSE: 1.1261 In order to add the vector ratingsNum, I want to change the InLinkBlock structure as follows: private[recommendation] case class InLinkBlock(elementIds: Array[Int], ratingsNum:Array[Int], ratingsForBlock: Array[Array[(Array[Int], Array[Double])]]) So I could calculte the vector ratingsNum in the function of makeInLinkBlock. This is the code I add in the makeInLinkBlock: ... //added val ratingsNum = new Array[Int](numUsers) ratings.map(r = ratingsNum(userIdToPos(r.user)) += 1) //end of added InLinkBlock(userIds, ratingsNum, ratingsForBlock) Is this solution reasonable?? -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2258) Worker UI displays zombie executors
Andrew Or created SPARK-2258: Summary: Worker UI displays zombie executors Key: SPARK-2258 URL: https://issues.apache.org/jira/browse/SPARK-2258 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.0 Reporter: Andrew Or See attached. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2258) Worker UI displays zombie executors
[ https://issues.apache.org/jira/browse/SPARK-2258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-2258: - Attachment: Screen Shot 2014-06-24 at 9.23.18 AM.png Worker UI displays zombie executors --- Key: SPARK-2258 URL: https://issues.apache.org/jira/browse/SPARK-2258 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.0 Reporter: Andrew Or Attachments: Screen Shot 2014-06-24 at 9.23.18 AM.png See attached. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1556) jets3t dep doesn't update properly with newer Hadoop versions
[ https://issues.apache.org/jira/browse/SPARK-1556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14042332#comment-14042332 ] David Rosenstrauch commented on SPARK-1556: --- Thanks again for the tip. I didn't seem to have the /usr/lib/spark/assembly/lib directory in my installation, but adding the symlink in /usr/lib/shark/lib did the trick as well. jets3t dep doesn't update properly with newer Hadoop versions - Key: SPARK-1556 URL: https://issues.apache.org/jira/browse/SPARK-1556 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 0.8.1, 0.9.0, 1.0.0 Reporter: Nan Zhu Assignee: Sean Owen Priority: Blocker Fix For: 1.0.0 In Hadoop 2.2.x or newer, Jet3st 0.9.0 which defines S3ServiceException/ServiceException is introduced, however, Spark still relies on Jet3st 0.7.x which has no definition of these classes What I met is that [code] 14/04/21 19:30:53 INFO deprecation: mapred.job.id is deprecated. Instead, use mapreduce.job.id 14/04/21 19:30:53 INFO deprecation: mapred.tip.id is deprecated. Instead, use mapreduce.task.id 14/04/21 19:30:53 INFO deprecation: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id 14/04/21 19:30:53 INFO deprecation: mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap 14/04/21 19:30:53 INFO deprecation: mapred.task.partition is deprecated. Instead, use mapreduce.task.partition java.lang.NoClassDefFoundError: org/jets3t/service/S3ServiceException at org.apache.hadoop.fs.s3native.NativeS3FileSystem.createDefaultStore(NativeS3FileSystem.java:280) at org.apache.hadoop.fs.s3native.NativeS3FileSystem.initialize(NativeS3FileSystem.java:270) at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2316) at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:90) at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2350) at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2332) at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:369) at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296) at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:221) at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:270) at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:140) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:207) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:205) at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:207) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:205) at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:207) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:205) at org.apache.spark.SparkContext.runJob(SparkContext.scala:891) at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopDataset(PairRDDFunctions.scala:741) at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:692) at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:574) at org.apache.spark.rdd.RDD.saveAsTextFile(RDD.scala:900) at 
$iwC$$iwC$$iwC$$iwC.init(console:15) at $iwC$$iwC$$iwC.init(console:20) at $iwC$$iwC.init(console:22) at $iwC.init(console:24) at init(console:26) at .init(console:30) at .clinit(console) at .init(console:7) at .clinit(console) at $print(console) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:772) at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1040) at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:609) at
[jira] [Created] (SPARK-2259) Spark submit documentation for --deploy-mode is highly misleading
Andrew Or created SPARK-2259: Summary: Spark submit documentation for --deploy-mode is highly misleading Key: SPARK-2259 URL: https://issues.apache.org/jira/browse/SPARK-2259 Project: Spark Issue Type: Bug Components: Documentation Affects Versions: 1.1.0 Reporter: Andrew Or Priority: Critical There are a few issues: 1. Client mode does not necessarily mean the driver program must be launched outside of the cluster. 2. For standalone clusters, only client mode is currently supported. This was the case even before 1.0. Currently, the docs tell the user to use cluster deploy mode when deploying your driver program within the cluster, which is true also for standalone-client mode. In short, the docs encourage the user to use standalone-cluster, an unsupported mode. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2204) Scheduler for Mesos in fine-grained mode launches tasks on wrong executors
[ https://issues.apache.org/jira/browse/SPARK-2204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-2204: --- Target Version/s: 1.0.1, 1.1.0 Scheduler for Mesos in fine-grained mode launches tasks on wrong executors -- Key: SPARK-2204 URL: https://issues.apache.org/jira/browse/SPARK-2204 Project: Spark Issue Type: Bug Components: Mesos Affects Versions: 1.0.0 Reporter: Sebastien Rainville Priority: Blocker MesosSchedulerBackend.resourceOffers(SchedulerDriver, List[Offer]) is assuming that TaskSchedulerImpl.resourceOffers(Seq[WorkerOffer]) is returning task lists in the same order as the offers it was passed, but in the current implementation TaskSchedulerImpl.resourceOffers shuffles the offers to avoid assigning the tasks always to the same executors. The result is that the tasks are launched on the wrong executors. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2260) Spark submit standalone-cluster mode is broken
Andrew Or created SPARK-2260: Summary: Spark submit standalone-cluster mode is broken Key: SPARK-2260 URL: https://issues.apache.org/jira/browse/SPARK-2260 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.0 Reporter: Andrew Or Well, it is technically not officially supported... but we should still fix it. In particular, important configs such as spark.master and the application jar are not propagated to the worker nodes properly, due to obvious missing pieces in the code. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1937) Tasks can be submitted before executors are registered
[ https://issues.apache.org/jira/browse/SPARK-1937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-1937: - Assignee: Rui Li Tasks can be submitted before executors are registered -- Key: SPARK-1937 URL: https://issues.apache.org/jira/browse/SPARK-1937 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.0 Reporter: Rui Li Assignee: Rui Li Fix For: 1.1.0 Attachments: After-patch.PNG, Before-patch.png, RSBTest.scala During construction, TaskSetManager will assign tasks to several pending lists according to the tasks’ preferred locations. If the desired location is unavailable, it’ll then assign this task to “pendingTasksWithNoPrefs”, a list containing tasks without preferred locations. The problem is that tasks may be submitted before the executors get registered with the driver, in which case TaskSetManager will assign all the tasks to pendingTasksWithNoPrefs. Later when it looks for a task to schedule, it will pick one from this list and assign it to arbitrary executor, since TaskSetManager considers the tasks can run equally well on any node. This problem deprives benefits of data locality, drags the whole job slow and can cause imbalance between executors. I ran into this issue when running a spark program on a 7-node cluster (node6~node12). The program processes 100GB data. Since the data is uploaded to HDFS from node6, this node has a complete copy of the data and as a result, node6 finishes tasks much faster, which in turn makes it complete dis-proportionally more tasks than other nodes. To solve this issue, I think we shouldn't check availability of executors/hosts when constructing TaskSetManager. If a task prefers a node, we simply add the task to that node’s pending list. When later on the node is added, TaskSetManager can schedule the task according to proper locality level. If unfortunately the preferred node(s) never gets added, TaskSetManager can still schedule the task at locality level “ANY”. -- This message was sent by Atlassian JIRA (v6.2#6252)
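A sketch of the queuing change the reporter proposes (illustrative, not the TaskSetManager code): queue a task under its preferred host even if that host has no registered executor yet, so locality can still be honored once the executor arrives.
{code}
import scala.collection.mutable.{ArrayBuffer, HashMap}

// Do not check executor/host availability at construction time; tasks with
// preferences always go into their hosts' pending lists.
val pendingTasksForHost = new HashMap[String, ArrayBuffer[Int]]
val pendingTasksWithNoPrefs = new ArrayBuffer[Int]

def addPendingTask(taskIndex: Int, preferredHosts: Seq[String]): Unit = {
  if (preferredHosts.isEmpty) {
    pendingTasksWithNoPrefs += taskIndex
  } else {
    preferredHosts.foreach { host =>
      pendingTasksForHost.getOrElseUpdate(host, new ArrayBuffer[Int]) += taskIndex
    }
  }
}
{code}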
[jira] [Resolved] (SPARK-1937) Tasks can be submitted before executors are registered
[ https://issues.apache.org/jira/browse/SPARK-1937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia resolved SPARK-1937. -- Resolution: Fixed Fix Version/s: 1.1.0 Target Version/s: 1.1.0 Tasks can be submitted before executors are registered -- Key: SPARK-1937 URL: https://issues.apache.org/jira/browse/SPARK-1937 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.0 Reporter: Rui Li Assignee: Rui Li Fix For: 1.1.0 Attachments: After-patch.PNG, Before-patch.png, RSBTest.scala During construction, TaskSetManager will assign tasks to several pending lists according to the tasks’ preferred locations. If the desired location is unavailable, it’ll then assign this task to “pendingTasksWithNoPrefs”, a list containing tasks without preferred locations. The problem is that tasks may be submitted before the executors get registered with the driver, in which case TaskSetManager will assign all the tasks to pendingTasksWithNoPrefs. Later when it looks for a task to schedule, it will pick one from this list and assign it to arbitrary executor, since TaskSetManager considers the tasks can run equally well on any node. This problem deprives benefits of data locality, drags the whole job slow and can cause imbalance between executors. I ran into this issue when running a spark program on a 7-node cluster (node6~node12). The program processes 100GB data. Since the data is uploaded to HDFS from node6, this node has a complete copy of the data and as a result, node6 finishes tasks much faster, which in turn makes it complete dis-proportionally more tasks than other nodes. To solve this issue, I think we shouldn't check availability of executors/hosts when constructing TaskSetManager. If a task prefers a node, we simply add the task to that node’s pending list. When later on the node is added, TaskSetManager can schedule the task according to proper locality level. If unfortunately the preferred node(s) never gets added, TaskSetManager can still schedule the task at locality level “ANY”. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1911) Warn users that jars should be built with Java 6 for YARN
[ https://issues.apache.org/jira/browse/SPARK-1911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-1911: - Summary: Warn users that jars should be built with Java 6 for YARN (was: Warn users that jars should be built with Java 6 for PySpark to work on YARN) Warn users that jars should be built with Java 6 for YARN - Key: SPARK-1911 URL: https://issues.apache.org/jira/browse/SPARK-1911 Project: Spark Issue Type: Sub-task Components: Documentation Reporter: Andrew Or Fix For: 1.0.0 Python sometimes fails to read jars created by Java 7. This is necessary for PySpark to work in YARN, and so Spark assembly JAR should compiled in Java 6 for PySpark to work on YARN. Currently we warn users only in make-distribution.sh, but most users build the jars directly. We should emphasize it in the docs especially for PySpark and YARN because this issue is not trivial to debug. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Reopened] (SPARK-1911) Warn users that jars should be built with Java 6 for YARN
[ https://issues.apache.org/jira/browse/SPARK-1911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or reopened SPARK-1911: -- Looks like we still haven't fixed this. At the very least we should explicitly add this warning in the documentation. The next step that is more involved is to add a warning prompt in the mvn build scripts whenever Java 7+ is detected. Warn users that jars should be built with Java 6 for YARN - Key: SPARK-1911 URL: https://issues.apache.org/jira/browse/SPARK-1911 Project: Spark Issue Type: Sub-task Components: Documentation Affects Versions: 1.1.0 Reporter: Andrew Or Fix For: 1.1.0 Python sometimes fails to read jars created by Java 7. This is necessary for PySpark to work in YARN, and so Spark assembly JAR should compiled in Java 6 for PySpark to work on YARN. Currently we warn users only in make-distribution.sh, but most users build the jars directly. We should emphasize it in the docs especially for PySpark and YARN because this issue is not trivial to debug. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1911) Warn users if their jars are not built with Java 6
[ https://issues.apache.org/jira/browse/SPARK-1911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-1911: - Fix Version/s: (was: 1.0.0) 1.1.0 Warn users if their jars are not built with Java 6 -- Key: SPARK-1911 URL: https://issues.apache.org/jira/browse/SPARK-1911 Project: Spark Issue Type: Sub-task Components: Documentation Affects Versions: 1.1.0 Reporter: Andrew Or Fix For: 1.1.0 Currently we warn users only in make-distribution.sh, but most users build the jars directly. We should emphasize it in the docs especially for PySpark and YARN because this issue is not trivial to debug. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1911) Warn users if their jars are not built with Java 6
[ https://issues.apache.org/jira/browse/SPARK-1911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-1911: - Affects Version/s: 1.1.0 Warn users if their jars are not built with Java 6 -- Key: SPARK-1911 URL: https://issues.apache.org/jira/browse/SPARK-1911 Project: Spark Issue Type: Sub-task Components: Documentation Affects Versions: 1.1.0 Reporter: Andrew Or Fix For: 1.1.0 Currently we warn users only in make-distribution.sh, but most users build the jars directly. We should emphasize it in the docs especially for PySpark and YARN because this issue is not trivial to debug. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1911) Warn users if their jars are not built with Java 6
[ https://issues.apache.org/jira/browse/SPARK-1911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-1911: - Description: Currently we warn users only in make-distribution.sh, but most users build the jars directly. We should emphasize it in the docs especially for PySpark and YARN because this issue is not trivial to debug. was: Python sometimes fails to read jars created by Java 7. This is necessary for PySpark to work in YARN, and so Spark assembly JAR should compiled in Java 6 for PySpark to work on YARN. Currently we warn users only in make-distribution.sh, but most users build the jars directly. We should emphasize it in the docs especially for PySpark and YARN because this issue is not trivial to debug. Warn users if their jars are not built with Java 6 -- Key: SPARK-1911 URL: https://issues.apache.org/jira/browse/SPARK-1911 Project: Spark Issue Type: Sub-task Components: Documentation Affects Versions: 1.1.0 Reporter: Andrew Or Fix For: 1.1.0 Currently we warn users only in make-distribution.sh, but most users build the jars directly. We should emphasize it in the docs especially for PySpark and YARN because this issue is not trivial to debug. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1911) Warn users if their assembly jars are not built with Java 6
[ https://issues.apache.org/jira/browse/SPARK-1911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-1911: - Summary: Warn users if their assembly jars are not built with Java 6 (was: Warn users if their jars are not built with Java 6) Warn users if their assembly jars are not built with Java 6 --- Key: SPARK-1911 URL: https://issues.apache.org/jira/browse/SPARK-1911 Project: Spark Issue Type: Sub-task Components: Documentation Affects Versions: 1.1.0 Reporter: Andrew Or Fix For: 1.1.0 Currently we warn users only in make-distribution.sh, but most users build the jars directly. We should emphasize it in the docs especially for PySpark and YARN because this issue is not trivial to debug. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1911) Warn users if their assembly jars are not built with Java 6
[ https://issues.apache.org/jira/browse/SPARK-1911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-1911: - Description: The root cause of the problem is detailed in: https://issues.apache.org/jira/browse/SPARK-1520. In short, an assembly jar built with Java 7+ is not always accessible by Python or other versions of Java (especially Java 6). If the assembly jar is not built on the cluster itself, this problem may manifest itself in strange exceptions that are not trivial to debug. Currently we warn users only in make-distribution.sh, but most users build the jars directly. We should emphasize it in the docs especially for PySpark and YARN because this issue is not trivial to debug. was: Currently we warn users only in make-distribution.sh, but most users build the jars directly. We should emphasize it in the docs especially for PySpark and YARN because this issue is not trivial to debug. Warn users if their assembly jars are not built with Java 6 --- Key: SPARK-1911 URL: https://issues.apache.org/jira/browse/SPARK-1911 Project: Spark Issue Type: Sub-task Components: Documentation Affects Versions: 1.1.0 Reporter: Andrew Or Fix For: 1.1.0 The root cause of the problem is detailed in: https://issues.apache.org/jira/browse/SPARK-1520. In short, an assembly jar built with Java 7+ is not always accessible by Python or other versions of Java (especially Java 6). If the assembly jar is not built on the cluster itself, this problem may manifest itself in strange exceptions that are not trivial to debug. Currently we warn users only in make-distribution.sh, but most users build the jars directly. We should emphasize it in the docs especially for PySpark and YARN because this issue is not trivial to debug. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1891) Add admin acls to the Web UI
[ https://issues.apache.org/jira/browse/SPARK-1891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14042516#comment-14042516 ] Thomas Graves commented on SPARK-1891: -- https://github.com/apache/spark/pull/1196 combined pull request with SPARK-1890 Add admin acls to the Web UI Key: SPARK-1891 URL: https://issues.apache.org/jira/browse/SPARK-1891 Project: Spark Issue Type: Improvement Components: Web UI Affects Versions: 1.0.0 Reporter: Thomas Graves Assignee: Thomas Graves Fix For: 1.1.0 In some environments you have multi-tenant clusters. In those environments you sometimes have users, operations, support teams, and framework development teams (Spark, MR, etc developers). When users have issues with their application they can ask for help from support teams or development. In those cases you want the support team or development teams to be able to see the application UI so they can help the user debug what is going on without having to rerun the application. In order to more easily do that we can add a concept of admin acls to the UI. This is the list of users who are considered admins for the cluster and would always have permission to view the UI. Generally this list would be set at the cluster level and picked up by anyone running on that cluster. We should add admin acls to the Spark web ui to allow this. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1911) Warn users if their jars are not built with Java 6
[ https://issues.apache.org/jira/browse/SPARK-1911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-1911: - Summary: Warn users if their jars are not built with Java 6 (was: Warn users that jars should be built with Java 6) Warn users if their jars are not built with Java 6 -- Key: SPARK-1911 URL: https://issues.apache.org/jira/browse/SPARK-1911 Project: Spark Issue Type: Sub-task Components: Documentation Affects Versions: 1.1.0 Reporter: Andrew Or Fix For: 1.1.0 Python sometimes fails to read jars created by Java 7. This is necessary for PySpark to work in YARN, and so Spark assembly JAR should compiled in Java 6 for PySpark to work on YARN. Currently we warn users only in make-distribution.sh, but most users build the jars directly. We should emphasize it in the docs especially for PySpark and YARN because this issue is not trivial to debug. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1911) Warn users if their assembly jars are not built with Java 6
[ https://issues.apache.org/jira/browse/SPARK-1911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-1911: - Description: The root cause of the problem is detailed in: https://issues.apache.org/jira/browse/SPARK-1520. In short, an assembly jar built with Java 7+ is not always accessible by Python or other versions of Java (especially Java 6). If the assembly jar is not built on the cluster itself, this problem may manifest itself in strange exceptions that are not trivial to debug. This is an issue especially for PySpark on YARN, which relies on the python files included within the assembly jar. Currently we warn users only in make-distribution.sh, but most users build the jars directly. At the very least we need to emphasize this in the docs (currently missing entirely). The next step is to add a warning prompt in the mvn scripts whenever Java 7+ is detected. was: The root cause of the problem is detailed in: https://issues.apache.org/jira/browse/SPARK-1520. In short, an assembly jar built with Java 7+ is not always accessible by Python or other versions of Java (especially Java 6). If the assembly jar is not built on the cluster itself, this problem may manifest itself in strange exceptions that are not trivial to debug. Currently we warn users only in make-distribution.sh, but most users build the jars directly. We should emphasize it in the docs especially for PySpark and YARN because this issue is not trivial to debug. Warn users if their assembly jars are not built with Java 6 --- Key: SPARK-1911 URL: https://issues.apache.org/jira/browse/SPARK-1911 Project: Spark Issue Type: Sub-task Components: Documentation Affects Versions: 1.1.0 Reporter: Andrew Or Fix For: 1.1.0 The root cause of the problem is detailed in: https://issues.apache.org/jira/browse/SPARK-1520. In short, an assembly jar built with Java 7+ is not always accessible by Python or other versions of Java (especially Java 6). If the assembly jar is not built on the cluster itself, this problem may manifest itself in strange exceptions that are not trivial to debug. This is an issue especially for PySpark on YARN, which relies on the python files included within the assembly jar. Currently we warn users only in make-distribution.sh, but most users build the jars directly. At the very least we need to emphasize this in the docs (currently missing entirely). The next step is to add a warning prompt in the mvn scripts whenever Java 7+ is detected. -- This message was sent by Atlassian JIRA (v6.2#6252)
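The updated description above proposes warning whenever the build runs on Java 7+. As a rough illustration of that kind of check (not the actual SPARK-1911 patch, which would live in the maven/make-distribution scripts), a version probe could look like this:
{code}
// Illustrative sketch only: warn when building on Java 7+, since the resulting
// assembly may not be accessible to Python or Java 6 (see SPARK-1520).
object JavaVersionWarning {
  private def featureVersion(spec: String): Int = spec.split('.') match {
    case Array("1", minor, _*) => minor.toInt   // "1.6", "1.7", ...
    case Array(major, _*)      => major.toInt   // "9", "11", ... on newer JVMs
  }

  def warnIfNewerThanJava6(): Unit = {
    val spec = sys.props.getOrElse("java.specification.version", "")
    if (spec.nonEmpty && featureVersion(spec) > 6) {
      Console.err.println(s"Warning: building with Java $spec; assemblies built with " +
        "Java 7+ may not be readable by Python or Java 6.")
    }
  }
}
{code}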
[jira] [Commented] (SPARK-2244) pyspark - RDD action hangs (after previously succeeding)
[ https://issues.apache.org/jira/browse/SPARK-2244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14042523#comment-14042523 ] Matthew Farrellee commented on SPARK-2244: -- i have a theory - after a long bisect session the following commit was implicated - 3870248740d83b0292ccca88a494ce19783847f0 is the first bad commit commit 3870248740d83b0292ccca88a494ce19783847f0 Author: Kay Ousterhout kayousterh...@gmail.com Date: Wed Jun 18 13:16:26 2014 -0700 in that commit stderr is captured into a PIPE for the first time theory is the pipe is filling a buffer that is never drained, resulting in an eventual hang in communication. testing this theory by adding an additional EchoOutputThread for proc.stderr, which appears to resolve the issue i'll come up with an appropriate fix and send a pull request pyspark - RDD action hangs (after previously succeeding) Key: SPARK-2244 URL: https://issues.apache.org/jira/browse/SPARK-2244 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.1.0 Environment: system: fedora 20 w/ maven 3.1.1 and openjdk 1.7.0_55 1.8.0_05 code: sha b88238fa (master on 23 june 2014) cluster: make-distribution.sh followed by ./dist/sbin/start-all.sh (running locally) Reporter: Matthew Farrellee Labels: openjdk, pyspark, python, shell, spark {code} $ ./dist/bin/pyspark Python 2.7.5 (default, Feb 19 2014, 13:47:28) [GCC 4.8.2 20131212 (Red Hat 4.8.2-7)] on linux2 Type help, copyright, credits or license for more information. Welcome to __ / __/__ ___ _/ /__ _\ \/ _ \/ _ `/ __/ '_/ /__ / .__/\_,_/_/ /_/\_\ version 1.0.0-SNAPSHOT /_/ Using Python version 2.7.5 (default, Feb 19 2014 13:47:28) SparkContext available as sc. hundy = sc.parallelize(range(100)) hundy.count() 100 hundy.count() 100 hundy.count() 100 [repeat until hang, ctrl-C to get] hundy.count() ^CTraceback (most recent call last): File stdin, line 1, in module File /home/matt/Documents/Repositories/spark/dist/python/pyspark/rdd.py, line 774, in count return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum() File /home/matt/Documents/Repositories/spark/dist/python/pyspark/rdd.py, line 765, in sum return self.mapPartitions(lambda x: [sum(x)]).reduce(operator.add) File /home/matt/Documents/Repositories/spark/dist/python/pyspark/rdd.py, line 685, in reduce vals = self.mapPartitions(func).collect() File /home/matt/Documents/Repositories/spark/dist/python/pyspark/rdd.py, line 649, in collect bytesInJava = self._jrdd.collect().iterator() File /home/matt/Documents/Repositories/spark/dist/python/lib/py4j-0.8.1-src.zip/py4j/java_gateway.py, line 535, in __call__ File /home/matt/Documents/Repositories/spark/dist/python/lib/py4j-0.8.1-src.zip/py4j/java_gateway.py, line 363, in send_command File /home/matt/Documents/Repositories/spark/dist/python/lib/py4j-0.8.1-src.zip/py4j/java_gateway.py, line 472, in send_command File /usr/lib64/python2.7/socket.py, line 430, in readline data = recv(1) KeyboardInterrupt {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
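The theory above is that once the child process's stderr is captured into a PIPE, the OS pipe buffer fills up with nothing draining it, eventually blocking the child. The actual fix belongs in PySpark's gateway code, but the general pattern is a dedicated drainer thread per captured stream; a minimal Scala sketch of that pattern, illustrative only:
{code}
import java.io.{BufferedReader, InputStream, InputStreamReader}

// Drain a child process's stderr continuously so the pipe buffer never fills and
// blocks the child, the failure mode theorized in the comment above.
class EchoOutputThread(in: InputStream, prefix: String) extends Thread {
  setDaemon(true)
  override def run(): Unit = {
    val reader = new BufferedReader(new InputStreamReader(in))
    var line = reader.readLine()
    while (line != null) {
      System.err.println(prefix + line)
      line = reader.readLine()
    }
  }
}

// Usage sketch: new EchoOutputThread(proc.getErrorStream, "[worker] ").start()
{code}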
[jira] [Commented] (SPARK-2244) pyspark - RDD action hangs (after previously succeeding)
[ https://issues.apache.org/jira/browse/SPARK-2244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14042529#comment-14042529 ] Reynold Xin commented on SPARK-2244: Is this related? https://github.com/apache/spark/pull/1178 pyspark - RDD action hangs (after previously succeeding) Key: SPARK-2244 URL: https://issues.apache.org/jira/browse/SPARK-2244 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.1.0 Environment: system: fedora 20 w/ maven 3.1.1 and openjdk 1.7.0_55 1.8.0_05 code: sha b88238fa (master on 23 june 2014) cluster: make-distribution.sh followed by ./dist/sbin/start-all.sh (running locally) Reporter: Matthew Farrellee Labels: openjdk, pyspark, python, shell, spark {code} $ ./dist/bin/pyspark Python 2.7.5 (default, Feb 19 2014, 13:47:28) [GCC 4.8.2 20131212 (Red Hat 4.8.2-7)] on linux2 Type help, copyright, credits or license for more information. Welcome to __ / __/__ ___ _/ /__ _\ \/ _ \/ _ `/ __/ '_/ /__ / .__/\_,_/_/ /_/\_\ version 1.0.0-SNAPSHOT /_/ Using Python version 2.7.5 (default, Feb 19 2014 13:47:28) SparkContext available as sc. hundy = sc.parallelize(range(100)) hundy.count() 100 hundy.count() 100 hundy.count() 100 [repeat until hang, ctrl-C to get] hundy.count() ^CTraceback (most recent call last): File stdin, line 1, in module File /home/matt/Documents/Repositories/spark/dist/python/pyspark/rdd.py, line 774, in count return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum() File /home/matt/Documents/Repositories/spark/dist/python/pyspark/rdd.py, line 765, in sum return self.mapPartitions(lambda x: [sum(x)]).reduce(operator.add) File /home/matt/Documents/Repositories/spark/dist/python/pyspark/rdd.py, line 685, in reduce vals = self.mapPartitions(func).collect() File /home/matt/Documents/Repositories/spark/dist/python/pyspark/rdd.py, line 649, in collect bytesInJava = self._jrdd.collect().iterator() File /home/matt/Documents/Repositories/spark/dist/python/lib/py4j-0.8.1-src.zip/py4j/java_gateway.py, line 535, in __call__ File /home/matt/Documents/Repositories/spark/dist/python/lib/py4j-0.8.1-src.zip/py4j/java_gateway.py, line 363, in send_command File /home/matt/Documents/Repositories/spark/dist/python/lib/py4j-0.8.1-src.zip/py4j/java_gateway.py, line 472, in send_command File /usr/lib64/python2.7/socket.py, line 430, in readline data = recv(1) KeyboardInterrupt {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2244) pyspark - RDD action hangs (after previously succeeding)
[ https://issues.apache.org/jira/browse/SPARK-2244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14042536#comment-14042536 ] Matthew Farrellee commented on SPARK-2244: -- yes, but i prefer my solution - https://github.com/apache/spark/pull/1197 pyspark - RDD action hangs (after previously succeeding) Key: SPARK-2244 URL: https://issues.apache.org/jira/browse/SPARK-2244 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.1.0 Environment: system: fedora 20 w/ maven 3.1.1 and openjdk 1.7.0_55 1.8.0_05 code: sha b88238fa (master on 23 june 2014) cluster: make-distribution.sh followed by ./dist/sbin/start-all.sh (running locally) Reporter: Matthew Farrellee Labels: openjdk, pyspark, python, shell, spark {code} $ ./dist/bin/pyspark Python 2.7.5 (default, Feb 19 2014, 13:47:28) [GCC 4.8.2 20131212 (Red Hat 4.8.2-7)] on linux2 Type help, copyright, credits or license for more information. Welcome to __ / __/__ ___ _/ /__ _\ \/ _ \/ _ `/ __/ '_/ /__ / .__/\_,_/_/ /_/\_\ version 1.0.0-SNAPSHOT /_/ Using Python version 2.7.5 (default, Feb 19 2014 13:47:28) SparkContext available as sc. hundy = sc.parallelize(range(100)) hundy.count() 100 hundy.count() 100 hundy.count() 100 [repeat until hang, ctrl-C to get] hundy.count() ^CTraceback (most recent call last): File stdin, line 1, in module File /home/matt/Documents/Repositories/spark/dist/python/pyspark/rdd.py, line 774, in count return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum() File /home/matt/Documents/Repositories/spark/dist/python/pyspark/rdd.py, line 765, in sum return self.mapPartitions(lambda x: [sum(x)]).reduce(operator.add) File /home/matt/Documents/Repositories/spark/dist/python/pyspark/rdd.py, line 685, in reduce vals = self.mapPartitions(func).collect() File /home/matt/Documents/Repositories/spark/dist/python/pyspark/rdd.py, line 649, in collect bytesInJava = self._jrdd.collect().iterator() File /home/matt/Documents/Repositories/spark/dist/python/lib/py4j-0.8.1-src.zip/py4j/java_gateway.py, line 535, in __call__ File /home/matt/Documents/Repositories/spark/dist/python/lib/py4j-0.8.1-src.zip/py4j/java_gateway.py, line 363, in send_command File /home/matt/Documents/Repositories/spark/dist/python/lib/py4j-0.8.1-src.zip/py4j/java_gateway.py, line 472, in send_command File /usr/lib64/python2.7/socket.py, line 430, in readline data = recv(1) KeyboardInterrupt {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (SPARK-1718) pyspark doesn't work with assembly jar containing over 65536 files/dirs built on redhat
[ https://issues.apache.org/jira/browse/SPARK-1718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves resolved SPARK-1718. -- Resolution: Duplicate pyspark doesn't work with assembly jar containing over 65536 files/dirs built on redhat Key: SPARK-1718 URL: https://issues.apache.org/jira/browse/SPARK-1718 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.0.0 Reporter: Thomas Graves Recently pyspark was ported to yarn (pr 30), but when I went to try it I couldn't get it work. I was building on a redhat 6 box. I figured out that if the assembly jar file contained over 65536 files/directories then it wouldn't work. If I unjarred the assembly and removed some stuff to get it under 65536 and jarred it back up, then it would work. It appears to only be an issue when building on a redhat box as I can build on my mac and it works just fine there. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2261) Spark application event logs are not very NameNode-friendly
Marcelo Vanzin created SPARK-2261: - Summary: Spark application event logs are not very NameNode-friendly Key: SPARK-2261 URL: https://issues.apache.org/jira/browse/SPARK-2261 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.0.0 Reporter: Marcelo Vanzin Priority: Minor Currently, EventLoggingListener will generate application logs using, in the worst case, five different entries in the file system: * The directory to hold the files * One file for the Spark version * One file for the event logs * One file to identify the compression codec of the event logs * One file to say the application is finished. It would be better to be more friendly to the NameNodes and use a single entry for all of those. -- This message was sent by Atlassian JIRA (v6.2#6252)
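A purely illustrative sketch of the consolidation idea (the file layout and field names below are assumptions, not the eventual implementation): the version, codec, and completion marker become header and footer lines of a single log file, so the NameNode tracks one entry instead of five.
{code}
import java.io.PrintWriter

// Hypothetical single-file layout, for illustration only.
def writeEventLog(path: String, sparkVersion: String, codec: Option[String],
                  events: Iterator[String]): Unit = {
  val out = new PrintWriter(path)                             // one filesystem entry
  try {
    out.println(s"SPARK_VERSION=$sparkVersion")               // replaces the version file
    codec.foreach(c => out.println(s"COMPRESSION_CODEC=$c"))  // replaces the codec file
    events.foreach(e => out.println(e))
    out.println("APPLICATION_COMPLETE")                       // replaces the completion marker
  } finally {
    out.close()
  }
}
{code}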
[jira] [Commented] (SPARK-2262) Extreme Learning Machines (ELM) for MLLib
[ https://issues.apache.org/jira/browse/SPARK-2262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14042731#comment-14042731 ] Erik Erlandson commented on SPARK-2262: --- I'd like to have this assigned to me Extreme Learning Machines (ELM) for MLLib - Key: SPARK-2262 URL: https://issues.apache.org/jira/browse/SPARK-2262 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Erik Erlandson Labels: features MLLib has a gap in the NN space. There's some good reason for this, as batching gradient updates in traditional backprop training is known to not perform well. However, Extreme Learning Machines(ELM) combine support for nonlinear activation functions in a hidden layer with a batch-friendly linear training. There is also a body of ELM literature on various avenues for extension, including multi-category classification, multiple hidden layers and adaptive addition/deletion of hidden nodes. -- This message was sent by Atlassian JIRA (v6.2#6252)
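For context on why ELM training is batch-friendly (background knowledge, not taken from the JIRA): the hidden-layer weights are assigned randomly and fixed, so only the output weights are learned, via a single linear least-squares solve over the whole batch:
{code}
% H = hidden-layer activation matrix over the batch, T = target matrix,
% H^+ = Moore-Penrose pseudoinverse.
\[ H\beta = T \quad\Rightarrow\quad \hat{\beta} = H^{+} T \]
{code}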
[jira] [Commented] (SPARK-2251) MLLib Naive Bayes Example SparkException: Can only zip RDDs with same number of elements in each partition
[ https://issues.apache.org/jira/browse/SPARK-2251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14042738#comment-14042738 ] Sean Owen commented on SPARK-2251: -- For what it's worth, I can reproduce this. In the sample, the test RDD has 2 partitions, containing 2 and 1 examples. The prediction RDD has 2 partitions, containing 1 and 2 examples respectively. So they aren't matched up, even though one is a 1-1 map() of the other. That seems like it shouldn't happen? maybe someone more knowledgeable can say whether that itself should occur. test is a PartitionwiseSampledRDD and prediction is a MappedRDD of course. If it is allowed to happen, then the example should be fixed, and I could easily supply a patch. It can be done without having to zip up RDDs to begin with. MLLib Naive Bayes Example SparkException: Can only zip RDDs with same number of elements in each partition -- Key: SPARK-2251 URL: https://issues.apache.org/jira/browse/SPARK-2251 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.0.0 Environment: OS: Fedora Linux Spark Version: 1.0.0. Git clone from the Spark Repository Reporter: Jun Xie Priority: Minor Labels: Naive-Bayes I follow the exact code from Naive Bayes Example (http://spark.apache.org/docs/latest/mllib-naive-bayes.html) of MLLib. When I executed the final command: val accuracy = 1.0 * predictionAndLabel.filter(x = x._1 == x._2).count() / test.count() It complains Can only zip RDDs with same number of elements in each partition. I got the following exception: 14/06/23 19:39:23 INFO SparkContext: Starting job: count at console:31 14/06/23 19:39:23 INFO DAGScheduler: Got job 3 (count at console:31) with 2 output partitions (allowLocal=false) 14/06/23 19:39:23 INFO DAGScheduler: Final stage: Stage 4(count at console:31) 14/06/23 19:39:23 INFO DAGScheduler: Parents of final stage: List() 14/06/23 19:39:23 INFO DAGScheduler: Missing parents: List() 14/06/23 19:39:23 INFO DAGScheduler: Submitting Stage 4 (FilteredRDD[14] at filter at console:31), which has no missing parents 14/06/23 19:39:23 INFO DAGScheduler: Submitting 2 missing tasks from Stage 4 (FilteredRDD[14] at filter at console:31) 14/06/23 19:39:23 INFO TaskSchedulerImpl: Adding task set 4.0 with 2 tasks 14/06/23 19:39:23 INFO TaskSetManager: Starting task 4.0:0 as TID 8 on executor localhost: localhost (PROCESS_LOCAL) 14/06/23 19:39:23 INFO TaskSetManager: Serialized task 4.0:0 as 3410 bytes in 0 ms 14/06/23 19:39:23 INFO TaskSetManager: Starting task 4.0:1 as TID 9 on executor localhost: localhost (PROCESS_LOCAL) 14/06/23 19:39:23 INFO TaskSetManager: Serialized task 4.0:1 as 3410 bytes in 1 ms 14/06/23 19:39:23 INFO Executor: Running task ID 8 14/06/23 19:39:23 INFO Executor: Running task ID 9 14/06/23 19:39:23 INFO BlockManager: Found block broadcast_0 locally 14/06/23 19:39:23 INFO BlockManager: Found block broadcast_0 locally 14/06/23 19:39:23 INFO HadoopRDD: Input split: file:/home/jun/open_source/spark/mllib/data/sample_naive_bayes_data.txt:0+24 14/06/23 19:39:23 INFO HadoopRDD: Input split: file:/home/jun/open_source/spark/mllib/data/sample_naive_bayes_data.txt:24+24 14/06/23 19:39:23 INFO HadoopRDD: Input split: file:/home/jun/open_source/spark/mllib/data/sample_naive_bayes_data.txt:0+24 14/06/23 19:39:23 INFO HadoopRDD: Input split: file:/home/jun/open_source/spark/mllib/data/sample_naive_bayes_data.txt:24+24 14/06/23 19:39:23 ERROR Executor: Exception in task ID 9 org.apache.spark.SparkException: Can only zip RDDs with same number of elements 
in each partition at org.apache.spark.rdd.RDD$$anonfun$zip$1$$anon$1.hasNext(RDD.scala:663) at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:388) at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1067) at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:858) at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:858) at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1079) at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1079) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111) at org.apache.spark.scheduler.Task.run(Task.scala:51) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:724) 14/06/23 19:39:23 ERROR Executor: Exception in task ID 8
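The zip-free rewrite Sean Owen suggests above amounts to computing the (prediction, label) pairs in a single map over the test set, so partitioning can never disagree. A minimal sketch, assuming the `model` and `test` values from the linked docs example rather than anything defined in this JIRA:
{code}
// Zip-free version of the accuracy computation from the Naive Bayes example.
val predictionAndLabel = test.map(p => (model.predict(p.features), p.label))
val accuracy =
  1.0 * predictionAndLabel.filter { case (pred, label) => pred == label }.count() / test.count()
{code}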
[jira] [Created] (SPARK-2263) Can't insert Map&lt;K, V&gt; values to Hive tables

Cheng Lian created SPARK-2263: - Summary: Can't insert MapK, V values to Hive tables Key: SPARK-2263 URL: https://issues.apache.org/jira/browse/SPARK-2263 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.0.0 Reporter: Cheng Lian Scala {{Map\[K, V\]}} values are not converted to their Java correspondence: {code} scala loadTestTable(src) scala hql(create table m(value mapint, string)) res1: org.apache.spark.sql.SchemaRDD = SchemaRDD[0] at RDD at SchemaRDD.scala:100 == Query Plan == Native command: executed by Hive scala hql(insert overwrite table m select map(key, value) from src) org.apache.spark.SparkException: Job aborted due to stage failure: Task 0.0:0 failed 1 times, most recent failure: Exception failure in TID 0 on host localhost: java.lang.ClassCastException: scala.collection.immutable.Map$Map1 cannot be cast to java.util.Map org.apache.hadoop.hive.serde2.objectinspector.StandardMapObjectInspector.getMap(StandardMapObjectInspector.java:82) org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.serialize(LazySimpleSerDe.java:515) org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.serializeField(LazySimpleSerDe.java:439) org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.serialize(LazySimpleSerDe.java:423) org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$2$$anonfun$apply$1.apply(InsertIntoHiveTable.scala:200) org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$2$$anonfun$apply$1.apply(InsertIntoHiveTable.scala:192) scala.collection.Iterator$$anon$11.next(Iterator.scala:328) org.apache.spark.sql.hive.execution.InsertIntoHiveTable.org$apache$spark$sql$hive$execution$InsertIntoHiveTable$$writeToFile$1(InsertIntoHiveTable.scala:149) org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$1.apply(InsertIntoHiveTable.scala:158) org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$1.apply(InsertIntoHiveTable.scala:158) org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:112) org.apache.spark.scheduler.Task.run(Task.scala:51) org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187) java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) java.lang.Thread.run(Thread.java:745) Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1040) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1024) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1022) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1022) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:640) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:640) at scala.Option.foreach(Option.scala:236) at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:640) at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1214) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) at akka.actor.ActorCell.invoke(ActorCell.scala:456) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) at 
akka.dispatch.Mailbox.run(Mailbox.scala:219) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) scala {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
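The ClassCastException above comes from handing a Scala immutable Map to Hive's StandardMapObjectInspector, which expects a java.util.Map. A minimal sketch of the missing conversion (the real fix belongs in the Hive insertion path, per the PR referenced later in this digest):
{code}
import scala.collection.JavaConverters._

// Wrap a Scala Map so Hive's object inspectors see a java.util.Map.
def toJavaMap[K, V](m: Map[K, V]): java.util.Map[K, V] = m.asJava
{code}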
[jira] [Created] (SPARK-2264) CachedTableSuite SQL Tests are Failing
Patrick Wendell created SPARK-2264: -- Summary: CachedTableSuite SQL Tests are Failing Key: SPARK-2264 URL: https://issues.apache.org/jira/browse/SPARK-2264 Project: Spark Issue Type: Bug Reporter: Patrick Wendell Assignee: Michael Armbrust Priority: Blocker {code} [info] CachedTableSuite: [info] - read from cached table and uncache *** FAILED *** [info] java.lang.RuntimeException: Table Not Found: testData [info] at scala.sys.package$.error(package.scala:27) [info] at org.apache.spark.sql.catalyst.analysis.SimpleCatalog$$anonfun$1.apply(Catalog.scala:64) [info] at org.apache.spark.sql.catalyst.analysis.SimpleCatalog$$anonfun$1.apply(Catalog.scala:64) [info] at scala.Option.getOrElse(Option.scala:120) [info] at org.apache.spark.sql.catalyst.analysis.SimpleCatalog.lookupRelation(Catalog.scala:64) [info] at org.apache.spark.sql.SQLContext.table(SQLContext.scala:185) [info] at org.apache.spark.sql.CachedTableSuite$$anonfun$1.apply$mcV$sp(CachedTableSuite.scala:43) [info] at org.apache.spark.sql.CachedTableSuite$$anonfun$1.apply(CachedTableSuite.scala:27) [info] at org.apache.spark.sql.CachedTableSuite$$anonfun$1.apply(CachedTableSuite.scala:27) [info] at org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22) [info] ... [info] - correct error on uncache of non-cached table *** FAILED *** [info] Expected exception java.lang.IllegalArgumentException to be thrown, but java.lang.RuntimeException was thrown. (CachedTableSuite.scala:55) [info] - SELECT Star Cached Table *** FAILED *** [info] java.lang.RuntimeException: Table Not Found: testData [info] at scala.sys.package$.error(package.scala:27) [info] at org.apache.spark.sql.catalyst.analysis.SimpleCatalog$$anonfun$1.apply(Catalog.scala:64) [info] at org.apache.spark.sql.catalyst.analysis.SimpleCatalog$$anonfun$1.apply(Catalog.scala:64) [info] at scala.Option.getOrElse(Option.scala:120) [info] at org.apache.spark.sql.catalyst.analysis.SimpleCatalog.lookupRelation(Catalog.scala:64) [info] at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$1.applyOrElse(Analyzer.scala:67) [info] at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$1.applyOrElse(Analyzer.scala:65) [info] at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:165) [info] at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:183) [info] at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) [info] ... [info] - Self-join cached *** FAILED *** [info] java.lang.RuntimeException: Table Not Found: testData [info] at scala.sys.package$.error(package.scala:27) [info] at org.apache.spark.sql.catalyst.analysis.SimpleCatalog$$anonfun$1.apply(Catalog.scala:64) [info] at org.apache.spark.sql.catalyst.analysis.SimpleCatalog$$anonfun$1.apply(Catalog.scala:64) [info] at scala.Option.getOrElse(Option.scala:120) [info] at org.apache.spark.sql.catalyst.analysis.SimpleCatalog.lookupRelation(Catalog.scala:64) [info] at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$1.applyOrElse(Analyzer.scala:67) [info] at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$1.applyOrElse(Analyzer.scala:65) [info] at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:165) [info] at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:183) [info] at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) [info] ... 
[info] - 'CACHE TABLE' and 'UNCACHE TABLE' SQL statement *** FAILED *** [info] java.lang.RuntimeException: Table Not Found: testData [info] at scala.sys.package$.error(package.scala:27) [info] at org.apache.spark.sql.catalyst.analysis.SimpleCatalog$$anonfun$1.apply(Catalog.scala:64) [info] at org.apache.spark.sql.catalyst.analysis.SimpleCatalog$$anonfun$1.apply(Catalog.scala:64) [info] at scala.Option.getOrElse(Option.scala:120) [info] at org.apache.spark.sql.catalyst.analysis.SimpleCatalog.lookupRelation(Catalog.scala:64) [info] at org.apache.spark.sql.SQLContext.cacheTable(SQLContext.scala:189) [info] at org.apache.spark.sql.execution.CacheCommand.sideEffectResult$lzycompute(commands.scala:110) [info] at org.apache.spark.sql.execution.CacheCommand.sideEffectResult(commands.scala:108) [info] at org.apache.spark.sql.execution.CacheCommand.execute(commands.scala:118) [info] at org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:322) [info] ... {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2264) CachedTableSuite SQL Tests are Failing
[ https://issues.apache.org/jira/browse/SPARK-2264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-2264: Component/s: SQL CachedTableSuite SQL Tests are Failing -- Key: SPARK-2264 URL: https://issues.apache.org/jira/browse/SPARK-2264 Project: Spark Issue Type: Bug Components: SQL Reporter: Patrick Wendell Assignee: Michael Armbrust Priority: Blocker {code} [info] CachedTableSuite: [info] - read from cached table and uncache *** FAILED *** [info] java.lang.RuntimeException: Table Not Found: testData [info] at scala.sys.package$.error(package.scala:27) [info] at org.apache.spark.sql.catalyst.analysis.SimpleCatalog$$anonfun$1.apply(Catalog.scala:64) [info] at org.apache.spark.sql.catalyst.analysis.SimpleCatalog$$anonfun$1.apply(Catalog.scala:64) [info] at scala.Option.getOrElse(Option.scala:120) [info] at org.apache.spark.sql.catalyst.analysis.SimpleCatalog.lookupRelation(Catalog.scala:64) [info] at org.apache.spark.sql.SQLContext.table(SQLContext.scala:185) [info] at org.apache.spark.sql.CachedTableSuite$$anonfun$1.apply$mcV$sp(CachedTableSuite.scala:43) [info] at org.apache.spark.sql.CachedTableSuite$$anonfun$1.apply(CachedTableSuite.scala:27) [info] at org.apache.spark.sql.CachedTableSuite$$anonfun$1.apply(CachedTableSuite.scala:27) [info] at org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22) [info] ... [info] - correct error on uncache of non-cached table *** FAILED *** [info] Expected exception java.lang.IllegalArgumentException to be thrown, but java.lang.RuntimeException was thrown. (CachedTableSuite.scala:55) [info] - SELECT Star Cached Table *** FAILED *** [info] java.lang.RuntimeException: Table Not Found: testData [info] at scala.sys.package$.error(package.scala:27) [info] at org.apache.spark.sql.catalyst.analysis.SimpleCatalog$$anonfun$1.apply(Catalog.scala:64) [info] at org.apache.spark.sql.catalyst.analysis.SimpleCatalog$$anonfun$1.apply(Catalog.scala:64) [info] at scala.Option.getOrElse(Option.scala:120) [info] at org.apache.spark.sql.catalyst.analysis.SimpleCatalog.lookupRelation(Catalog.scala:64) [info] at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$1.applyOrElse(Analyzer.scala:67) [info] at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$1.applyOrElse(Analyzer.scala:65) [info] at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:165) [info] at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:183) [info] at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) [info] ... 
[info] - Self-join cached *** FAILED *** [info] java.lang.RuntimeException: Table Not Found: testData [info] at scala.sys.package$.error(package.scala:27) [info] at org.apache.spark.sql.catalyst.analysis.SimpleCatalog$$anonfun$1.apply(Catalog.scala:64) [info] at org.apache.spark.sql.catalyst.analysis.SimpleCatalog$$anonfun$1.apply(Catalog.scala:64) [info] at scala.Option.getOrElse(Option.scala:120) [info] at org.apache.spark.sql.catalyst.analysis.SimpleCatalog.lookupRelation(Catalog.scala:64) [info] at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$1.applyOrElse(Analyzer.scala:67) [info] at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$1.applyOrElse(Analyzer.scala:65) [info] at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:165) [info] at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:183) [info] at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) [info] ... [info] - 'CACHE TABLE' and 'UNCACHE TABLE' SQL statement *** FAILED *** [info] java.lang.RuntimeException: Table Not Found: testData [info] at scala.sys.package$.error(package.scala:27) [info] at org.apache.spark.sql.catalyst.analysis.SimpleCatalog$$anonfun$1.apply(Catalog.scala:64) [info] at org.apache.spark.sql.catalyst.analysis.SimpleCatalog$$anonfun$1.apply(Catalog.scala:64) [info] at scala.Option.getOrElse(Option.scala:120) [info] at org.apache.spark.sql.catalyst.analysis.SimpleCatalog.lookupRelation(Catalog.scala:64) [info] at org.apache.spark.sql.SQLContext.cacheTable(SQLContext.scala:189) [info] at org.apache.spark.sql.execution.CacheCommand.sideEffectResult$lzycompute(commands.scala:110) [info] at org.apache.spark.sql.execution.CacheCommand.sideEffectResult(commands.scala:108) [info] at org.apache.spark.sql.execution.CacheCommand.execute(commands.scala:118) [info] at
[jira] [Commented] (SPARK-2259) Spark submit documentation for --deploy-mode is highly misleading
[ https://issues.apache.org/jira/browse/SPARK-2259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14042811#comment-14042811 ] Andrew Or commented on SPARK-2259: -- https://github.com/apache/spark/pull/1200 Spark submit documentation for --deploy-mode is highly misleading - Key: SPARK-2259 URL: https://issues.apache.org/jira/browse/SPARK-2259 Project: Spark Issue Type: Bug Components: Documentation Affects Versions: 1.1.0 Reporter: Andrew Or Assignee: Andrew Or Priority: Critical There are a few issues: 1. Client mode does not necessarily mean the driver program must be launched outside of the cluster. 2. For standalone clusters, only client mode is currently supported. This was the case supported even before 1.0. Currently, the docs tell the user to use cluster deploy mode when deploying your driver program within the cluster, which is true also for standalone-client mode. In short, the docs encourage the user to use standalone-cluster, an unsupported mode. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2266) Log page on Worker UI displays Some(app-id)
[ https://issues.apache.org/jira/browse/SPARK-2266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-2266: - Attachment: Screen Shot 2014-06-24 at 5.07.54 PM.png Log page on Worker UI displays Some(app-id) - Key: SPARK-2266 URL: https://issues.apache.org/jira/browse/SPARK-2266 Project: Spark Issue Type: Bug Affects Versions: 1.1.0 Reporter: Andrew Or Priority: Minor Fix For: 1.1.0 Attachments: Screen Shot 2014-06-24 at 5.07.54 PM.png Oops. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2267) Log the exception when TaskResultGetter fails to fetch/deserialize task result
[ https://issues.apache.org/jira/browse/SPARK-2267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-2267: --- Summary: Log the exception when TaskResultGetter fails to fetch/deserialize task result (was: Log the exception when we fail to fetch/deserialize task result) Log the exception when TaskResultGetter fails to fetch/deserialize task result -- Key: SPARK-2267 URL: https://issues.apache.org/jira/browse/SPARK-2267 Project: Spark Issue Type: Bug Affects Versions: 1.0.0 Reporter: Reynold Xin Assignee: Reynold Xin In 1.0.0, it is pretty confusing to get an error message that just says: Exception while deserializing and fetching task ... without a stack trace. This has been fixed in master by [~matei] https://github.com/apache/spark/commit/a01f3401e32ca4324884d13c9fad53c6c87bb5f0 -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2267) Log exception when TaskResultGetter fails to fetch/deserialize task result
[ https://issues.apache.org/jira/browse/SPARK-2267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-2267: --- Summary: Log exception when TaskResultGetter fails to fetch/deserialize task result (was: Log the exception when TaskResultGetter fails to fetch/deserialize task result) Log exception when TaskResultGetter fails to fetch/deserialize task result -- Key: SPARK-2267 URL: https://issues.apache.org/jira/browse/SPARK-2267 Project: Spark Issue Type: Bug Affects Versions: 1.0.0 Reporter: Reynold Xin Assignee: Reynold Xin In 1.0.0, it is pretty confusing to get an error message that just says: Exception while deserializing and fetching task ... without a stack trace. This has been fixed in master by [~matei] https://github.com/apache/spark/commit/a01f3401e32ca4324884d13c9fad53c6c87bb5f0 -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2268) Utils.createTempDir() creates race with HDFS at shutdown
Marcelo Vanzin created SPARK-2268: - Summary: Utils.createTempDir() creates race with HDFS at shutdown Key: SPARK-2268 URL: https://issues.apache.org/jira/browse/SPARK-2268 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.0 Reporter: Marcelo Vanzin Utils.createTempDir() has this code: {code} // Add a shutdown hook to delete the temp dir when the JVM exits Runtime.getRuntime.addShutdownHook(new Thread("delete Spark temp dir " + dir) { override def run() { // Attempt to delete if some path which is parent of this is not already registered. if (! hasRootAsShutdownDeleteDir(dir)) Utils.deleteRecursively(dir) } }) {code} This creates a race with the shutdown hooks registered by HDFS, since the order of execution is undefined; if the HDFS hooks run first, you'll get exceptions about the file system being closed. Instead, this should use Hadoop's ShutdownHookManager with a proper priority, so that it runs before the HDFS hooks. -- This message was sent by Atlassian JIRA (v6.2#6252)
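A sketch of the suggested alternative, assuming Hadoop 2.x's org.apache.hadoop.util.ShutdownHookManager, where higher-priority hooks run earlier. The priority constant below is an assumption, and `dir`, `hasRootAsShutdownDeleteDir` and `deleteRecursively` are the names from the quoted snippet:
{code}
import org.apache.hadoop.util.ShutdownHookManager

// Register the temp-dir cleanup with a priority above the HDFS FileSystem hook so
// it runs before the HDFS client is closed (priority value is illustrative).
val TEMP_DIR_SHUTDOWN_PRIORITY = 25
ShutdownHookManager.get().addShutdownHook(new Runnable {
  override def run(): Unit = {
    if (!hasRootAsShutdownDeleteDir(dir)) Utils.deleteRecursively(dir)
  }
}, TEMP_DIR_SHUTDOWN_PRIORITY)
{code}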
[jira] [Commented] (SPARK-2192) Examples Data Not in Binary Distribution
[ https://issues.apache.org/jira/browse/SPARK-2192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14042938#comment-14042938 ] Henry Saputra commented on SPARK-2192: -- I think several tests already have the data in the main/resources. Do you have list of which ones missing? Examples Data Not in Binary Distribution Key: SPARK-2192 URL: https://issues.apache.org/jira/browse/SPARK-2192 Project: Spark Issue Type: Bug Components: Build Affects Versions: 1.0.0 Reporter: Pat McDonough The data used by examples is not packaged up with the binary distribution. The data subdirectory of spark should make it's way in to the distribution somewhere so the examples can use it. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2263) Can't insert Map&lt;K, V&gt; values to Hive tables
[ https://issues.apache.org/jira/browse/SPARK-2263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14042941#comment-14042941 ] Cheng Lian commented on SPARK-2263: --- PR: https://github.com/apache/spark/pull/1205 Can't insert MapK, V values to Hive tables Key: SPARK-2263 URL: https://issues.apache.org/jira/browse/SPARK-2263 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.0.0 Reporter: Cheng Lian Scala {{Map\[K, V\]}} values are not converted to their Java correspondence: {code} scala loadTestTable(src) scala hql(create table m(value mapint, string)) res1: org.apache.spark.sql.SchemaRDD = SchemaRDD[0] at RDD at SchemaRDD.scala:100 == Query Plan == Native command: executed by Hive scala hql(insert overwrite table m select map(key, value) from src) org.apache.spark.SparkException: Job aborted due to stage failure: Task 0.0:0 failed 1 times, most recent failure: Exception failure in TID 0 on host localhost: java.lang.ClassCastException: scala.collection.immutable.Map$Map1 cannot be cast to java.util.Map org.apache.hadoop.hive.serde2.objectinspector.StandardMapObjectInspector.getMap(StandardMapObjectInspector.java:82) org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.serialize(LazySimpleSerDe.java:515) org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.serializeField(LazySimpleSerDe.java:439) org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.serialize(LazySimpleSerDe.java:423) org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$2$$anonfun$apply$1.apply(InsertIntoHiveTable.scala:200) org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$2$$anonfun$apply$1.apply(InsertIntoHiveTable.scala:192) scala.collection.Iterator$$anon$11.next(Iterator.scala:328) org.apache.spark.sql.hive.execution.InsertIntoHiveTable.org$apache$spark$sql$hive$execution$InsertIntoHiveTable$$writeToFile$1(InsertIntoHiveTable.scala:149) org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$1.apply(InsertIntoHiveTable.scala:158) org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$1.apply(InsertIntoHiveTable.scala:158) org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:112) org.apache.spark.scheduler.Task.run(Task.scala:51) org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187) java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) java.lang.Thread.run(Thread.java:745) Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1040) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1024) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1022) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1022) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:640) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:640) at scala.Option.foreach(Option.scala:236) at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:640) at 
org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1214) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) at akka.actor.ActorCell.invoke(ActorCell.scala:456) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) at akka.dispatch.Mailbox.run(Mailbox.scala:219) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) scala {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Comment Edited] (SPARK-2245) VertexRDD can not be materialized for checkpointing
[ https://issues.apache.org/jira/browse/SPARK-2245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14041448#comment-14041448 ] Baoxu Shi edited comment on SPARK-2245 at 6/25/14 1:43 AM: --- Hi [~ankurd], I changed my pull request. But there is another exception, ShippableVertexPartition is not serializable. So I serialized it, but there is another exception org.apache.spark.graphx.impl.RoutingTablePartition is not serializable. Then I serialized it again, but on iteration 2 there will be an exception: org.apache.spark.graphx.impl.ShippableVertexPartition cannot be cast to scala.Tuple2 The code I'm using are: val conf = new SparkConf().setAppName(HDTM) .setMaster(local[4]) val sc = new SparkContext(conf) sc.setCheckpointDir(./checkpoint) val v = sc.parallelize(Seq[(VertexId, Long)]((0L, 0L), (1L, 1L), (2L, 2L))) val e = sc.parallelize(Seq[Edge[Long]](Edge(0L, 1L, 0L), Edge(1L, 2L, 1L), Edge(2L, 0L, 2L))) var g = Graph(v, e) val vertexIds = Seq(0L, 1L, 2L) var prevG: Graph[VertexId, Long] = null for (i - 1 to 2000) { vertexIds.toStream.foreach(id = { prevG = g g = Graph(g.vertices, g.edges) g.vertices.cache() g.edges.cache() prevG.unpersistVertices(blocking = false) prevG.edges.unpersist(blocking = false) }) g.vertices.checkpoint() g.edges.checkpoint() g.edges.count() g.vertices.count() println(s${g.vertices.isCheckpointed} ${g.edges.isCheckpointed}) println( iter + i + finished) } println(g.vertices.collect().mkString( )) println(g.edges.collect().mkString( )) Am I on the right track? Or Should there be another way to change it? was (Author: bxshi): Just submit the changes, thanks! VertexRDD can not be materialized for checkpointing --- Key: SPARK-2245 URL: https://issues.apache.org/jira/browse/SPARK-2245 Project: Spark Issue Type: Bug Components: GraphX Reporter: Baoxu Shi Seems one can not materialize VertexRDD by simply calling count method, which is overridden by VertexRDD. But if you call RDD's count, it could materialize it. Is this a feature that designed to get the count without materialize VertexRDD? If so, do you guys think it is necessary to add a materialize method to VertexRDD? By the way, does count() is the cheapest way to materialize a RDD? Or it just cost the same resources like other actions? The pull request is here: https://github.com/apache/spark/pull/1177 Best, -- This message was sent by Atlassian JIRA (v6.2#6252)
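The Scala snippet in the comment above lost its string quotes and `=>` arrows somewhere along the way; a restored and slightly simplified version of its core (the inner foreach over vertexIds is collapsed into the outer loop), for readability only:
{code}
// Restored from the comment above; quotes and arrows re-added.
sc.setCheckpointDir("./checkpoint")
val v = sc.parallelize(Seq[(VertexId, Long)]((0L, 0L), (1L, 1L), (2L, 2L)))
val e = sc.parallelize(Seq[Edge[Long]](Edge(0L, 1L, 0L), Edge(1L, 2L, 1L), Edge(2L, 0L, 2L)))
var g = Graph(v, e)
var prevG: Graph[VertexId, Long] = null
for (i <- 1 to 2000) {
  prevG = g
  g = Graph(g.vertices, g.edges)
  g.vertices.cache()
  g.edges.cache()
  prevG.unpersistVertices(blocking = false)
  prevG.edges.unpersist(blocking = false)
  g.vertices.checkpoint()
  g.edges.checkpoint()
  g.edges.count()
  g.vertices.count()
  println(s"${g.vertices.isCheckpointed} ${g.edges.isCheckpointed}")
  println("iter " + i + " finished")
}
{code}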
[jira] [Created] (SPARK-2269) Clean up and add unit tests for resourceOffers in MesosSchedulerBackend
Patrick Wendell created SPARK-2269: -- Summary: Clean up and add unit tests for resourceOffers in MesosSchedulerBackend Key: SPARK-2269 URL: https://issues.apache.org/jira/browse/SPARK-2269 Project: Spark Issue Type: Bug Components: Mesos Reporter: Patrick Wendell This function could be simplified a bit. We could re-write it without offerableIndices or creating the mesosTasks array as large as the offer list. There is a lot of logic around making sure you get the correct index into mesosTasks and offers, really we should just build mesosTasks directly from the offers we get back. To associate the tasks we are launching with the offers we can just create a hashMap from the slaveId to the original offer. The basic logic of the function is that you take the mesos offers, convert them to spark offers, then convert the results back. One thing we should check is whether Mesos guarantees that it won't give two offers for the same worker. That would make things much more complicated. -- This message was sent by Atlassian JIRA (v6.2#6252)
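A sketch of the restructuring described above, with simplified stand-in types rather than the real Mesos and Spark classes: launched tasks are associated back to the offer they consume through a slaveId-keyed map instead of parallel index-aligned arrays. It assumes Mesos never sends two offers for the same slave in one batch, which is exactly the open question raised in the description.
{code}
// Simplified stand-in types, illustration only.
case class Offer(offerId: String, slaveId: String, cores: Int)
case class LaunchedTask(slaveId: String, taskId: Long)

def groupTasksByOffer(offers: Seq[Offer], tasks: Seq[LaunchedTask]): Map[Offer, Seq[LaunchedTask]] = {
  val offerBySlave = offers.map(o => o.slaveId -> o).toMap   // assumes one offer per slave
  tasks.groupBy(t => offerBySlave(t.slaveId))
}
{code}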
[jira] [Commented] (SPARK-2156) When the size of serialized results for one partition is slightly smaller than 10MB (the default akka.frameSize), the execution blocks
[ https://issues.apache.org/jira/browse/SPARK-2156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14042962#comment-14042962 ] Patrick Wendell commented on SPARK-2156: Fixed in 1.1.0 via: https://github.com/apache/spark/pull/1132 When the size of serialized results for one partition is slightly smaller than 10MB (the default akka.frameSize), the execution blocks -- Key: SPARK-2156 URL: https://issues.apache.org/jira/browse/SPARK-2156 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 0.9.1, 1.0.0 Environment: AWS EC2 1 master 2 slaves with the instance type of r3.2xlarge Reporter: Chen Jin Assignee: Xiangrui Meng Priority: Blocker Fix For: 1.0.1, 1.1.0 Original Estimate: 504h Remaining Estimate: 504h I have done some experiments when the frameSize is around 10MB . 1) spark.akka.frameSize = 10 If one of the partition size is very close to 10MB, say 9.97MB, the execution blocks without any exception or warning. Worker finished the task to send the serialized result, and then throw exception saying hadoop IPC client connection stops (changing the logging to debug level). However, the master never receives the results and the program just hangs. But if sizes for all the partitions less than some number btw 9.96MB amd 9.97MB, the program works fine. 2) spark.akka.frameSize = 9 when the partition size is just a little bit smaller than 9MB, it fails as well. This bug behavior is not exactly what spark-1112 is about. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2270) Kryo cannot serialize results returned by asJavaIterable (and thus groupBy/cogroup are broken in Java APIs when Kryo is used)
Reynold Xin created SPARK-2270: -- Summary: Kryo cannot serialize results returned by asJavaIterable (and thus groupBy/cogroup are broken in Java APIs when Kryo is used) Key: SPARK-2270 URL: https://issues.apache.org/jira/browse/SPARK-2270 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.0 Reporter: Reynold Xin Assignee: Reynold Xin Priority: Critical The combination of Kryo serializer Java API could lead to the following exception in groupBy/groupByKey/cogroup: {code} org.apache.spark.SparkException: Job aborted due to stage failure: Exception while deserializing and fetching task: java.lang.UnsupportedOperationException org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1033) org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1017) org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1015) scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1015) org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:633) org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:633) scala.Option.foreach(Option.scala:236) org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:633) org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1207) akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) akka.actor.ActorCell.invoke(ActorCell.scala:456) akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) akka.dispatch.Mailbox.run(Mailbox.scala:219) akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)3:45 PM {code} or {code} 14/06/24 16:38:09 ERROR TaskResultGetter: Exception while getting task result java.lang.UnsupportedOperationException at java.util.AbstractCollection.add(AbstractCollection.java:260) at com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:109) at com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:18) at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729) at carbonite.serializer$mk_collection_reader$fn__50.invoke(serializer.clj:57) at clojure.lang.Var.invoke(Var.java:383) at carbonite.ClojureVecSerializer.read(ClojureVecSerializer.java:17) at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729) at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.read(DefaultArraySerializers.java:338) at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.read(DefaultArraySerializers.java:293) at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729) at org.apache.spark.serializer.KryoSerializerInstance.deserialize(KryoSerializer.scala:144) at org.apache.spark.scheduler.DirectTaskResult.value(TaskResult.scala:79) at org.apache.spark.scheduler.TaskSetManager.handleSuccessfulTask(TaskSetManager.scala:480) at 
org.apache.spark.scheduler.TaskSchedulerImpl.handleSuccessfulTask(TaskSchedulerImpl.scala:316) at org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply$mcV$sp(TaskResultGetter.scala:68) at org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:47) at org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:47) at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1213) at org.apache.spark.scheduler.TaskResultGetter$$anon$2.run(TaskResultGetter.scala:46) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:744) {code} Thanks [~sorenmacbeth] for reporting this. -- This message was sent by Atlassian JIRA (v6.2#6252)
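One possible application-side workaround (the proper fix belongs in Spark's Kryo setup, not in user code): copy the `asJavaIterable` wrapper into a concrete java.util.ArrayList before it is serialized, so Kryo's CollectionSerializer deserializes into a collection that supports add().
{code}
import scala.collection.JavaConverters._

// Copy a wrapper Iterable (e.g. the result of asJavaIterable) into a plain ArrayList.
def materialize[T](it: java.lang.Iterable[T]): java.util.ArrayList[T] = {
  val list = new java.util.ArrayList[T]()
  it.asScala.foreach(x => list.add(x))
  list
}
{code}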
[jira] [Resolved] (SPARK-2248) spark.default.parallelism does not apply in local mode
[ https://issues.apache.org/jira/browse/SPARK-2248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-2248. Resolution: Fixed Fix Version/s: 1.1.0 Issue resolved by pull request 1194 [https://github.com/apache/spark/pull/1194] spark.default.parallelism does not apply in local mode -- Key: SPARK-2248 URL: https://issues.apache.org/jira/browse/SPARK-2248 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Matei Zaharia Assignee: Guoqiang Li Priority: Trivial Labels: Starter Fix For: 1.1.0 LocalBackend.defaultParallelism ignores the spark.default.parallelism property, unlike the other SchedulerBackends. We should make it take this in for consistency. -- This message was sent by Atlassian JIRA (v6.2#6252)
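For reference, the setting involved; after the fix, local mode should honor it the same way the other backends do. A minimal check, with arbitrary values:
{code}
import org.apache.spark.{SparkConf, SparkContext}

// Arbitrary values, just to show where the property is set.
val conf = new SparkConf()
  .setMaster("local[4]")
  .setAppName("default-parallelism-check")
  .set("spark.default.parallelism", "8")
val sc = new SparkContext(conf)
println(sc.defaultParallelism)   // should reflect the property once LocalBackend honors it
{code}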
[jira] [Updated] (SPARK-2271) Use Hive's high performance Decimal128 to replace BigDecimal
[ https://issues.apache.org/jira/browse/SPARK-2271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-2271: --- Description: Hive JIRA: https://issues.apache.org/jira/browse/HIVE-6017 Use Hive's high performance Decimal128 to replace BigDecimal Key: SPARK-2271 URL: https://issues.apache.org/jira/browse/SPARK-2271 Project: Spark Issue Type: Improvement Components: SQL Reporter: Reynold Xin Assignee: Cheng Lian Hive JIRA: https://issues.apache.org/jira/browse/HIVE-6017 -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2271) Use Hive's high performance Decimal128 to replace BigDecimal
[ https://issues.apache.org/jira/browse/SPARK-2271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-2271: --- Assignee: Cheng Lian Use Hive's high performance Decimal128 to replace BigDecimal Key: SPARK-2271 URL: https://issues.apache.org/jira/browse/SPARK-2271 Project: Spark Issue Type: Improvement Components: SQL Reporter: Reynold Xin Assignee: Cheng Lian -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2272) Feature scaling which standardizes the range of independent variables or features of data.
DB Tsai created SPARK-2272: -- Summary: Feature scaling which standardizes the range of independent variables or features of data. Key: SPARK-2272 URL: https://issues.apache.org/jira/browse/SPARK-2272 Project: Spark Issue Type: New Feature Components: MLlib Reporter: DB Tsai Feature scaling is a method used to standardize the range of independent variables or features of data. In data processing, it is also known as data normalization and is generally performed during the data preprocessing step. In this work, a trait called `VectorTransformer` is defined for generic transformation of a vector. It contains two methods: `apply`, which applies the transformation to a vector, and `unapply`, which applies the inverse transformation to a vector. There are three concrete implementations of `VectorTransformer`, and they can all be easily extended with PMML transformation support. 1) `VectorStandardizer` - Standardizes a vector given the mean and variance. Since the standardization densifies the output, the output is always in dense vector format. 2) `VectorRescaler` - Rescales a vector into a target range specified by a tuple of two double values, or by two vectors as the new target minimum and maximum. Since the rescaling subtracts the minimum of each column first, the output will always be a dense vector regardless of the input vector type. 3) `VectorDivider` - Transforms a vector by dividing it by a constant, or by another vector on an element-by-element basis. This transformation preserves the type of the input vector without densifying the result. Utility helper methods are implemented that take an RDD[Vector] as input and return the transformed RDD[Vector] together with the transformer, for dividing, rescaling, normalization, and standardization. -- This message was sent by Atlassian JIRA (v6.2#6252)
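As an illustration of the proposed API, a minimal Scala sketch of the trait and the standardizer (the trait and method names come from the description above; the field names and arithmetic are my own assumptions, not the actual patch):
{code}
import org.apache.spark.mllib.linalg.{Vector, Vectors}

// Proposed trait: apply performs the transformation, unapply its inverse.
trait VectorTransformer extends Serializable {
  def apply(v: Vector): Vector
  def unapply(v: Vector): Vector
}

// Sketch of VectorStandardizer: (x - mean) / stddev per column, always
// returning a dense vector since standardization densifies the output.
class VectorStandardizer(mean: Vector, variance: Vector) extends VectorTransformer {
  private val meanArr = mean.toArray
  private val stdArr = variance.toArray.map(math.sqrt)

  override def apply(v: Vector): Vector = {
    val values = v.toArray.zipWithIndex.map { case (x, i) =>
      if (stdArr(i) != 0.0) (x - meanArr(i)) / stdArr(i) else x - meanArr(i)
    }
    Vectors.dense(values)
  }

  override def unapply(v: Vector): Vector = {
    val values = v.toArray.zipWithIndex.map { case (x, i) =>
      x * stdArr(i) + meanArr(i)
    }
    Vectors.dense(values)
  }
}
{code}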
[jira] [Commented] (SPARK-2251) MLLib Naive Bayes Example SparkException: Can only zip RDDs with same number of elements in each partition
[ https://issues.apache.org/jira/browse/SPARK-2251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14043052#comment-14043052 ] Jun Xie commented on SPARK-2251: Hi Sean, thanks very much for your insight. I am new to Spark, so if you can easily supply a patch, please do; I would really appreciate it. I am digging into it following your suggestion to see what is going on, and familiarizing myself with Spark along the way. MLLib Naive Bayes Example SparkException: Can only zip RDDs with same number of elements in each partition -- Key: SPARK-2251 URL: https://issues.apache.org/jira/browse/SPARK-2251 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.0.0 Environment: OS: Fedora Linux Spark Version: 1.0.0. Git clone from the Spark Repository Reporter: Jun Xie Priority: Minor Labels: Naive-Bayes I followed the exact code from the Naive Bayes example (http://spark.apache.org/docs/latest/mllib-naive-bayes.html) in MLlib. When I executed the final command: val accuracy = 1.0 * predictionAndLabel.filter(x => x._1 == x._2).count() / test.count() it complained "Can only zip RDDs with same number of elements in each partition". I got the following exception: 14/06/23 19:39:23 INFO SparkContext: Starting job: count at console:31 14/06/23 19:39:23 INFO DAGScheduler: Got job 3 (count at console:31) with 2 output partitions (allowLocal=false) 14/06/23 19:39:23 INFO DAGScheduler: Final stage: Stage 4(count at console:31) 14/06/23 19:39:23 INFO DAGScheduler: Parents of final stage: List() 14/06/23 19:39:23 INFO DAGScheduler: Missing parents: List() 14/06/23 19:39:23 INFO DAGScheduler: Submitting Stage 4 (FilteredRDD[14] at filter at console:31), which has no missing parents 14/06/23 19:39:23 INFO DAGScheduler: Submitting 2 missing tasks from Stage 4 (FilteredRDD[14] at filter at console:31) 14/06/23 19:39:23 INFO TaskSchedulerImpl: Adding task set 4.0 with 2 tasks 14/06/23 19:39:23 INFO TaskSetManager: Starting task 4.0:0 as TID 8 on executor localhost: localhost (PROCESS_LOCAL) 14/06/23 19:39:23 INFO TaskSetManager: Serialized task 4.0:0 as 3410 bytes in 0 ms 14/06/23 19:39:23 INFO TaskSetManager: Starting task 4.0:1 as TID 9 on executor localhost: localhost (PROCESS_LOCAL) 14/06/23 19:39:23 INFO TaskSetManager: Serialized task 4.0:1 as 3410 bytes in 1 ms 14/06/23 19:39:23 INFO Executor: Running task ID 8 14/06/23 19:39:23 INFO Executor: Running task ID 9 14/06/23 19:39:23 INFO BlockManager: Found block broadcast_0 locally 14/06/23 19:39:23 INFO BlockManager: Found block broadcast_0 locally 14/06/23 19:39:23 INFO HadoopRDD: Input split: file:/home/jun/open_source/spark/mllib/data/sample_naive_bayes_data.txt:0+24 14/06/23 19:39:23 INFO HadoopRDD: Input split: file:/home/jun/open_source/spark/mllib/data/sample_naive_bayes_data.txt:24+24 14/06/23 19:39:23 INFO HadoopRDD: Input split: file:/home/jun/open_source/spark/mllib/data/sample_naive_bayes_data.txt:0+24 14/06/23 19:39:23 INFO HadoopRDD: Input split: file:/home/jun/open_source/spark/mllib/data/sample_naive_bayes_data.txt:24+24 14/06/23 19:39:23 ERROR Executor: Exception in task ID 9 org.apache.spark.SparkException: Can only zip RDDs with same number of elements in each partition at org.apache.spark.rdd.RDD$$anonfun$zip$1$$anon$1.hasNext(RDD.scala:663) at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:388) at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1067) at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:858) at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:858) at 
org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1079) at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1079) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111) at org.apache.spark.scheduler.Task.run(Task.scala:51) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:724) 14/06/23 19:39:23 ERROR Executor: Exception in task ID 8 org.apache.spark.SparkException: Can only zip RDDs with same number of elements in each partition at org.apache.spark.rdd.RDD$$anonfun$zip$1$$anon$1.hasNext(RDD.scala:663) at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:388) at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1067) at
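For background on the error above: RDD.zip requires both RDDs to have the same number of partitions and the same number of elements in each partition. A zip-free way to build the (prediction, label) pairs is a plain map over the test set (a sketch, assuming a trained NaiveBayesModel named `model` and a test RDD[LabeledPoint] named `test`, as in the linked example):
{code}
// Avoids zip entirely: predict each test point and pair it with its label.
val predictionAndLabel = test.map(p => (model.predict(p.features), p.label))
val accuracy = 1.0 * predictionAndLabel.filter(x => x._1 == x._2).count() / test.count()
{code}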
[jira] [Commented] (SPARK-2268) Utils.createTempDir() creates race with HDFS at shutdown
[ https://issues.apache.org/jira/browse/SPARK-2268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14043071#comment-14043071 ] Mridul Muralidharan commented on SPARK-2268: Setting a priority for shutdown hooks does not have much impact given the state of the VM. Note that this hook is trying to delete local directories - not dfs directories. Utils.createTempDir() creates race with HDFS at shutdown Key: SPARK-2268 URL: https://issues.apache.org/jira/browse/SPARK-2268 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.0 Reporter: Marcelo Vanzin Utils.createTempDir() has this code: {code} // Add a shutdown hook to delete the temp dir when the JVM exits Runtime.getRuntime.addShutdownHook(new Thread("delete Spark temp dir " + dir) { override def run() { // Attempt to delete if some patch which is parent of this is not already registered. if (! hasRootAsShutdownDeleteDir(dir)) Utils.deleteRecursively(dir) } }) {code} This creates a race with the shutdown hooks registered by HDFS, since the order of execution is undefined; if the HDFS hooks run first, you'll get exceptions about the file system being closed. Instead, this should use Hadoop's ShutdownHookManager with a proper priority, so that it runs before the HDFS hooks. -- This message was sent by Atlassian JIRA (v6.2#6252)
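A minimal sketch of the proposal in the description (my own illustration, not the eventual patch; `dir`, `hasRootAsShutdownDeleteDir`, and `Utils.deleteRecursively` are the names from the quoted code): register the cleanup through Hadoop's ShutdownHookManager with a priority above the FileSystem hook so it runs first.
{code}
import org.apache.hadoop.fs.FileSystem
import org.apache.hadoop.util.ShutdownHookManager

// Hooks with higher priority run earlier, so this executes before the HDFS
// client shutdown hooks, which register at FileSystem.SHUTDOWN_HOOK_PRIORITY.
ShutdownHookManager.get().addShutdownHook(new Runnable {
  override def run(): Unit = {
    if (!hasRootAsShutdownDeleteDir(dir)) Utils.deleteRecursively(dir)
  }
}, FileSystem.SHUTDOWN_HOOK_PRIORITY + 1)
{code}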
[jira] [Commented] (SPARK-2265) MIMA excludes aren't generated for external modules
[ https://issues.apache.org/jira/browse/SPARK-2265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14043076#comment-14043076 ] Prashant Sharma commented on SPARK-2265: Those jars are already on its classpath. Take a look at https://github.com/apache/spark/blob/master/project/SparkBuild.scala#L635. And I have also observed in the logs that it fails to instrument some of them: {noformat} Unable to load:org/apache/spark/streaming/flume/FlumeEventServer.class Unable to load:org/apache/spark/streaming/mqtt/MQTTReceiver$$anon$1.class Unable to load:org/apache/spark/streaming/twitter/TwitterReceiver$$anon$1.class {noformat} MIMA excludes aren't generated for external modules --- Key: SPARK-2265 URL: https://issues.apache.org/jira/browse/SPARK-2265 Project: Spark Issue Type: Sub-task Components: Project Infra Reporter: Patrick Wendell Assignee: Prashant Sharma Fix For: 1.1.0 The Flume, etc. code doesn't get considered when generating the MIMA excludes. I think it could work to just set SPARK_CLASSPATH with the package jars for those modules before running GenerateMimaIgnore so they end up visible to the class loader. -- This message was sent by Atlassian JIRA (v6.2#6252)