[jira] [Resolved] (SPARK-958) When iteration in ALS increases to 10 running in local mode, spark throws out error of StackOverflowError

2014-04-01 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-958.
-

Resolution: Duplicate

 When iteration in ALS increases to 10 running in local mode, spark throws out 
 error of StackOverflowError
 -

 Key: SPARK-958
 URL: https://issues.apache.org/jira/browse/SPARK-958
 Project: Spark
  Issue Type: Bug
Reporter: Qiuzhuang Lian

 I tried to use the ml-100k data to test ALS running in local mode in the mllib project. 
 If I specify fewer than 10 iterations, it works well. However, when the iteration 
 count is increased to more than 10, Spark throws a StackOverflowError.
 Attached is the log file.
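A minimal reproduction sketch of what is described above, assuming the standard 
ml-100k u.data layout and path; the rank, lambda, and iteration values are 
illustrative, not taken from the attached log:

{code}
import org.apache.spark.SparkContext
import org.apache.spark.mllib.recommendation.{ALS, Rating}

object ALSStackOverflowRepro {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local", "ALSStackOverflowRepro")
    // ml-100k's u.data is tab-separated: userId, itemId, rating, timestamp
    val ratings = sc.textFile("ml-100k/u.data").map { line =>
      val fields = line.split("\t")
      Rating(fields(0).toInt, fields(1).toInt, fields(2).toDouble)
    }
    // With a small iteration count this runs fine; once the count grows past
    // about 10, the lineage built up across iterations can overflow the stack
    // during task serialization in local mode.
    val model = ALS.train(ratings, 10, 15, 0.01)
    println(model.userFeatures.count())
    sc.stop()
  }
}
{code}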



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-1380) Add sort-merge based cogroup/joins.

2014-04-01 Thread Takuya Ueshin (JIRA)
Takuya Ueshin created SPARK-1380:


 Summary: Add sort-merge based cogroup/joins.
 Key: SPARK-1380
 URL: https://issues.apache.org/jira/browse/SPARK-1380
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Takuya Ueshin


I've written cogroup/joins based on the 'Sort-Merge' algorithm.
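For readers unfamiliar with the technique, a hedged, minimal illustration of the 
sort-merge join idea on two key-sorted in-memory sequences; the actual 
implementation in the pull request operates on partitioned RDDs and is not 
reproduced here:

{code}
object SortMergeJoinSketch {
  // Both inputs must already be sorted by key.
  def sortMergeJoin[K: Ordering, V, W](
      left: Seq[(K, V)], right: Seq[(K, W)]): List[(K, (V, W))] = {
    val ord = implicitly[Ordering[K]]
    val out = scala.collection.mutable.ArrayBuffer.empty[(K, (V, W))]
    var i = 0
    var j = 0
    while (i < left.length && j < right.length) {
      val c = ord.compare(left(i)._1, right(j)._1)
      if (c < 0) i += 1        // left key is smaller: advance left
      else if (c > 0) j += 1   // right key is smaller: advance right
      else {
        // Same key on both sides: emit the cross product of the two runs.
        val key = left(i)._1
        val lEnd = left.indexWhere(_._1 != key, i) match { case -1 => left.length; case n => n }
        val rEnd = right.indexWhere(_._1 != key, j) match { case -1 => right.length; case n => n }
        for (a <- i until lEnd; b <- j until rEnd)
          out += ((key, (left(a)._2, right(b)._2)))
        i = lEnd
        j = rEnd
      }
    }
    out.toList
  }

  def main(args: Array[String]): Unit = {
    val left  = Seq(1 -> "a", 2 -> "b", 2 -> "c", 4 -> "d")
    val right = Seq(2 -> "x", 3 -> "y", 4 -> "z")
    println(sortMergeJoin(left, right))  // List((2,(b,x)), (2,(c,x)), (4,(d,z)))
  }
}
{code}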



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1380) Add sort-merge based cogroup/joins.

2014-04-01 Thread Takuya Ueshin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13956313#comment-13956313
 ] 

Takuya Ueshin commented on SPARK-1380:
--

Pull-requested: https://github.com/apache/spark/pull/283

 Add sort-merge based cogroup/joins.
 ---

 Key: SPARK-1380
 URL: https://issues.apache.org/jira/browse/SPARK-1380
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Takuya Ueshin

 I've written cogroup/joins based on the 'Sort-Merge' algorithm.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1355) Switch website to the Apache CMS

2014-04-01 Thread Joe Schaefer (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1355?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13956607#comment-13956607
 ] 

Joe Schaefer commented on SPARK-1355:
-

It just looks funny that a cutting-edge project like Spark should rely on a 
vanilla cookie-cutter blog-site generator like Jekyll to manage its website 
assets.  Go for broke and grasp the brass ring: bring your website technology 
to new levels with the Apache CMS!

 Switch website to the Apache CMS
 

 Key: SPARK-1355
 URL: https://issues.apache.org/jira/browse/SPARK-1355
 Project: Spark
  Issue Type: Improvement
  Components: Project Infra
Reporter: Joe Schaefer

 Jekyll is ancient history useful for small blogger sites and little else.  
 Why not upgrade to the Apache CMS?  It supports the same on-disk format for 
 .md files and interfaces with pygments for code highlighting.  Thrift 
 recently switched from nanoc to the CMS and loves it!



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1355) Switch website to the Apache CMS

2014-04-01 Thread Mark Hamstra (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1355?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13956642#comment-13956642
 ] 

Mark Hamstra commented on SPARK-1355:
-

Resources are limited as we progress toward our 1.0 release.  I can't see 
reallocating those commitments just to avoid looking funny in the estimation of 
some observers.  If someone not otherwise occupied wants to contribute the work 
to convert to Apache CMS, that's another thing.

 Switch website to the Apache CMS
 

 Key: SPARK-1355
 URL: https://issues.apache.org/jira/browse/SPARK-1355
 Project: Spark
  Issue Type: Improvement
  Components: Project Infra
Reporter: Joe Schaefer

 Jekyll is ancient history useful for small blogger sites and little else.  
 Why not upgrade to the Apache CMS?  It supports the same on-disk format for 
 .md files and interfaces with pygments for code highlighting.  Thrift 
 recently switched from nanoc to the CMS and loves it!



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1355) Switch website to the Apache CMS

2014-04-01 Thread Joe Schaefer (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1355?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13956683#comment-13956683
 ] 

Joe Schaefer commented on SPARK-1355:
-

Nonsense: you have plenty of time; you just lack the appropriate prioritization for 
this task, which should be marked Critical, as we are trying to help you help 
yourselves.  Do yourselves a solid and get it done this week to avoid further 
embarrassment, mkay?

 Switch website to the Apache CMS
 

 Key: SPARK-1355
 URL: https://issues.apache.org/jira/browse/SPARK-1355
 Project: Spark
  Issue Type: Improvement
  Components: Project Infra
Reporter: Joe Schaefer

 Jekyll is ancient history useful for small blogger sites and little else.  
 Why not upgrade to the Apache CMS?  It supports the same on-disk format for 
 .md files and interfaces with pygments for code highlighting.  Thrift 
 recently switched from nanoc to the CMS and loves it!



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1364) DataTypes missing from ScalaReflection

2014-04-01 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-1364:


Priority: Blocker  (was: Major)

 DataTypes missing from ScalaReflection
 --

 Key: SPARK-1364
 URL: https://issues.apache.org/jira/browse/SPARK-1364
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Michael Armbrust
Priority: Blocker
 Fix For: 1.0.0


 BigDecimal, possibly others.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1371) HashAggregate should stream tuples and avoid doing an extra count

2014-04-01 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-1371:


Priority: Blocker  (was: Major)

 HashAggregate should stream tuples and avoid doing an extra count
 -

 Key: SPARK-1371
 URL: https://issues.apache.org/jira/browse/SPARK-1371
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Michael Armbrust
Assignee: Michael Armbrust
Priority: Blocker
 Fix For: 1.0.0






--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1367) NPE when joining Parquet Relations

2014-04-01 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13956781#comment-13956781
 ] 

Michael Armbrust commented on SPARK-1367:
-

No, in that commit there is a TODO, as the test case still NPEs.  We still need 
to remove the @transient from ParquetTableScan.  If you don't have time to do 
this, I can.

 NPE when joining Parquet Relations
 --

 Key: SPARK-1367
 URL: https://issues.apache.org/jira/browse/SPARK-1367
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Michael Armbrust
Assignee: Andre Schumacher
Priority: Blocker
 Fix For: 1.0.0


 {code}
 test("self-join parquet files") {
   val x = ParquetTestData.testData.subquery('x)
   val y = ParquetTestData.testData.newInstance.subquery('y)
   val query = x.join(y).where("x.myint".attr === "y.myint".attr)
   query.collect()
 }
 {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1384) spark-shell on yarn on spark 0.9 branch doesn't always work with secure hdfs

2014-04-01 Thread Thomas Graves (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1384?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves updated SPARK-1384:
-

Description: 
 I've found an issue with the spark-shell in yarn-client mode in the 0.9.1 rc3 
release.  It doesn't work with secure HDFS unless you 
export SPARK_YARN_MODE=true before starting the shell, or unless you happen to do 
something with HDFS immediately. If you wait for the connection to the 
namenode to time out, it will fail. 
 
The fix actually went into the master branch with the authentication changes I 
made, but I never realized that change needed to apply to 0.9. 

https://github.com/apache/spark/commit/7edbea41b43e0dc11a2de156be220db8b7952d01#diff-0ae5b834ce90ec37c19af35aa7a5e1a0
See the SparkILoop diff.


  was:
 I've found an issue with the spark-shell in yarn-client mode in the 0.9.1 rc3 
release.  It doesn't work with secure HDFS unless you 
export SPARK_YARN_MODE=true before starting the shell, or unless you happen to do 
something with HDFS immediately. If you wait for the connection to the 
namenode to time out, it will fail. 

I think it was actually this way in the 0.9 release also, so I thought I would 
send this and get people's feedback to see whether you want it fixed. 

Another option would be to document that you have to export 
SPARK_YARN_MODE=true for the shell.   The fix actually went in with the 
authentication changes I made in master, but I never realized that change needed 
to apply to 0.9. 

https://github.com/apache/spark/commit/7edbea41b43e0dc11a2de156be220db8b7952d01#diff-0ae5b834ce90ec37c19af35aa7a5e1a0
See the SparkILoop diff.



 spark-shell on yarn on spark 0.9 branch doesn't always work with secure hdfs
 

 Key: SPARK-1384
 URL: https://issues.apache.org/jira/browse/SPARK-1384
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 0.9.0, 0.9.1
Reporter: Thomas Graves

  I've found an issue with the spark-shell in yarn-client mode in the 0.9.1 
 rc3 release.  It doesn't work with secure HDFS unless you 
 export SPARK_YARN_MODE=true before starting the shell, or unless you happen to do 
 something with HDFS immediately. If you wait for the connection to the 
 namenode to time out, it will fail. 
  
 The fix actually went into the master branch with the authentication changes I 
 made, but I never realized that change needed to apply to 0.9. 
 https://github.com/apache/spark/commit/7edbea41b43e0dc11a2de156be220db8b7952d01#diff-0ae5b834ce90ec37c19af35aa7a5e1a0
 See the SparkILoop diff.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-1385) Use existing code-path for JSON de/serialization of BlockId

2014-04-01 Thread Andrew Or (JIRA)
Andrew Or created SPARK-1385:


 Summary: Use existing code-path for JSON de/serialization of 
BlockId
 Key: SPARK-1385
 URL: https://issues.apache.org/jira/browse/SPARK-1385
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 0.9.0, 0.9.1
Reporter: Andrew Or
Priority: Minor
 Fix For: 1.0.0


BlockId.scala already takes care of JSON de/serialization by converting BlockIds 
to and from their string form with a regex. This functionality is currently 
duplicated in util/JsonProtocol.scala.
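A hedged sketch of the round trip the description refers to; the companion-object 
apply and the name field are my reading of BlockId.scala and should be treated as 
assumptions:

{code}
import org.apache.spark.storage.{BlockId, RDDBlockId}

object BlockIdRoundTrip {
  def main(args: Array[String]): Unit = {
    val id: BlockId = RDDBlockId(1, 2)
    val asString = id.name           // "rdd_1_2": the form a JSON field can carry
    val parsed   = BlockId(asString) // regex-based parse back into an RDDBlockId
    assert(parsed == id)
    println(s"$asString -> $parsed")
  }
}
{code}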



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1386) Spark Streaming UI

2014-04-01 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das updated SPARK-1386:
-

Priority: Blocker  (was: Major)

 Spark Streaming UI
 --

 Key: SPARK-1386
 URL: https://issues.apache.org/jira/browse/SPARK-1386
 Project: Spark
  Issue Type: New Feature
  Components: Streaming
Reporter: Tathagata Das
Priority: Blocker

 When debugging Spark Streaming applications it is necessary to monitor 
 certain metrics that are not shown in the Spark application UI. For example, 
 what is the average processing time of batches? What is the scheduling delay? Is 
 the system able to process as fast as it is receiving data? How many records 
 am I receiving through my receivers? 
 While the StreamingListener interface introduced in 0.9 provides some of 
 this information, it can only be accessed programmatically. A UI that shows 
 information specific to streaming applications is necessary for easier 
 debugging.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-1386) Spark Streaming UI

2014-04-01 Thread Tathagata Das (JIRA)
Tathagata Das created SPARK-1386:


 Summary: Spark Streaming UI
 Key: SPARK-1386
 URL: https://issues.apache.org/jira/browse/SPARK-1386
 Project: Spark
  Issue Type: New Feature
  Components: Streaming
Reporter: Tathagata Das


When debugging Spark Streaming applications it is necessary to monitor certain 
metrics that are not shown in the Spark application UI. For example, what is the 
average processing time of batches? What is the scheduling delay? Is the system 
able to process as fast as it is receiving data? How many records am I 
receiving through my receivers? 

While the StreamingListener interface introduced in 0.9 provides some of 
this information, it can only be accessed programmatically. A UI that shows 
information specific to streaming applications is necessary for easier 
debugging.
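For reference, a hedged sketch of the programmatic access mentioned above: a 
StreamingListener that prints per-batch delays. The API names reflect my reading 
of the 0.9 scheduler package and should be treated as assumptions:

{code}
import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}

class BatchTimingListener extends StreamingListener {
  override def onBatchCompleted(batchCompleted: StreamingListenerBatchCompleted): Unit = {
    val info = batchCompleted.batchInfo
    println(s"batch ${info.batchTime}: " +
      s"scheduling delay = ${info.schedulingDelay.getOrElse(-1L)} ms, " +
      s"processing time = ${info.processingDelay.getOrElse(-1L)} ms")
  }
}

// Usage, given an existing StreamingContext `ssc`:
//   ssc.addStreamingListener(new BatchTimingListener)
{code}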



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1386) Spark Streaming UI

2014-04-01 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das updated SPARK-1386:
-

Affects Version/s: 0.9.0

 Spark Streaming UI
 --

 Key: SPARK-1386
 URL: https://issues.apache.org/jira/browse/SPARK-1386
 Project: Spark
  Issue Type: New Feature
  Components: Streaming
Affects Versions: 0.9.0
Reporter: Tathagata Das
Priority: Blocker

 When debugging Spark Streaming applications it is necessary to monitor 
 certain metrics that are not shown in the Spark application UI. For example, 
 what is the average processing time of batches? What is the scheduling delay? Is 
 the system able to process as fast as it is receiving data? How many records 
 am I receiving through my receivers? 
 While the StreamingListener interface introduced in 0.9 provides some of 
 this information, it can only be accessed programmatically. A UI that shows 
 information specific to streaming applications is necessary for easier 
 debugging.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1332) Improve Spark Streaming's Network Receiver and InputDStream API for future stability

2014-04-01 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das updated SPARK-1332:
-

Priority: Blocker  (was: Critical)

 Improve Spark Streaming's Network Receiver and InputDStream API for future 
 stability
 

 Key: SPARK-1332
 URL: https://issues.apache.org/jira/browse/SPARK-1332
 Project: Spark
  Issue Type: Improvement
  Components: Streaming
Affects Versions: 0.9.0
Reporter: Tathagata Das
Assignee: Tathagata Das
Priority: Blocker

 The current Network Receiver API makes it slightly complicated to write a new 
 receiver, as one needs to create an instance of BlockGenerator as shown in 
 SocketReceiver 
 https://github.com/apache/spark/blob/master/streaming/src/main/scala/org/apache/spark/streaming/dstream/SocketInputDStream.scala#L51
 Exposing the BlockGenerator interface has made it harder to improve the 
 receiving process. The API of NetworkReceiver (which was not a very stable 
 API anyway) needs to be changed if we are to ensure future stability. 
 Additionally, functions like streamingContext.socketStream that create 
 input streams return DStream objects. That makes it hard to expose 
 functionality (say, rate limits) unique to input dstreams. They should return 
 InputDStream or NetworkInputDStream.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-1387) Update build plugins, avoid plugin version warning, centralize versions

2014-04-01 Thread Sean Owen (JIRA)
Sean Owen created SPARK-1387:


 Summary: Update build plugins,  avoid plugin version warning, 
centralize versions
 Key: SPARK-1387
 URL: https://issues.apache.org/jira/browse/SPARK-1387
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 0.9.0
Reporter: Sean Owen
Priority: Minor


Another handful of small build changes to organize and standardize a bit, and 
avoid warnings:

- Update Maven plugin versions for good measure
- Since plugins already need Maven 3.0.4, require it explicitly (versions before 
3.0.4 had some bugs anyway)
- Use variables to define versions across dependencies where they should move 
in lock step
- ... and make this consistent between Maven/SBT
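A hedged SBT-side sketch of the "use variables to define versions" point; the 
group/artifact names and version below are illustrative assumptions, not Spark's 
actual build definition:

{code}
import sbt._

// Keep the version in one val and reference it from every dependency that
// must move in lock step; bumping akkaVersion updates all three modules.
object Dependencies {
  val akkaVersion = "2.2.3"

  val akkaActor  = "org.spark-project.akka" %% "akka-actor"  % akkaVersion
  val akkaRemote = "org.spark-project.akka" %% "akka-remote" % akkaVersion
  val akkaSlf4j  = "org.spark-project.akka" %% "akka-slf4j"  % akkaVersion
}
{code}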




--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Assigned] (SPARK-1367) NPE when joining Parquet Relations

2014-04-01 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust reassigned SPARK-1367:
---

Assignee: Michael Armbrust  (was: Andre Schumacher)

 NPE when joining Parquet Relations
 --

 Key: SPARK-1367
 URL: https://issues.apache.org/jira/browse/SPARK-1367
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Michael Armbrust
Assignee: Michael Armbrust
Priority: Blocker
 Fix For: 1.0.0


 {code}
 test("self-join parquet files") {
   val x = ParquetTestData.testData.subquery('x)
   val y = ParquetTestData.testData.newInstance.subquery('y)
   val query = x.join(y).where("x.myint".attr === "y.myint".attr)
   query.collect()
 }
 {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1355) Switch website to the Apache CMS

2014-04-01 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1355?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13957113#comment-13957113
 ] 

Sean Owen commented on SPARK-1355:
--

April Fools, apparently. Though this was opened on 30 March? 

 Switch website to the Apache CMS
 

 Key: SPARK-1355
 URL: https://issues.apache.org/jira/browse/SPARK-1355
 Project: Spark
  Issue Type: Improvement
  Components: Project Infra
Reporter: Joe Schaefer

 Jekyll is ancient history useful for small blogger sites and little else.  
 Why not upgrade to the Apache CMS?  It supports the same on-disk format for 
 .md files and interfaces with pygments for code highlighting.  Thrift 
 recently switched from nanoc to the CMS and loves it!



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1378) Build error: org.eclipse.paho:mqtt-client

2014-04-01 Thread Ken Williams (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13957124#comment-13957124
 ] 

Ken Williams commented on SPARK-1378:
-

What resolved it on our end was either to run without a local Maven repo (moving 
the {{~/.m2/settings.xml}} out of the way) or to add the mqtt-repo 
(https://repo.eclipse.org/content/repositories/paho-releases) to our set of 
mirrors.

 Build error: org.eclipse.paho:mqtt-client
 -

 Key: SPARK-1378
 URL: https://issues.apache.org/jira/browse/SPARK-1378
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 0.9.0
Reporter: Ken Williams

 Using Maven, I'm unable to build the 0.9.0 distribution I just downloaded. I 
 attempt it like so:
 {code}
 mvn -U -Pyarn -Dhadoop.version=2.2.0 -Dyarn.version=2.2.0 -DskipTests package
 {code}
 The Maven error is:
 {code}
 [ERROR] Failed to execute goal on project spark-examples_2.10: Could not 
 resolve dependencies for project 
 org.apache.spark:spark-examples_2.10:jar:0.9.0-incubating: Could not find 
 artifact org.eclipse.paho:mqtt-client:jar:0.4.0 in nexus
 {code}
 My Maven version is 3.2.1, running on Java 1.7.0, using Scala 2.10.4.
 Is there an additional Maven repository I should add or something?
 If I go into the {{pom.xml}} and comment out the {{external/mqtt}} and 
 {{examples}} modules, the build succeeds. I'm fine without the MQTT stuff, 
 but I would really like to get the examples working because I haven't played 
 with Spark before.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Closed] (SPARK-1355) Switch website to the Apache CMS

2014-04-01 Thread Joe Schaefer (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joe Schaefer closed SPARK-1355.
---

Resolution: Invalid

 Switch website to the Apache CMS
 

 Key: SPARK-1355
 URL: https://issues.apache.org/jira/browse/SPARK-1355
 Project: Spark
  Issue Type: Improvement
  Components: Project Infra
Reporter: Joe Schaefer

 Jekyll is ancient history useful for small blogger sites and little else.  
 Why not upgrade to the Apache CMS?  It supports the same on-disk format for 
 .md files and interfaces with pygments for code highlighting.  Thrift 
 recently switched from nanoc to the CMS and loves it!



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1388) ConcurrentModificationException in hadoop_common exposed by Spark

2014-04-01 Thread Nishkam Ravi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1388?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nishkam Ravi updated SPARK-1388:


Attachment: (was: Conf_Spark.patch)

 ConcurrentModificationException in hadoop_common exposed by Spark
 -

 Key: SPARK-1388
 URL: https://issues.apache.org/jira/browse/SPARK-1388
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 0.9.0
Reporter: Nishkam Ravi
 Attachments: nravi_Conf_Spark-1388.patch


 The following exception occurs non-deterministically:
 java.util.ConcurrentModificationException
 at java.util.HashMap$HashIterator.nextEntry(HashMap.java:926)
 at java.util.HashMap$KeyIterator.next(HashMap.java:960)
 at java.util.AbstractCollection.addAll(AbstractCollection.java:341)
 at java.util.HashSet.&lt;init&gt;(HashSet.java:117)
 at org.apache.hadoop.conf.Configuration.&lt;init&gt;(Configuration.java:671)
 at org.apache.hadoop.mapred.JobConf.&lt;init&gt;(JobConf.java:439)
 at org.apache.spark.rdd.HadoopRDD.getJobConf(HadoopRDD.scala:110)
 at org.apache.spark.rdd.HadoopRDD$$anon$1.&lt;init&gt;(HadoopRDD.scala:154)
 at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:149)
 at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:64)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
 at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
 at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
 at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
 at 
 org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:34)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
 at 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:161)
 at 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:102)
 at org.apache.spark.scheduler.Task.run(Task.scala:53)
 at 
 org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:213)
 at 
 org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:42)
 at 
 org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:41)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:415)
 at 
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
 at 
 org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:41)
 at 
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178)
 at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:744)
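For context, a hedged sketch of the usual mitigation for this kind of race: 
serialize construction of JobConf copies from a shared Configuration, since 
Hadoop's copy constructor iterates the source object's internal collections. The 
lock object and helper below are illustrative, not Spark's actual fix:

{code}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.mapred.JobConf

object ConfCloning {
  // One JVM-wide lock guarding Configuration -> JobConf copies.
  private val CONF_LOCK = new Object

  def cloneToJobConf(shared: Configuration): JobConf =
    CONF_LOCK.synchronized {
      new JobConf(shared)
    }
}
{code}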



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1097) ConcurrentModificationException

2014-04-01 Thread Nishkam Ravi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13957233#comment-13957233
 ] 

Nishkam Ravi commented on SPARK-1097:
-

Attached is a patch for this issue. Verified with mvn test/compile/install. 

 ConcurrentModificationException
 ---

 Key: SPARK-1097
 URL: https://issues.apache.org/jira/browse/SPARK-1097
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 0.9.0
Reporter: Fabrizio Milo
 Attachments: nravi_Conf_Spark-1388.patch


 {noformat}
 14/02/16 08:18:45 WARN TaskSetManager: Loss was due to 
 java.util.ConcurrentModificationException
 java.util.ConcurrentModificationException
   at java.util.HashMap$HashIterator.nextEntry(HashMap.java:926)
   at java.util.HashMap$KeyIterator.next(HashMap.java:960)
   at java.util.AbstractCollection.addAll(AbstractCollection.java:341)
   at java.util.HashSet.&lt;init&gt;(HashSet.java:117)
   at org.apache.hadoop.conf.Configuration.&lt;init&gt;(Configuration.java:554)
   at org.apache.hadoop.mapred.JobConf.&lt;init&gt;(JobConf.java:439)
   at org.apache.spark.rdd.HadoopRDD.getJobConf(HadoopRDD.scala:110)
   at org.apache.spark.rdd.HadoopRDD$$anon$1.&lt;init&gt;(HadoopRDD.scala:154)
   at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:149)
   at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:64)
   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
   at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
   at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
   at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
   at org.apache.spark.rdd.UnionPartition.iterator(UnionRDD.scala:32)
   at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:72)
   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
   at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
   at org.apache.spark.rdd.FlatMappedRDD.compute(FlatMappedRDD.scala:33)
   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
   at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
   at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
   at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
   at 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:161)
   at 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:102)
   at org.apache.spark.scheduler.Task.run(Task.scala:53)
   at 
 org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:213)
   at 
 org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:49)
   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178)
   at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
   at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
   at java.lang.Thread.run(Thread.java:744)
 {noformat}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1271) Use Iterator[X] in co-group and group-by signatures

2014-04-01 Thread holdenk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13957293#comment-13957293
 ] 

holdenk commented on SPARK-1271:


https://github.com/apache/spark/pull/242

 Use Iterator[X] in co-group and group-by signatures
 ---

 Key: SPARK-1271
 URL: https://issues.apache.org/jira/browse/SPARK-1271
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Patrick Wendell
Priority: Blocker
 Fix For: 1.0.0


 This API change will allow us to externalize these things down the road.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-939) Allow user jars to take precedence over Spark jars, if desired

2014-04-01 Thread holdenk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13957294#comment-13957294
 ] 

holdenk commented on SPARK-939:
---

https://github.com/apache/spark/pull/217

 Allow user jars to take precedence over Spark jars, if desired
 --

 Key: SPARK-939
 URL: https://issues.apache.org/jira/browse/SPARK-939
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Reporter: Patrick Wendell
Assignee: holdenk
Priority: Blocker
  Labels: starter
 Fix For: 1.0.0


 Sometimes a user may want to include their own version of a jar that Spark 
 itself uses. For example, their code may require a newer version of that jar 
 than Spark offers. It would be good to have an option to give the user's 
 dependencies precedence over Spark's. This option should be disabled by 
 default, since it could lead to some odd behavior (e.g. parts of Spark not 
 working). But I think we should have it.
 From an implementation perspective, this would require modifying the way we 
 do class loading inside of an Executor. The default behavior of the 
 URLClassLoader is to delegate to its parent first and, if that fails, to 
 find a class locally. We want to have the opposite behavior. This is 
 sometimes referred to as parent-last (as opposed to parent-first) class 
 loading precedence. There is an example of how to do this here:
 http://stackoverflow.com/questions/5445511/how-do-i-create-a-parent-last-child-first-classloader-in-java-or-how-to-overr
 We should write a similar class which can encapsulate a URL classloader and 
 change the delegation order. Or if possible, maybe we could find a more 
 elegant way to do this. See relevant discussion on the user list here:
 https://groups.google.com/forum/#!topic/spark-users/b278DW3e38g
 Also see the corresponding option in Hadoop:
 https://issues.apache.org/jira/browse/MAPREDUCE-4521
 Some other relevant Hadoop JIRAs:
 https://issues.apache.org/jira/browse/MAPREDUCE-1700
 https://issues.apache.org/jira/browse/MAPREDUCE-1938
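As a rough illustration of the parent-last idea described above (not a proposed 
patch; the class name and structure are assumptions), a child-first 
URLClassLoader might look like:

{code}
import java.net.{URL, URLClassLoader}

class ChildFirstURLClassLoader(urls: Array[URL], parent: ClassLoader)
    extends URLClassLoader(urls, parent) {

  override def loadClass(name: String, resolve: Boolean): Class[_] = {
    // Reuse a class that this loader has already defined.
    var c: Class[_] = findLoadedClass(name)
    if (c == null) {
      try {
        // Child-first: look in the user-supplied URLs before asking the parent.
        c = findClass(name)
      } catch {
        case _: ClassNotFoundException =>
          // Fall back to the normal parent-first delegation.
          c = super.loadClass(name, resolve)
      }
    }
    if (resolve) resolveClass(c)
    c
  }
}
{code}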



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1388) ConcurrentModificationException in hadoop_common exposed by Spark

2014-04-01 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13957330#comment-13957330
 ] 

Sean Owen commented on SPARK-1388:
--

Yes this should be resolved as a duplicate instead.

 ConcurrentModificationException in hadoop_common exposed by Spark
 -

 Key: SPARK-1388
 URL: https://issues.apache.org/jira/browse/SPARK-1388
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 0.9.0
Reporter: Nishkam Ravi
 Attachments: nravi_Conf_Spark-1388.patch


 The following exception occurs non-deterministically:
 java.util.ConcurrentModificationException
 at java.util.HashMap$HashIterator.nextEntry(HashMap.java:926)
 at java.util.HashMap$KeyIterator.next(HashMap.java:960)
 at java.util.AbstractCollection.addAll(AbstractCollection.java:341)
 at java.util.HashSet.&lt;init&gt;(HashSet.java:117)
 at org.apache.hadoop.conf.Configuration.&lt;init&gt;(Configuration.java:671)
 at org.apache.hadoop.mapred.JobConf.&lt;init&gt;(JobConf.java:439)
 at org.apache.spark.rdd.HadoopRDD.getJobConf(HadoopRDD.scala:110)
 at org.apache.spark.rdd.HadoopRDD$$anon$1.&lt;init&gt;(HadoopRDD.scala:154)
 at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:149)
 at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:64)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
 at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
 at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
 at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
 at 
 org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:34)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
 at 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:161)
 at 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:102)
 at org.apache.spark.scheduler.Task.run(Task.scala:53)
 at 
 org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:213)
 at 
 org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:42)
 at 
 org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:41)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:415)
 at 
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
 at 
 org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:41)
 at 
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178)
 at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:744)



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1371) HashAggregate should stream tuples and avoid doing an extra count

2014-04-01 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13957365#comment-13957365
 ] 

Michael Armbrust commented on SPARK-1371:
-

https://github.com/apache/spark/pull/295

 HashAggregate should stream tuples and avoid doing an extra count
 -

 Key: SPARK-1371
 URL: https://issues.apache.org/jira/browse/SPARK-1371
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Michael Armbrust
Assignee: Michael Armbrust
Priority: Blocker
 Fix For: 1.0.0






--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (SPARK-1372) Expose in-memory columnar caching for tables.

2014-04-01 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-1372.
-

Resolution: Fixed

 Expose in-memory columnar caching for tables.
 -

 Key: SPARK-1372
 URL: https://issues.apache.org/jira/browse/SPARK-1372
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Michael Armbrust
Assignee: Michael Armbrust
Priority: Blocker
 Fix For: 1.0.0






--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1364) DataTypes missing from ScalaReflection

2014-04-01 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13957366#comment-13957366
 ] 

Michael Armbrust commented on SPARK-1364:
-

https://github.com/apache/spark/pull/293

 DataTypes missing from ScalaReflection
 --

 Key: SPARK-1364
 URL: https://issues.apache.org/jira/browse/SPARK-1364
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Michael Armbrust
Assignee: Michael Armbrust
Priority: Blocker
 Fix For: 1.0.0


 BigDecimal, possibly others.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1392) Local spark-shell Runs Out of Memory With Default Settings

2014-04-01 Thread Pat McDonough (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pat McDonough updated SPARK-1392:
-

Description: 
Using the spark-0.9.0 Hadoop2 binary from the project download page, running 
the spark-shell locally in out of the box configuration, and attempting to 
cache all the attached data, spark OOMs with: java.lang.OutOfMemoryError: GC 
overhead limit exceeded

You can work around the issue by either decreasing spark.storage.memoryFraction 
or increasing SPARK_MEM

  was:
Using the spark-0.9.0 Hadoop2 binary from the project download page, running 
the spark-shell locally in out of the box configuration, and attempting to 
cache all the attached data, spark OOMs with: java.lang.OutOfMemoryError: GC 
overhead limit exceeded

You can work around the issue by either decreasing 
{{spark.storage.memoryFraction}} or increasing {{SPARK_MEM}}


 Local spark-shell Runs Out of Memory With Default Settings
 --

 Key: SPARK-1392
 URL: https://issues.apache.org/jira/browse/SPARK-1392
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 0.9.0
 Environment: OS X 10.9.2, Java 1.7.0_51, Scala 2.10.3
Reporter: Pat McDonough

 Using the spark-0.9.0 Hadoop2 binary from the project download page, running 
 the spark-shell locally in out of the box configuration, and attempting to 
 cache all the attached data, spark OOMs with: java.lang.OutOfMemoryError: GC 
 overhead limit exceeded
 You can work around the issue by either decreasing 
 spark.storage.memoryFraction or increasing SPARK_MEM



--
This message was sent by Atlassian JIRA
(v6.2#6252)