[jira] [Resolved] (SPARK-958) When iteration in ALS increases to 10 running in local mode, spark throws out error of StackOverflowError
[ https://issues.apache.org/jira/browse/SPARK-958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-958. - Resolution: Duplicate When iteration in ALS increases to 10 running in local mode, spark throws out error of StackOverflowError - Key: SPARK-958 URL: https://issues.apache.org/jira/browse/SPARK-958 Project: Spark Issue Type: Bug Reporter: Qiuzhuang Lian I try to use the ml-100k data to test ALS running in local mode in the mllib project. If I specify a smaller number of iterations, it works well. However, when the count is increased to more than 10 iterations, Spark throws a StackOverflowError. Attached is the log file. -- This message was sent by Atlassian JIRA (v6.2#6252)
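The likely mechanism (assumed here, not stated in the report) is that each ALS iteration lengthens the RDD lineage, and recursively evaluating a long lineage eventually exhausts the JVM stack, which is why long iterative jobs need periodic checkpointing to truncate lineage. A toy Java illustration of depth-dependent StackOverflowError, unrelated to Spark's actual code:

```java
public class DeepLineage {
    // Evaluating a value through n nested calls mimics recursively walking
    // an n-deep dependency chain; past some depth the JVM stack overflows.
    static long eval(long depth) {
        if (depth == 0) return 0;
        return 1 + eval(depth - 1);
    }

    // Returns true if evaluating at the given depth overflows the stack.
    static boolean overflows(long depth) {
        try {
            eval(depth);
            return false;
        } catch (StackOverflowError e) {
            return true;
        }
    }
}
```

A shallow chain evaluates fine; a sufficiently deep one reliably overflows, regardless of how cheap each individual step is.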
[jira] [Created] (SPARK-1380) Add sort-merge based cogroup/joins.
Takuya Ueshin created SPARK-1380: Summary: Add sort-merge based cogroup/joins. Key: SPARK-1380 URL: https://issues.apache.org/jira/browse/SPARK-1380 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Takuya Ueshin I've written cogroup/joins based on 'Sort-Merge' algorithm. -- This message was sent by Atlassian JIRA (v6.2#6252)
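For readers unfamiliar with the approach: a sort-merge join advances two key-sorted inputs in lockstep and crosses runs of equal keys, so no hash table is built. A minimal standalone sketch in Java (illustrative only, not the proposed Spark implementation; names are invented):

```java
import java.util.*;

public class SortMergeJoin {
    // Joins two key-sorted lists of (key, value) pairs; emits one
    // "key:leftValue,rightValue" string per matching combination.
    static List<String> join(List<Map.Entry<Integer, String>> left,
                             List<Map.Entry<Integer, String>> right) {
        List<String> out = new ArrayList<>();
        int i = 0, j = 0;
        while (i < left.size() && j < right.size()) {
            int lk = left.get(i).getKey(), rk = right.get(j).getKey();
            if (lk < rk) i++;          // advance whichever side is behind
            else if (lk > rk) j++;
            else {
                // Collect the run of equal keys on each side, then cross them.
                int i2 = i, j2 = j;
                while (i2 < left.size() && left.get(i2).getKey() == lk) i2++;
                while (j2 < right.size() && right.get(j2).getKey() == lk) j2++;
                for (int a = i; a < i2; a++)
                    for (int b = j; b < j2; b++)
                        out.add(lk + ":" + left.get(a).getValue() + "," + right.get(b).getValue());
                i = i2;
                j = j2;
            }
        }
        return out;
    }
}
```

The same lockstep scan generalizes to cogroup by emitting the two runs as grouped buffers instead of crossing them.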
[jira] [Commented] (SPARK-1380) Add sort-merge based cogroup/joins.
[ https://issues.apache.org/jira/browse/SPARK-1380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13956313#comment-13956313 ] Takuya Ueshin commented on SPARK-1380: -- Pull-requested: https://github.com/apache/spark/pull/283 Add sort-merge based cogroup/joins. --- Key: SPARK-1380 URL: https://issues.apache.org/jira/browse/SPARK-1380 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Takuya Ueshin -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1355) Switch website to the Apache CMS
[ https://issues.apache.org/jira/browse/SPARK-1355?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13956607#comment-13956607 ] Joe Schaefer commented on SPARK-1355: - It just looks funny that a cutting-edge project like Spark should rely on a vanilla cookie-cutter blog-site generator like jekyll to manage its website assets. Go for broke and grasp the brass ring - bring your website technology to new levels with the Apache CMS! Switch website to the Apache CMS Key: SPARK-1355 URL: https://issues.apache.org/jira/browse/SPARK-1355 Project: Spark Issue Type: Improvement Components: Project Infra Reporter: Joe Schaefer Jekyll is ancient history useful for small blogger sites and little else. Why not upgrade to the Apache CMS? It supports the same on-disk format for .md files and interfaces with pygments for code highlighting. Thrift recently switched from nanoc to the CMS and loves it! -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1355) Switch website to the Apache CMS
[ https://issues.apache.org/jira/browse/SPARK-1355?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13956642#comment-13956642 ] Mark Hamstra commented on SPARK-1355: - Resources are limited as we progress toward our 1.0 release. I can't see reallocating those commitments just to avoid looking funny in the estimation of some observers. If someone not otherwise occupied wants to contribute the work to convert to Apache CMS, that's another thing. Switch website to the Apache CMS Key: SPARK-1355 URL: https://issues.apache.org/jira/browse/SPARK-1355 Project: Spark Issue Type: Improvement Components: Project Infra Reporter: Joe Schaefer -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1355) Switch website to the Apache CMS
[ https://issues.apache.org/jira/browse/SPARK-1355?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13956683#comment-13956683 ] Joe Schaefer commented on SPARK-1355: - Nonsense - you have plenty of time; you just lack the appropriate prioritization for this task, which should be marked Critical as we are trying to help you help yourselves. Do yourselves a solid and get it done this week to avoid further embarrassment, mkay? Switch website to the Apache CMS Key: SPARK-1355 URL: https://issues.apache.org/jira/browse/SPARK-1355 Project: Spark Issue Type: Improvement Components: Project Infra Reporter: Joe Schaefer -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1364) DataTypes missing from ScalaReflection
[ https://issues.apache.org/jira/browse/SPARK-1364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-1364: Priority: Blocker (was: Major) DataTypes missing from ScalaReflection -- Key: SPARK-1364 URL: https://issues.apache.org/jira/browse/SPARK-1364 Project: Spark Issue Type: Bug Components: SQL Reporter: Michael Armbrust Priority: Blocker Fix For: 1.0.0 BigDecimal, possibly others. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1371) HashAggregate should stream tuples and avoid doing an extra count
[ https://issues.apache.org/jira/browse/SPARK-1371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-1371: Priority: Blocker (was: Major) HashAggregate should stream tuples and avoid doing an extra count - Key: SPARK-1371 URL: https://issues.apache.org/jira/browse/SPARK-1371 Project: Spark Issue Type: Bug Components: SQL Reporter: Michael Armbrust Assignee: Michael Armbrust Priority: Blocker Fix For: 1.0.0 -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1367) NPE when joining Parquet Relations
[ https://issues.apache.org/jira/browse/SPARK-1367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13956781#comment-13956781 ] Michael Armbrust commented on SPARK-1367: - No, in that commit there is a TODO as the testcase still NPEs. We still need to remove the @transient from ParquetTableScan. If you don't have time to do this I can. NPE when joining Parquet Relations -- Key: SPARK-1367 URL: https://issues.apache.org/jira/browse/SPARK-1367 Project: Spark Issue Type: Bug Components: SQL Reporter: Michael Armbrust Assignee: Andre Schumacher Priority: Blocker Fix For: 1.0.0
{code}
test("self-join parquet files") {
  val x = ParquetTestData.testData.subquery('x)
  val y = ParquetTestData.testData.newInstance.subquery('y)
  val query = x.join(y).where(x.myint.attr === y.myint.attr)
  query.collect()
}
{code}
-- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1384) spark-shell on yarn on spark 0.9 branch doesn't always work with secure hdfs
[ https://issues.apache.org/jira/browse/SPARK-1384?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves updated SPARK-1384: - Description: I've found an issue with the spark-shell in yarn-client mode in the 0.9.1 rc3 release. It doesn't work with secure HDFS unless you export SPARK_YARN_MODE=true before starting the shell, or if you happen to do something immediately with HDFS. If you wait for the connection to the namenode to time out, it will fail. The fix actually went into the master branch with the authentication changes I made there, but I never realized that change needed to be applied to 0.9. https://github.com/apache/spark/commit/7edbea41b43e0dc11a2de156be220db8b7952d01#diff-0ae5b834ce90ec37c19af35aa7a5e1a0 See the SparkILoop diff. was: I've found an issue with the spark-shell in yarn-client mode in the 0.9.1 rc3 release. It doesn't work with secure HDFS unless you export SPARK_YARN_MODE=true before starting the shell, or if you happen to do something immediately with HDFS. If you wait for the connection to the namenode to time out, it will fail. I think it was actually this way in the 0.9 release also, so I thought I would send this and get people's feedback to see if you want it fixed. Another option would be to document that you have to export SPARK_YARN_MODE=true for the shell. The fix actually went in with the authentication changes I made in master, but I never realized that change needed to be applied to 0.9. https://github.com/apache/spark/commit/7edbea41b43e0dc11a2de156be220db8b7952d01#diff-0ae5b834ce90ec37c19af35aa7a5e1a0 See the SparkILoop diff. spark-shell on yarn on spark 0.9 branch doesn't always work with secure hdfs Key: SPARK-1384 URL: https://issues.apache.org/jira/browse/SPARK-1384 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 0.9.0, 0.9.1 Reporter: Thomas Graves -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-1385) Use existing code-path for JSON de/serialization of BlockId
Andrew Or created SPARK-1385: Summary: Use existing code-path for JSON de/serialization of BlockId Key: SPARK-1385 URL: https://issues.apache.org/jira/browse/SPARK-1385 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 0.9.0, 0.9.1 Reporter: Andrew Or Priority: Minor Fix For: 1.0.0 BlockId.scala already takes care of JSON de/serialization by parsing the string to and from regex. This functionality is currently duplicated in util/JsonProtocol.scala. -- This message was sent by Atlassian JIRA (v6.2#6252)
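For context, BlockId names follow simple patterns (e.g. rdd_&lt;rddId&gt;_&lt;splitIndex&gt;) that the existing code path parses with regexes; the issue proposes reusing that instead of a second scheme in util/JsonProtocol.scala. A hedged Java sketch of such a regex round-trip (patterns assumed, simplified from BlockId.scala's actual set):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BlockIdParse {
    // Assumed name formats, mirroring BlockId's regex style:
    //   rdd_<rddId>_<splitIndex>, shuffle_<shuffleId>_<mapId>_<reduceId>
    static final Pattern RDD = Pattern.compile("rdd_(\\d+)_(\\d+)");
    static final Pattern SHUFFLE = Pattern.compile("shuffle_(\\d+)_(\\d+)_(\\d+)");

    // Serialize: the block id's name string IS its serialized form.
    static String rddName(int rddId, int splitIndex) {
        return "rdd_" + rddId + "_" + splitIndex;
    }

    // Deserialize: parse the name back into typed fields via the regex,
    // the same path a JSON protocol could delegate to.
    static int[] parseRdd(String name) {
        Matcher m = RDD.matcher(name);
        if (!m.matches()) throw new IllegalArgumentException("not an RDD block id: " + name);
        return new int[]{Integer.parseInt(m.group(1)), Integer.parseInt(m.group(2))};
    }
}
```

With this shape, JSON de/serialization only needs to store the name string and call the parser, rather than duplicating per-type field handling.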
[jira] [Updated] (SPARK-1386) Spark Streaming UI
[ https://issues.apache.org/jira/browse/SPARK-1386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-1386: - Priority: Blocker (was: Major) Spark Streaming UI -- Key: SPARK-1386 URL: https://issues.apache.org/jira/browse/SPARK-1386 Project: Spark Issue Type: New Feature Components: Streaming Reporter: Tathagata Das Priority: Blocker When debugging Spark Streaming applications it is necessary to monitor certain metrics that are not shown in the Spark application UI. For example, what is the average processing time of batches? What is the scheduling delay? Is the system able to process as fast as it is receiving data? How many records am I receiving through my receivers? While the StreamingListener interface introduced in 0.9 provided some of this information, it could only be accessed programmatically. A UI that shows information specific to streaming applications is necessary for easier debugging. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-1386) Spark Streaming UI
Tathagata Das created SPARK-1386: Summary: Spark Streaming UI Key: SPARK-1386 URL: https://issues.apache.org/jira/browse/SPARK-1386 Project: Spark Issue Type: New Feature Components: Streaming Reporter: Tathagata Das -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1386) Spark Streaming UI
[ https://issues.apache.org/jira/browse/SPARK-1386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-1386: - Affects Version/s: 0.9.0 Spark Streaming UI -- Key: SPARK-1386 URL: https://issues.apache.org/jira/browse/SPARK-1386 Project: Spark Issue Type: New Feature Components: Streaming Affects Versions: 0.9.0 Reporter: Tathagata Das Priority: Blocker -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1332) Improve Spark Streaming's Network Receiver and InputDStream API for future stability
[ https://issues.apache.org/jira/browse/SPARK-1332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-1332: - Priority: Blocker (was: Critical) Improve Spark Streaming's Network Receiver and InputDStream API for future stability Key: SPARK-1332 URL: https://issues.apache.org/jira/browse/SPARK-1332 Project: Spark Issue Type: Improvement Components: Streaming Affects Versions: 0.9.0 Reporter: Tathagata Das Assignee: Tathagata Das Priority: Blocker The current Network Receiver API makes it slightly complicated to write a new receiver, as one needs to create an instance of BlockGenerator as shown in SocketReceiver https://github.com/apache/spark/blob/master/streaming/src/main/scala/org/apache/spark/streaming/dstream/SocketInputDStream.scala#L51 Exposing the BlockGenerator interface has made it harder to improve the receiving process. The API of NetworkReceiver (which was not a very stable API anyway) needs to be changed if we are to ensure future stability. Additionally, the functions like streamingContext.socketStream that create input streams return DStream objects. That makes it hard to expose functionality (say, rate limits) unique to input dstreams. They should return InputDStream or NetworkInputDStream. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-1387) Update build plugins, avoid plugin version warning, centralize versions
Sean Owen created SPARK-1387: Summary: Update build plugins, avoid plugin version warning, centralize versions Key: SPARK-1387 URL: https://issues.apache.org/jira/browse/SPARK-1387 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 0.9.0 Reporter: Sean Owen Priority: Minor Another handful of small build changes to organize and standardize a bit, and avoid warnings: - Update Maven plugin versions for good measure - Since plugins need maven 3.0.4 already, require it explicitly (3.0.4 had some bugs anyway) - Use variables to define versions across dependencies where they should move in lock step - ... and make this consistent between Maven/SBT -- This message was sent by Atlassian JIRA (v6.2#6252)
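To illustrate the "centralize versions" point: a hypothetical pom.xml fragment (the property names and version numbers below are invented for illustration, not taken from Spark's actual build), showing the explicit Maven prerequisite and a version defined once and referenced where used:

```xml
<!-- Require the Maven version the plugins already need. -->
<prerequisites>
  <maven>3.0.4</maven>
</prerequisites>

<!-- Define each shared dependency version exactly once, so Maven and
     the SBT build can move them in lock step. -->
<properties>
  <akka.version>2.2.3</akka.version>
</properties>

<dependencies>
  <dependency>
    <groupId>com.typesafe.akka</groupId>
    <artifactId>akka-actor</artifactId>
    <version>${akka.version}</version>
  </dependency>
</dependencies>
```

The same property can then back every module that pulls in the dependency, rather than repeating a literal version string per module.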
[jira] [Assigned] (SPARK-1367) NPE when joining Parquet Relations
[ https://issues.apache.org/jira/browse/SPARK-1367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust reassigned SPARK-1367: --- Assignee: Michael Armbrust (was: Andre Schumacher) NPE when joining Parquet Relations -- Key: SPARK-1367 URL: https://issues.apache.org/jira/browse/SPARK-1367 Project: Spark Issue Type: Bug Components: SQL Reporter: Michael Armbrust Assignee: Michael Armbrust Priority: Blocker Fix For: 1.0.0 -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1355) Switch website to the Apache CMS
[ https://issues.apache.org/jira/browse/SPARK-1355?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13957113#comment-13957113 ] Sean Owen commented on SPARK-1355: -- April Fools, apparently. Though this was opened on 30 March? Switch website to the Apache CMS Key: SPARK-1355 URL: https://issues.apache.org/jira/browse/SPARK-1355 Project: Spark Issue Type: Improvement Components: Project Infra Reporter: Joe Schaefer -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1378) Build error: org.eclipse.paho:mqtt-client
[ https://issues.apache.org/jira/browse/SPARK-1378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13957124#comment-13957124 ] Ken Williams commented on SPARK-1378: - What resolved it on our end was to either run without a local MVN repo (moving the {{~/.m2/settings.xml}} out of the way) or adding the mqtt-repo (https://repo.eclipse.org/content/repositories/paho-releases) to our set of mirrors. Build error: org.eclipse.paho:mqtt-client - Key: SPARK-1378 URL: https://issues.apache.org/jira/browse/SPARK-1378 Project: Spark Issue Type: Bug Components: Build Affects Versions: 0.9.0 Reporter: Ken Williams Using Maven, I'm unable to build the 0.9.0 distribution I just downloaded. I attempt it like so: {code} mvn -U -Pyarn -Dhadoop.version=2.2.0 -Dyarn.version=2.2.0 -DskipTests package {code} The Maven error is: {code} [ERROR] Failed to execute goal on project spark-examples_2.10: Could not resolve dependencies for project org.apache.spark:spark-examples_2.10:jar:0.9.0-incubating: Could not find artifact org.eclipse.paho:mqtt-client:jar:0.4.0 in nexus {code} My Maven version is 3.2.1, running on Java 1.7.0, using Scala 2.10.4. Is there an additional Maven repository I should add or something? If I go into the {{pom.xml}} and comment out the {{external/mqtt}} and {{examples}} modules, the build succeeds. I'm fine without the MQTT stuff, but I would really like to get the examples working because I haven't played with Spark before. -- This message was sent by Atlassian JIRA (v6.2#6252)
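The repository workaround above can be expressed directly in a pom.xml; a sketch (the repository id and name are arbitrary choices, only the URL comes from the comment):

```xml
<repositories>
  <repository>
    <id>mqtt-repo</id>
    <name>Eclipse Paho releases</name>
    <url>https://repo.eclipse.org/content/repositories/paho-releases</url>
    <releases><enabled>true</enabled></releases>
    <snapshots><enabled>false</enabled></snapshots>
  </repository>
</repositories>
```

A local ~/.m2/settings.xml that routes everything through a single mirror would bypass this entry, which is why moving settings.xml aside also resolved it.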
[jira] [Closed] (SPARK-1355) Switch website to the Apache CMS
[ https://issues.apache.org/jira/browse/SPARK-1355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joe Schaefer closed SPARK-1355. --- Resolution: Invalid Switch website to the Apache CMS Key: SPARK-1355 URL: https://issues.apache.org/jira/browse/SPARK-1355 Project: Spark Issue Type: Improvement Components: Project Infra Reporter: Joe Schaefer -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1388) ConcurrentModificationException in hadoop_common exposed by Spark
[ https://issues.apache.org/jira/browse/SPARK-1388?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nishkam Ravi updated SPARK-1388: Attachment: (was: Conf_Spark.patch) ConcurrentModificationException in hadoop_common exposed by Spark - Key: SPARK-1388 URL: https://issues.apache.org/jira/browse/SPARK-1388 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 0.9.0 Reporter: Nishkam Ravi Attachments: nravi_Conf_Spark-1388.patch The following exception occurs non-deterministically:
{noformat}
java.util.ConcurrentModificationException
at java.util.HashMap$HashIterator.nextEntry(HashMap.java:926)
at java.util.HashMap$KeyIterator.next(HashMap.java:960)
at java.util.AbstractCollection.addAll(AbstractCollection.java:341)
at java.util.HashSet.init(HashSet.java:117)
at org.apache.hadoop.conf.Configuration.init(Configuration.java:671)
at org.apache.hadoop.mapred.JobConf.init(JobConf.java:439)
at org.apache.spark.rdd.HadoopRDD.getJobConf(HadoopRDD.scala:110)
at org.apache.spark.rdd.HadoopRDD$$anon$1.init(HadoopRDD.scala:154)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:149)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:64)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:34)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:161)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:102)
at org.apache.spark.scheduler.Task.run(Task.scala:53)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:213)
at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:42)
at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:41)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
at org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:41)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)
{noformat}
-- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1097) ConcurrentModificationException
[ https://issues.apache.org/jira/browse/SPARK-1097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13957233#comment-13957233 ] Nishkam Ravi commented on SPARK-1097: - Attached is a patch for this issue. Verified with mvn test/compile/install. ConcurrentModificationException --- Key: SPARK-1097 URL: https://issues.apache.org/jira/browse/SPARK-1097 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 0.9.0 Reporter: Fabrizio Milo Attachments: nravi_Conf_Spark-1388.patch
{noformat}
14/02/16 08:18:45 WARN TaskSetManager: Loss was due to java.util.ConcurrentModificationException
java.util.ConcurrentModificationException
at java.util.HashMap$HashIterator.nextEntry(HashMap.java:926)
at java.util.HashMap$KeyIterator.next(HashMap.java:960)
at java.util.AbstractCollection.addAll(AbstractCollection.java:341)
at java.util.HashSet.init(HashSet.java:117)
at org.apache.hadoop.conf.Configuration.init(Configuration.java:554)
at org.apache.hadoop.mapred.JobConf.init(JobConf.java:439)
at org.apache.spark.rdd.HadoopRDD.getJobConf(HadoopRDD.scala:110)
at org.apache.spark.rdd.HadoopRDD$$anon$1.init(HadoopRDD.scala:154)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:149)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:64)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
at org.apache.spark.rdd.UnionPartition.iterator(UnionRDD.scala:32)
at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:72)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
at org.apache.spark.rdd.FlatMappedRDD.compute(FlatMappedRDD.scala:33)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:161)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:102)
at org.apache.spark.scheduler.Task.run(Task.scala:53)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:213)
at org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:49)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)
{noformat}
-- This message was sent by Atlassian JIRA (v6.2#6252)
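The fail-fast check behind both SPARK-1097 and SPARK-1388 can be reproduced deterministically: structurally modifying a HashMap while an iterator over it is live throws ConcurrentModificationException from HashMap$HashIterator.nextEntry, the same frame as in the traces. A minimal Java demonstration (illustrative only, not the Hadoop Configuration code):

```java
import java.util.ConcurrentModificationException;
import java.util.HashMap;
import java.util.Map;

public class CmeDemo {
    // Returns true if iterating a HashMap while putting a new key into it
    // triggers the fail-fast ConcurrentModificationException.
    static boolean triggersCme() {
        Map<String, String> conf = new HashMap<>();
        conf.put("k1", "v1");
        conf.put("k2", "v2");
        try {
            for (String k : conf.keySet()) {
                conf.put("k3", "v3"); // structural modification mid-iteration
            }
            return false;
        } catch (ConcurrentModificationException e) {
            return true; // thrown by the iterator's next() modCount check
        }
    }
}
```

In the Hadoop case the modification comes from another thread touching a shared Configuration, so the exception is intermittent rather than guaranteed as here.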
[jira] [Commented] (SPARK-1271) Use Iterator[X] in co-group and group-by signatures
[ https://issues.apache.org/jira/browse/SPARK-1271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13957293#comment-13957293 ] holdenk commented on SPARK-1271: https://github.com/apache/spark/pull/242 Use Iterator[X] in co-group and group-by signatures --- Key: SPARK-1271 URL: https://issues.apache.org/jira/browse/SPARK-1271 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Patrick Wendell Priority: Blocker Fix For: 1.0.0 This API change will allow us to externalize these things down the road. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-939) Allow user jars to take precedence over Spark jars, if desired
[ https://issues.apache.org/jira/browse/SPARK-939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13957294#comment-13957294 ] holdenk commented on SPARK-939: --- https://github.com/apache/spark/pull/217 Allow user jars to take precedence over Spark jars, if desired -- Key: SPARK-939 URL: https://issues.apache.org/jira/browse/SPARK-939 Project: Spark Issue Type: New Feature Components: Spark Core Reporter: Patrick Wendell Assignee: holdenk Priority: Blocker Labels: starter Fix For: 1.0.0 Sometimes a user may want to include their own version of a jar that Spark itself uses. For example, if their code requires a newer version of that jar than Spark offers. It would be good to have an option to give the user's dependencies precedence over Spark's. This option should be disabled by default, since it could lead to some odd behavior (e.g. parts of Spark not working). But I think we should have it. From an implementation perspective, this would require modifying the way we do class loading inside of an Executor. The default behavior of the URLClassLoader is to delegate to its parent first and, if that fails, to find a class locally. We want the opposite behavior. This is sometimes referred to as parent-last (as opposed to parent-first) class loading precedence. There is an example of how to do this here: http://stackoverflow.com/questions/5445511/how-do-i-create-a-parent-last-child-first-classloader-in-java-or-how-to-overr We should write a similar class which can encapsulate a URL classloader and change the delegation order. Or if possible, maybe we could find a more elegant way to do this. 
See relevant discussion on the user list here: https://groups.google.com/forum/#!topic/spark-users/b278DW3e38g Also see the corresponding option in Hadoop: https://issues.apache.org/jira/browse/MAPREDUCE-4521 Some other relevant Hadoop JIRA's: https://issues.apache.org/jira/browse/MAPREDUCE-1700 https://issues.apache.org/jira/browse/MAPREDUCE-1938 -- This message was sent by Atlassian JIRA (v6.2#6252)
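The parent-last idea described above can be sketched as a URLClassLoader subclass that tries its own URLs before delegating upward. This is an illustrative sketch in the spirit of the linked Stack Overflow answer, not the eventual Spark implementation:

```java
import java.net.URL;
import java.net.URLClassLoader;

// Parent-last class loading: look in the child's URLs (the user's jars)
// first and fall back to the parent only when that lookup fails. JDK core
// classes still resolve via the parent, because findClass over the child's
// URLs throws ClassNotFoundException for them.
public class ChildFirstClassLoader extends URLClassLoader {
    public ChildFirstClassLoader(URL[] urls, ClassLoader parent) {
        super(urls, parent);
    }

    @Override
    protected Class<?> loadClass(String name, boolean resolve) throws ClassNotFoundException {
        synchronized (getClassLoadingLock(name)) {
            Class<?> c = findLoadedClass(name);
            if (c == null) {
                try {
                    c = findClass(name);                // child (user jars) first
                } catch (ClassNotFoundException e) {
                    c = super.loadClass(name, resolve); // then the normal parent-first path
                }
            }
            if (resolve) resolveClass(c);
            return c;
        }
    }
}
```

One caveat worth noting with this delegation order: classes the user's jar shares with Spark get loaded twice by different loaders, so passing such objects across the boundary raises ClassCastException risks, which is part of why the option should stay off by default.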
[jira] [Commented] (SPARK-1388) ConcurrentModificationException in hadoop_common exposed by Spark
[ https://issues.apache.org/jira/browse/SPARK-1388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13957330#comment-13957330 ] Sean Owen commented on SPARK-1388: -- Yes, this should be resolved as a duplicate instead. ConcurrentModificationException in hadoop_common exposed by Spark - Key: SPARK-1388 URL: https://issues.apache.org/jira/browse/SPARK-1388 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 0.9.0 Reporter: Nishkam Ravi Attachments: nravi_Conf_Spark-1388.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1371) HashAggregate should stream tuples and avoid doing an extra count
[ https://issues.apache.org/jira/browse/SPARK-1371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13957365#comment-13957365 ] Michael Armbrust commented on SPARK-1371: - https://github.com/apache/spark/pull/295 HashAggregate should stream tuples and avoid doing an extra count - Key: SPARK-1371 URL: https://issues.apache.org/jira/browse/SPARK-1371 Project: Spark Issue Type: Bug Components: SQL Reporter: Michael Armbrust Assignee: Michael Armbrust Priority: Blocker Fix For: 1.0.0 -- This message was sent by Atlassian JIRA (v6.2#6252)
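The "stream tuples" idea in SPARK-1371's title can be illustrated outside Spark: consume the input iterator exactly once, updating running per-key aggregates, instead of counting or materializing the input in a separate pass. A hedged Java sketch (not the actual HashAggregate operator):

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

public class StreamingAgg {
    // Single pass over an iterator of (key, value) rows: each row updates a
    // running {sum, count} pair, so no extra counting pass and no buffered
    // copy of the input is needed.
    static Map<String, long[]> sumCount(Iterator<Map.Entry<String, Long>> rows) {
        Map<String, long[]> acc = new HashMap<>();
        while (rows.hasNext()) {
            Map.Entry<String, Long> row = rows.next();
            long[] sc = acc.computeIfAbsent(row.getKey(), k -> new long[2]);
            sc[0] += row.getValue(); // running sum
            sc[1] += 1;              // running count
        }
        return acc;
    }
}
```

Derived aggregates such as the average fall out as sum/count at the end, without ever re-scanning the input.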
[jira] [Resolved] (SPARK-1372) Expose in-memory columnar caching for tables.
[ https://issues.apache.org/jira/browse/SPARK-1372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-1372. - Resolution: Fixed Expose in-memory columnar caching for tables. - Key: SPARK-1372 URL: https://issues.apache.org/jira/browse/SPARK-1372 Project: Spark Issue Type: Improvement Components: SQL Reporter: Michael Armbrust Assignee: Michael Armbrust Priority: Blocker Fix For: 1.0.0 -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1364) DataTypes missing from ScalaReflection
[ https://issues.apache.org/jira/browse/SPARK-1364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13957366#comment-13957366 ] Michael Armbrust commented on SPARK-1364: - https://github.com/apache/spark/pull/293 DataTypes missing from ScalaReflection -- Key: SPARK-1364 URL: https://issues.apache.org/jira/browse/SPARK-1364 Project: Spark Issue Type: Bug Components: SQL Reporter: Michael Armbrust Assignee: Michael Armbrust Priority: Blocker Fix For: 1.0.0 -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1392) Local spark-shell Runs Out of Memory With Default Settings
[ https://issues.apache.org/jira/browse/SPARK-1392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pat McDonough updated SPARK-1392: - Description: Using the spark-0.9.0 Hadoop2 binary from the project download page, running the spark-shell locally in out of the box configuration, and attempting to cache all the attached data, spark OOMs with: java.lang.OutOfMemoryError: GC overhead limit exceeded You can work around the issue by either decreasing spark.storage.memoryFraction or increasing SPARK_MEM was: Using the spark-0.9.0 Hadoop2 binary from the project download page, running the spark-shell locally in out of the box configuration, and attempting to cache all the attached data, spark OOMs with: java.lang.OutOfMemoryError: GC overhead limit exceeded You can work around the issue by either decreasing {{spark.storage.memoryFraction}} or increasing {{SPARK_MEM}} Local spark-shell Runs Out of Memory With Default Settings -- Key: SPARK-1392 URL: https://issues.apache.org/jira/browse/SPARK-1392 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 0.9.0 Environment: OS X 10.9.2, Java 1.7.0_51, Scala 2.10.3 Reporter: Pat McDonough -- This message was sent by Atlassian JIRA (v6.2#6252)