[jira] [Commented] (SPARK-4430) Apache RAT Checks fail spuriously on test files
[ https://issues.apache.org/jira/browse/SPARK-4430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14290750#comment-14290750 ] Apache Spark commented on SPARK-4430: - User 'srowen' has created a pull request for this issue: https://github.com/apache/spark/pull/4189 Apache RAT Checks fail spuriously on test files --- Key: SPARK-4430 URL: https://issues.apache.org/jira/browse/SPARK-4430 Project: Spark Issue Type: Bug Components: Build Reporter: Ryan Williams Several of my recent runs of {{./dev/run-tests}} have failed quickly due to Apache RAT checks, e.g.: {code} $ ./dev/run-tests = Running Apache RAT checks = Could not find Apache license headers in the following files: !? /Users/ryan/c/spark/streaming/FailureSuite/b732c105-4fd3-4330-ba6d-a366b340c303/test/28 !? /Users/ryan/c/spark/streaming/FailureSuite/b732c105-4fd3-4330-ba6d-a366b340c303/test/29 !? /Users/ryan/c/spark/streaming/FailureSuite/b732c105-4fd3-4330-ba6d-a366b340c303/test/30 !? /Users/ryan/c/spark/streaming/FailureSuite/b98beebe-98b0-472a-b4a5-060bcd91e401/test/10 !? /Users/ryan/c/spark/streaming/FailureSuite/b98beebe-98b0-472a-b4a5-060bcd91e401/test/11 !? /Users/ryan/c/spark/streaming/FailureSuite/b98beebe-98b0-472a-b4a5-060bcd91e401/test/12 !? /Users/ryan/c/spark/streaming/FailureSuite/b98beebe-98b0-472a-b4a5-060bcd91e401/test/13 !? /Users/ryan/c/spark/streaming/FailureSuite/b98beebe-98b0-472a-b4a5-060bcd91e401/test/14 !? /Users/ryan/c/spark/streaming/FailureSuite/b98beebe-98b0-472a-b4a5-060bcd91e401/test/15 !? /Users/ryan/c/spark/streaming/FailureSuite/b98beebe-98b0-472a-b4a5-060bcd91e401/test/16 !? /Users/ryan/c/spark/streaming/FailureSuite/b98beebe-98b0-472a-b4a5-060bcd91e401/test/17 !? /Users/ryan/c/spark/streaming/FailureSuite/b98beebe-98b0-472a-b4a5-060bcd91e401/test/18 !? /Users/ryan/c/spark/streaming/FailureSuite/b98beebe-98b0-472a-b4a5-060bcd91e401/test/19 !? /Users/ryan/c/spark/streaming/FailureSuite/b98beebe-98b0-472a-b4a5-060bcd91e401/test/20 !? /Users/ryan/c/spark/streaming/FailureSuite/b98beebe-98b0-472a-b4a5-060bcd91e401/test/21 !? /Users/ryan/c/spark/streaming/FailureSuite/b98beebe-98b0-472a-b4a5-060bcd91e401/test/22 !? /Users/ryan/c/spark/streaming/FailureSuite/b98beebe-98b0-472a-b4a5-060bcd91e401/test/23 !? /Users/ryan/c/spark/streaming/FailureSuite/b98beebe-98b0-472a-b4a5-060bcd91e401/test/24 !? /Users/ryan/c/spark/streaming/FailureSuite/b98beebe-98b0-472a-b4a5-060bcd91e401/test/25 !? /Users/ryan/c/spark/streaming/FailureSuite/b98beebe-98b0-472a-b4a5-060bcd91e401/test/26 !? /Users/ryan/c/spark/streaming/FailureSuite/b98beebe-98b0-472a-b4a5-060bcd91e401/test/27 !? /Users/ryan/c/spark/streaming/FailureSuite/b98beebe-98b0-472a-b4a5-060bcd91e401/test/28 !? /Users/ryan/c/spark/streaming/FailureSuite/b98beebe-98b0-472a-b4a5-060bcd91e401/test/29 !? /Users/ryan/c/spark/streaming/FailureSuite/b98beebe-98b0-472a-b4a5-060bcd91e401/test/30 !? /Users/ryan/c/spark/streaming/FailureSuite/b98beebe-98b0-472a-b4a5-060bcd91e401/test/7 !? /Users/ryan/c/spark/streaming/FailureSuite/b98beebe-98b0-472a-b4a5-060bcd91e401/test/8 !? /Users/ryan/c/spark/streaming/FailureSuite/b98beebe-98b0-472a-b4a5-060bcd91e401/test/9 [error] Got a return code of 1 on line 114 of the run-tests script. {code} I think it's fair to say that these are not useful errors for {{run-tests}} to crash on. Ideally we could tell the linter which files we care about having it lint and which we don't. 
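Editor's note: the stray files above are scratch output left behind by the streaming {{FailureSuite}} under the source tree, which is why RAT picks them up. As an illustration of the general remedy (not necessarily what the linked pull request does), a test can keep such scratch files out of RAT's view by writing them under the JVM temp directory and deleting them afterwards:
{code}
import java.nio.file.{Files, Path}
import scala.util.control.NonFatal

object TempDirExample {
  // Create a scratch directory outside the source tree, use it, then delete it.
  def withTempDir[T](prefix: String)(body: Path => T): T = {
    val dir = Files.createTempDirectory(prefix) // lives under java.io.tmpdir, not the repo
    try {
      body(dir)
    } finally {
      // Best-effort recursive cleanup so no unlicensed files are left behind.
      try {
        import scala.collection.JavaConverters._
        Files.walk(dir).iterator().asScala.toSeq.reverse.foreach(Files.deleteIfExists(_))
      } catch { case NonFatal(_) => () }
    }
  }

  def main(args: Array[String]): Unit = {
    withTempDir("spark-failure-suite-") { dir =>
      val f = Files.createTempFile(dir, "test", ".dat")
      println(s"wrote scratch file: $f")
    }
  }
}
{code}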
[jira] [Commented] (SPARK-5393) Flood of util.RackResolver log messages after SPARK-1714
[ https://issues.apache.org/jira/browse/SPARK-5393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14290851#comment-14290851 ] Apache Spark commented on SPARK-5393: - User 'sryza' has created a pull request for this issue: https://github.com/apache/spark/pull/4192 Flood of util.RackResolver log messages after SPARK-1714 Key: SPARK-5393 URL: https://issues.apache.org/jira/browse/SPARK-5393 Project: Spark Issue Type: Bug Affects Versions: 1.3.0 Reporter: Sandy Ryza Assignee: Sandy Ryza Priority: Critical I thought I fixed this while working on the patch, but [~laserson] seems to have encountered it when running on master. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
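Editor's note: for anyone hit by this before a fix is merged, the flood comes from Hadoop's {{org.apache.hadoop.yarn.util.RackResolver}} logging a line per lookup at INFO. A minimal workaround sketch is to raise just that logger's level before the YARN allocator starts (the same effect can be had with {{log4j.logger.org.apache.hadoop.yarn.util.RackResolver=WARN}} in log4j.properties); the eventual patch may or may not do exactly this:
{code}
import org.apache.log4j.{Level, Logger}

object QuietRackResolver {
  // Raise only the chatty Hadoop RackResolver logger to WARN; everything else is untouched.
  def silence(): Unit = {
    Logger.getLogger("org.apache.hadoop.yarn.util.RackResolver").setLevel(Level.WARN)
  }
}
{code}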
[jira] [Commented] (SPARK-1714) Take advantage of AMRMClient APIs to simplify logic in YarnAllocationHandler
[ https://issues.apache.org/jira/browse/SPARK-1714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14290852#comment-14290852 ] Apache Spark commented on SPARK-1714: - User 'sryza' has created a pull request for this issue: https://github.com/apache/spark/pull/4192 Take advantage of AMRMClient APIs to simplify logic in YarnAllocationHandler Key: SPARK-1714 URL: https://issues.apache.org/jira/browse/SPARK-1714 Project: Spark Issue Type: Improvement Components: YARN Affects Versions: 1.2.0 Reporter: Sandy Ryza Assignee: Sandy Ryza Fix For: 1.3.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4964) Exactly-once semantics for Kafka
[ https://issues.apache.org/jira/browse/SPARK-4964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14290747#comment-14290747 ] Cody Koeninger commented on SPARK-4964: --- Design doc at https://docs.google.com/a/databricks.com/document/d/1IuvZhg9cOueTf1mq4qwc1fhPb5FVcaRLcyjrtG4XU1k/edit?usp=sharing Exactly-once semantics for Kafka Key: SPARK-4964 URL: https://issues.apache.org/jira/browse/SPARK-4964 Project: Spark Issue Type: Improvement Components: Streaming Reporter: Cody Koeninger for background, see http://apache-spark-developers-list.1001551.n3.nabble.com/Which-committers-care-about-Kafka-td9827.html Requirements: - allow client code to implement exactly-once end-to-end semantics for Kafka messages, in cases where their output storage is either idempotent or transactional - allow client code access to Kafka offsets, rather than automatically committing them - do not assume Zookeeper as a repository for offsets (for the transactional case, offsets need to be stored in the same store as the data) - allow failure recovery without lost or duplicated messages, even in cases where a checkpoint cannot be restored (for instance, because code must be updated) Design: The basic idea is to make an rdd where each partition corresponds to a given Kafka topic, partition, starting offset, and ending offset. That allows for deterministic replay of data from Kafka (as long as there is enough log retention). Client code is responsible for committing offsets, either transactionally to the same store that data is being written to, or in the case of idempotent data, after data has been written. PR of a sample implementation for both the batch and dstream case is forthcoming. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
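Editor's note: to make the partition-per-offset-range idea concrete, here is an illustrative sketch; the names ({{OffsetRange}} and so on) are hypothetical and are not the API of the forthcoming PR:
{code}
// Sketch of the idea (names hypothetical): each partition of the batch is a fixed
// Kafka offset range, so the same records can be re-read deterministically on retry.
case class OffsetRange(topic: String, partition: Int, fromOffset: Long, untilOffset: Long) {
  def count: Long = untilOffset - fromOffset
}

object OffsetRangeExample {
  def main(args: Array[String]): Unit = {
    // One RDD partition per (topic, partition, offset range); the client commits
    // untilOffset to its own store only after the output for the range is written.
    val ranges = Seq(
      OffsetRange("events", partition = 0, fromOffset = 100L, untilOffset = 200L),
      OffsetRange("events", partition = 1, fromOffset = 250L, untilOffset = 400L))
    ranges.foreach(r => println(s"$r -> ${r.count} messages"))
  }
}
{code}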
[jira] [Commented] (SPARK-2285) Give various TaskEndReason subclass more descriptive names
[ https://issues.apache.org/jira/browse/SPARK-2285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14290785#comment-14290785 ] Apache Spark commented on SPARK-2285: - User 'srowen' has created a pull request for this issue: https://github.com/apache/spark/pull/4191 Give various TaskEndReason subclass more descriptive names -- Key: SPARK-2285 URL: https://issues.apache.org/jira/browse/SPARK-2285 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.0.0 Reporter: Reynold Xin Priority: Minor It is just strange to have org.apache.spark.Success be a TaskEndReason. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-4831) Current directory always on classpath with spark-submit
[ https://issues.apache.org/jira/browse/SPARK-4831?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-4831. -- Resolution: Fixed Fix Version/s: 1.3.0 Assignee: Daniel Darabos Looks like this was merged in https://github.com/apache/spark/commit/7cb3f54793124c527d62906c565aba2c3544e422 Current directory always on classpath with spark-submit --- Key: SPARK-4831 URL: https://issues.apache.org/jira/browse/SPARK-4831 Project: Spark Issue Type: Bug Components: Deploy Affects Versions: 1.1.1, 1.2.0 Reporter: Daniel Darabos Assignee: Daniel Darabos Priority: Minor Fix For: 1.3.0 We had a situation where we were launching an application with spark-submit, and a file (play.plugins) was on the classpath twice, causing problems (trying to register plugins twice). Upon investigating how it got on the classpath twice, we found that it was present in one of our jars, and also in the current working directory. But the one in the current working directory should not be on the classpath. We never asked spark-submit to put the current directory on the classpath. I think this is caused by a line in [compute-classpath.sh|https://github.com/apache/spark/blob/v1.2.0-rc2/bin/compute-classpath.sh#L28]: {code} CLASSPATH=$SPARK_CLASSPATH:$SPARK_SUBMIT_CLASSPATH {code} Now if SPARK_CLASSPATH is empty, the empty string is added to the classpath, which means the current working directory. We tried setting SPARK_CLASSPATH to a bogus value, but that is [not allowed|https://github.com/apache/spark/blob/v1.2.0-rc2/core/src/main/scala/org/apache/spark/SparkConf.scala#L312]. What is the right solution? Only add SPARK_CLASSPATH if it's non-empty? I can send a pull request for that I think. Thanks! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4147) Reduce log4j dependency
[ https://issues.apache.org/jira/browse/SPARK-4147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14290752#comment-14290752 ] Apache Spark commented on SPARK-4147: - User 'srowen' has created a pull request for this issue: https://github.com/apache/spark/pull/4190 Reduce log4j dependency --- Key: SPARK-4147 URL: https://issues.apache.org/jira/browse/SPARK-4147 Project: Spark Issue Type: Wish Components: Spark Core Affects Versions: 1.1.0 Reporter: Tobias Pfeiffer spark-core has a hard dependency on log4j, which shouldn't be necessary since slf4j is used. I tried to exclude slf4j-log4j12 and log4j dependencies in my sbt file. Excluding org.slf4j.slf4j-log4j12 works fine if logback is on the classpath. However, removing the log4j dependency fails because in https://github.com/apache/spark/blob/v1.1.0/core/src/main/scala/org/apache/spark/Logging.scala#L121 a static method of org.apache.log4j.LogManager is accessed *even if* log4j is not in use. I guess removing all dependencies on log4j may be a bigger task, but it would be a great help if the access to LogManager would be done only if log4j use was detected before. (This is a 2-line change.) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
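Editor's note: the guard being asked for amounts to checking which backend slf4j is bound to before touching {{org.apache.log4j.LogManager}}. A rough sketch of such a check (the actual patch may differ):
{code}
import org.slf4j.LoggerFactory

object Log4jGuardSketch {
  // True only when slf4j is actually bound to log4j (slf4j-log4j12 on the classpath).
  def usingLog4j: Boolean =
    LoggerFactory.getILoggerFactory.getClass.getName == "org.slf4j.impl.Log4jLoggerFactory"

  // Touch org.apache.log4j.LogManager only behind the guard, so users of logback
  // (or any other backend) never exercise the hard log4j dependency at this point.
  def maybeInitializeLog4j(): Unit = {
    if (usingLog4j && !org.apache.log4j.LogManager.getRootLogger.getAllAppenders.hasMoreElements()) {
      // ...install a default console appender here...
    }
  }
}
{code}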
[jira] [Updated] (SPARK-2280) Java Scala reference docs should describe function reference behavior.
[ https://issues.apache.org/jira/browse/SPARK-2280?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-2280: - Priority: Minor (was: Major) Assignee: Sean Owen I'd like to work on this, but would a change to add a bunch of {{@tparam}} in all the RDD classes be welcome, or too much merge noise? It's not hard to describe all of these params. Java Scala reference docs should describe function reference behavior. Key: SPARK-2280 URL: https://issues.apache.org/jira/browse/SPARK-2280 Project: Spark Issue Type: Improvement Components: Documentation Affects Versions: 1.0.0 Reporter: Hans Uhlig Assignee: Sean Owen Priority: Minor Example: {{<K> JavaPairRDD<K,Iterable<T>> groupBy(Function<T,K> f)}} Return an RDD of grouped elements. Each group consists of a key and a sequence of elements mapping to that key. T and K are not described and there is no explanation of what the function's inputs and outputs should be and how GroupBy uses this information. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
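Editor's note: a short usage example shows what the undocumented type parameters mean: {{T}} is the RDD's element type, {{K}} is whatever the supplied function returns, and every element mapping to the same key ends up in one group:
{code}
import org.apache.spark.{SparkConf, SparkContext}

object GroupByTypesExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("groupBy-types").setMaster("local[2]"))
    // T = String (the RDD's element type), K = Int (what the function returns):
    // groupBy(f: T => K) yields RDD[(K, Iterable[T])].
    val words = sc.parallelize(Seq("spark", "scala", "sbt", "yarn"))
    val byLength = words.groupBy(w => w.length)   // RDD[(Int, Iterable[String])]
    byLength.collect().foreach { case (len, ws) => println(s"$len -> ${ws.mkString(", ")}") }
    sc.stop()
  }
}
{code}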
[jira] [Resolved] (SPARK-1960) EOFException when file size 0 exists when use sc.sequenceFile[K,V](path)
[ https://issues.apache.org/jira/browse/SPARK-1960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-1960. -- Resolution: Not a Problem An empty {{SequenceFile}} will still contain some header info. For example when I write an empty one (configured to contain {{LongWritable}}) I get roughly: {code} SEQ^F!org.apache.hadoop.io.LongWritable!org.apache.hadoop.io.LongWritable^A^@*org.apache.hadoop.io.compress.DefaultCodec^@^@^@^@ï9cp84º74K=æÅ3!92^A^F {code} So a zero-length file is indeed a malformed {{SequenceFile}}, and I don't think this is a bug. An error is correct. Reopen if I misunderstand. EOFException when file size 0 exists when use sc.sequenceFile[K,V](path) -- Key: SPARK-1960 URL: https://issues.apache.org/jira/browse/SPARK-1960 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.0 Reporter: Eunsu Yun java.io.EOFException is thrown when using sc.sequenceFile[K,V] if there is a file whose size is 0. I also tested sc.textFile() in the same condition and it does not throw EOFException. val text = sc.sequenceFile[Long, String]("data-gz/*.dat.gz") val result = text.filter(filterValid) result.saveAsTextFile("data-out/") -- java.io.EOFException at java.io.DataInputStream.readFully(DataInputStream.java:197) at java.io.DataInputStream.readFully(DataInputStream.java:169) at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1845) at org.apache.hadoop.io.SequenceFile$Reader.initialize(SequenceFile.java:1810) at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1759) at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1773) at org.apache.hadoop.mapred.SequenceFileRecordReader.init(SequenceFileRecordReader.java:49) at org.apache.hadoop.mapred.SequenceFileInputFormat.getRecordReader(SequenceFileInputFormat.java:64) at org.apache.spark.rdd.HadoopRDD$$anon$1.init(HadoopRDD.scala:156) at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:149) at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:64) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241) at org.apache.spark.rdd.RDD.iterator(RDD.scala:232) at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241) at org.apache.spark.rdd.RDD.iterator(RDD.scala:232) at org.apache.spark.rdd.FilteredRDD.compute(FilteredRDD.scala:33) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241) at org.apache.spark.rdd.RDD.iterator(RDD.scala:232) .. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
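Editor's note: a workaround for this situation is to expand the glob yourself and pass only non-empty files to {{sc.sequenceFile}}. A sketch using the Hadoop FileSystem API (the helper name is made up):
{code}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.SparkContext

object NonEmptySequenceFiles {
  // Expand a glob and keep only files with a non-zero length, then hand the
  // remaining paths to sc.sequenceFile as a comma-separated list.
  def nonEmptyPaths(sc: SparkContext, glob: String): Seq[String] = {
    val conf = new Configuration(sc.hadoopConfiguration)
    val path = new Path(glob)
    val fs = path.getFileSystem(conf)
    val statuses = Option(fs.globStatus(path)).map(_.toSeq).getOrElse(Seq.empty)
    statuses.filter(s => s.isFile && s.getLen > 0).map(_.getPath.toString)
  }
}

// Usage (assumes an existing SparkContext `sc`):
//   val paths = NonEmptySequenceFiles.nonEmptyPaths(sc, "data-gz/*.dat.gz")
//   val text  = sc.sequenceFile[Long, String](paths.mkString(","))
{code}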
[jira] [Commented] (SPARK-5388) Provide a stable application submission gateway
[ https://issues.apache.org/jira/browse/SPARK-5388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14290847#comment-14290847 ] Dale Richardson commented on SPARK-5388: Hi Andrew, I think the idea is well worth considering. In response to the requirement of making it easier for client-master communications to pass through restrictive firewalling, have you considered just using Akka's REST gateway (http://doc.akka.io/docs/akka/1.3.1/scala/http.html)? I also have a question: is there an intention for other entities (such as job servers) to communicate with the master at all? If so then the proposed gateway is semantically defined at a fairly low level (just RPC over JSON/HTTP). This is fine if the interface is not going to be exposed to anybody who is not a spark developer with detailed knowledge of spark internals. Did you use the term “REST” to simply mean RPC over JSON/HTTP? Creating a REST interface is more than an HTTP RPC gateway. If the interface is going to be exposed to 3rd parties (such as developers of Job servers and web notebooks etc.) then there is a benefit to simplifying some of the exposed application semantics, and exposing an API that is more integrated with HTTP’s protocol semantics which most people are already familiar with - this is what a true REST interface does, and if you are defining an endpoint for others to use, it is a very powerful concept that allows other people to quickly grasp how to properly use the exposed interface. A rough sketch of a more “REST”ed version of the API would be: *Submit_driver_request* HTTP POST JSON body of request http://host:port/SparkMaster?SubmitDriver Responds with standard HTTP Response including allocated DRIVER_ID if driver submission ok, http error codes with spark specific error if not. *Get status of DRIVER* HTTP GET http://host:port/SparkMaster/Drivers/DRIVER_ID Responds with JSON body containing information on driver execution. If no record of driver_id, then http error code 404 (Not Found) is returned. *Kill Driver request* HTTP DELETE http://host:port/SparkMaster/Drivers/DRIVER_ID Responds with JSON body containing information on driver kill request, or http error code if an error occurs. I would be happy to prototype something like this up to test the concept out for you if you are looking for something more than just RPC over JSON/HTTP. Provide a stable application submission gateway --- Key: SPARK-5388 URL: https://issues.apache.org/jira/browse/SPARK-5388 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0 Reporter: Andrew Or Assignee: Andrew Or Priority: Blocker Attachments: Stable Spark Standalone Submission.pdf The existing submission gateway in standalone mode is not compatible across Spark versions. If you have a newer version of Spark submitting to an older version of the standalone Master, it is currently not guaranteed to work. The goal is to provide a stable REST interface to replace this channel. The first cut implementation will target standalone cluster mode because there are very few messages exchanged. The design, however, will be general enough to eventually support this for other cluster managers too. Note that this is not necessarily required in YARN because we already use YARN's stable interface to submit applications there. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
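Editor's note: to illustrate how thin a client for the endpoints sketched above would be, here is a hypothetical Scala snippet; the URL layout and JSON shape are the commenter's proposal, not an existing Spark API:
{code}
import java.io.{BufferedReader, InputStreamReader, OutputStreamWriter}
import java.net.{HttpURLConnection, URL}

object RestSubmitSketch {
  // Illustrative client for the proposed endpoints; GET/DELETE on
  // /SparkMaster/Drivers/DRIVER_ID would follow the same pattern.
  def submitDriver(master: String, requestJson: String): String = {
    val conn = new URL(s"$master/SparkMaster?SubmitDriver").openConnection()
      .asInstanceOf[HttpURLConnection]
    conn.setRequestMethod("POST")
    conn.setRequestProperty("Content-Type", "application/json")
    conn.setDoOutput(true)
    val out = new OutputStreamWriter(conn.getOutputStream, "UTF-8")
    try out.write(requestJson) finally out.close()
    val in = new BufferedReader(new InputStreamReader(conn.getInputStream, "UTF-8"))
    try Iterator.continually(in.readLine()).takeWhile(_ != null).mkString("\n")
    finally in.close()
  }
}
{code}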
[jira] [Resolved] (SPARK-4697) System properties should override environment variables
[ https://issues.apache.org/jira/browse/SPARK-4697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-4697. -- Resolution: Fixed Fix Version/s: 1.3.0 This looks like it was fixed in https://github.com/apache/spark/commit/9dea64e53ad8df8a3160c0f4010811af1e73dd6f System properties should override environment variables --- Key: SPARK-4697 URL: https://issues.apache.org/jira/browse/SPARK-4697 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.0.0 Reporter: WangTaoTheTonic Assignee: WangTaoTheTonic Fix For: 1.3.0 I found some arguments in yarn module take environment variables before system properties while the latter override the former in core module. This should be changed in org.apache.spark.deploy.yarn.ClientArguments and org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
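Editor's note: the precedence being asked for is simply: system property if set, otherwise environment variable, otherwise a default. A generic sketch of that ordering (the key names are illustrative):
{code}
object ConfPrecedence {
  // System property first, environment variable second, then a default.
  // Mirrors the precedence this issue asks the YARN argument parsing to follow.
  def resolve(propKey: String, envKey: String, default: String): String =
    sys.props.get(propKey)
      .orElse(sys.env.get(envKey))
      .getOrElse(default)

  def main(args: Array[String]): Unit = {
    val queue = resolve("spark.yarn.queue", "SPARK_YARN_QUEUE", "default")
    println(s"queue = $queue")
  }
}
{code}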
[jira] [Resolved] (SPARK-1029) spark Window shell script errors regarding shell script location reference
[ https://issues.apache.org/jira/browse/SPARK-1029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-1029. -- Resolution: Fixed Fix Version/s: 1.0.0 Looks like this was fixed in https://github.com/apache/spark/commit/4e510b0b0c8a69cfe0ee037b37661caf9bf1d057 for 1.0.0 spark Window shell script errors regarding shell script location reference -- Key: SPARK-1029 URL: https://issues.apache.org/jira/browse/SPARK-1029 Project: Spark Issue Type: Bug Components: Windows Affects Versions: 0.9.0 Reporter: Qiuzhuang Lian Priority: Minor Fix For: 1.0.0 When launching spark-shell.cmd in Windows 7, I got the following errors: E:\projects\amplab\incubator-spark>bin\spark-shell.cmd 'E:\projects\amplab\incubator-spark\bin\..\sbin\spark-class2.cmd' is not recognized as an internal or external command, operable program or batch file. E:\projects\amplab\incubator-spark>bin\spark-shell.cmd 'E:\projects\amplab\incubator-spark\bin\..\sbin\compute-classpath.cmd' is not recognized as an internal or external command, operable program or batch file. Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/spark/repl/Main I am attaching my patches. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4147) Reduce log4j dependency
[ https://issues.apache.org/jira/browse/SPARK-4147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-4147: - Summary: Reduce log4j dependency (was: Remove log4j dependency) Reduce log4j dependency --- Key: SPARK-4147 URL: https://issues.apache.org/jira/browse/SPARK-4147 Project: Spark Issue Type: Wish Components: Spark Core Affects Versions: 1.1.0 Reporter: Tobias Pfeiffer spark-core has a hard dependency on log4j, which shouldn't be necessary since slf4j is used. I tried to exclude slf4j-log4j12 and log4j dependencies in my sbt file. Excluding org.slf4j.slf4j-log4j12 works fine if logback is on the classpath. However, removing the log4j dependency fails because in https://github.com/apache/spark/blob/v1.1.0/core/src/main/scala/org/apache/spark/Logging.scala#L121 a static method of org.apache.log4j.LogManager is accessed *even if* log4j is not in use. I guess removing all dependencies on log4j may be a bigger task, but it would be a great help if the access to LogManager would be done only if log4j use was detected before. (This is a 2-line change.) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4147) Remove log4j dependency
[ https://issues.apache.org/jira/browse/SPARK-4147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14290751#comment-14290751 ] Sean Owen commented on SPARK-4147: -- [~tgpfeiffer] Yeah that's a good change, since all code hits {{Logging}} quickly. It is certainly not the only direct use of log4j, but maybe this actually makes the issue go away for some subset of use cases. I'll make a PR. [~nemccarthy] I don't think it forces log4j on callers, since you can reroute calls to log4j to slf4j. Yes it's extra plumbing. There's not another way to control log levels though, since there is no API for it in slf4j. Remove log4j dependency --- Key: SPARK-4147 URL: https://issues.apache.org/jira/browse/SPARK-4147 Project: Spark Issue Type: Wish Components: Spark Core Affects Versions: 1.1.0 Reporter: Tobias Pfeiffer spark-core has a hard dependency on log4j, which shouldn't be necessary since slf4j is used. I tried to exclude slf4j-log4j12 and log4j dependencies in my sbt file. Excluding org.slf4j.slf4j-log4j12 works fine if logback is on the classpath. However, removing the log4j dependency fails because in https://github.com/apache/spark/blob/v1.1.0/core/src/main/scala/org/apache/spark/Logging.scala#L121 a static method of org.apache.log4j.LogManager is accessed *even if* log4j is not in use. I guess removing all dependencies on log4j may be a bigger task, but it would be a great help if the access to LogManager would be done only if log4j use was detected before. (This is a 2-line change.) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-4491) Using sbt assembly with spark as dep requires Phd in sbt
[ https://issues.apache.org/jira/browse/SPARK-4491?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-4491. -- Resolution: Won't Fix I don't see a reason here that indicates the Spark build should change its behavior. I do agree that this build and its deps are hairy, and harmonizing dependencies within Spark and with an app is often a deep rabbit hole. Using sbt assembly with spark as dep requires Phd in sbt Key: SPARK-4491 URL: https://issues.apache.org/jira/browse/SPARK-4491 Project: Spark Issue Type: Question Reporter: sam I get the dreaded deduplicate error from sbt. I resolved the issue (I think, I managed to run the SimpleApp example) here http://stackoverflow.com/a/27018691/1586965 My question is, is this wise? What is wrong with changing the `deduplicate` bit to `first`. Why isn't it this by default? If this isn't the way to make it work, please could someone provide an explanation of the correct way with .sbt examples. Having googled, every example I see is different because it changes depending on what deps the person has ... surely there has to be an automagic way of doing it (if my way isn't it)? One final point, SBT seems to be blaming Spark for causing the problem in their documentation: https://github.com/sbt/sbt-assembly is this fair? Is Spark doing something wrong in the way they build their jars? Or should SBT be renamed to CBT (Complicated Build Tool that will make you need Cognitive Behavioural Therapy after use). NOTE: Satire JFF, really I love both SBT Spark :) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
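Editor's note: the usual way to resolve these conflicts without globally switching everything to {{first}} is to mark Spark as a {{provided}} dependency (so it is not bundled into the assembly at all) and give sbt-assembly an explicit merge strategy for whatever overlap remains. A sketch of a {{build.sbt}} fragment, assuming a recent sbt-assembly plugin (key names vary across plugin versions):
{code}
// build.sbt fragment (sketch). Marking Spark "provided" keeps it out of the fat jar,
// which avoids most deduplicate errors; the merge strategy handles remaining overlap.
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.2.0" % "provided"

assemblyMergeStrategy in assembly := {
  case PathList("META-INF", xs @ _*) => MergeStrategy.discard
  case "play.plugins"                => MergeStrategy.concat   // concatenable resource file
  case x =>
    val oldStrategy = (assemblyMergeStrategy in assembly).value
    oldStrategy(x)
}
{code}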
[jira] [Created] (SPARK-5398) Support the eu-central-1 region for spark-ec2
Nicholas Chammas created SPARK-5398: --- Summary: Support the eu-central-1 region for spark-ec2 Key: SPARK-5398 URL: https://issues.apache.org/jira/browse/SPARK-5398 Project: Spark Issue Type: Improvement Components: EC2 Reporter: Nicholas Chammas Priority: Minor {{spark-ec2}} [doesn't currently support|https://github.com/mesos/spark-ec2/tree/branch-1.3/ami-list] the {{eu-central-1}} region. You can see the [full list of EC2 regions here|http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-regions-availability-zones.html]. {{eu-central-1}} is the only one missing as of Jan 2015. ({{cn-north-1}}, for some reason, is not listed there.) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5399) tree Losses strings should match loss names
Joseph K. Bradley created SPARK-5399: Summary: tree Losses strings should match loss names Key: SPARK-5399 URL: https://issues.apache.org/jira/browse/SPARK-5399 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.2.0, 1.2.1 Reporter: Joseph K. Bradley Priority: Minor tree.loss.Losses.fromString expects certain String names for losses. These do not match the names of the loss classes but should. I believe these strings were the original names of the losses, and we forgot to correct the strings when we renamed the losses. Currently: {code} case "leastSquaresError" => SquaredError case "leastAbsoluteError" => AbsoluteError case "logLoss" => LogLoss {code} Proposed: {code} case "SquaredError" => SquaredError case "AbsoluteError" => AbsoluteError case "LogLoss" => LogLoss {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
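Editor's note: if backward compatibility matters, the rename could also accept the old strings as aliases. A self-contained sketch of that option (whether Spark keeps the old names is a separate decision, and the types below are defined locally for illustration):
{code}
object LossesSketch {
  sealed trait Loss
  case object SquaredError extends Loss
  case object AbsoluteError extends Loss
  case object LogLoss extends Loss

  // New names preferred, old names still recognized.
  def fromString(name: String): Loss = name match {
    case "SquaredError"  | "leastSquaresError"  => SquaredError
    case "AbsoluteError" | "leastAbsoluteError" => AbsoluteError
    case "LogLoss"       | "logLoss"            => LogLoss
    case _ => throw new IllegalArgumentException(s"Did not recognize loss name: $name")
  }
}
{code}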
[jira] [Updated] (SPARK-3359) `sbt/sbt unidoc` doesn't work with Java 8
[ https://issues.apache.org/jira/browse/SPARK-3359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-3359: - Assignee: (was: Sean Owen) I spent more time on this tonight, mostly looking at the {{genjavadoc}} code, and I don't think this can be made to work, not without both touching up hundreds of scaladoc comments, and overhauling {{genjavadoc}}. The rough translation it does works for javadoc 7, but not nearly for the stricter javadoc 8. It wouldn't be a matter of small fixes. Realistically I'd suggest using javadoc 7, or, altering the doc generation to produce javadoc and scaladoc separately rather than try to get unidoc to work. In the meantime I can submit a PR with a number of small fixes that at least resolve more javadoc 8 errors. `sbt/sbt unidoc` doesn't work with Java 8 - Key: SPARK-3359 URL: https://issues.apache.org/jira/browse/SPARK-3359 Project: Spark Issue Type: Bug Components: Documentation Affects Versions: 1.1.0 Reporter: Xiangrui Meng Priority: Minor It seems that Java 8 is stricter on JavaDoc. I got many error messages like {code} [error] /Users/meng/src/spark-mengxr/core/target/java/org/apache/hadoop/mapred/SparkHadoopMapRedUtil.java:2: error: modifier private not allowed here [error] private abstract interface SparkHadoopMapRedUtil { [error] ^ {code} This is minor because we can always use Java 6/7 to generate the doc. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5400) Rename GaussianMixtureEM to GaussianMixture
[ https://issues.apache.org/jira/browse/SPARK-5400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14290971#comment-14290971 ] Travis Galoppo commented on SPARK-5400: --- Hmm. This has me thinking in a different direction. We could generalize the expectation-maximization algorithm to work with any mixture model supporting a set of necessary likelihood compute/update methods... then we could ask for, e.g., new ExpectationMaximization[GaussianMixtureModel]. This would de-couple the model and the algorithm, and could open the door for the implementation to be applied to (for instance) tomographic image reconstruction (which seems like a great fit for Spark given the volume of data involved). Rename GaussianMixtureEM to GaussianMixture --- Key: SPARK-5400 URL: https://issues.apache.org/jira/browse/SPARK-5400 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Priority: Minor GaussianMixtureEM is following the old naming convention of including the optimization algorithm name in the class title. We should probably rename it to GaussianMixture so that it can use other optimization algorithms in the future. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
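Editor's note: a rough sketch of the decoupling described above: the EM driver only needs an E-step, an M-step, and a log-likelihood from the model type. All names below are hypothetical, not an MLlib API:
{code}
// Rough sketch of the idea: the EM loop depends only on an abstract "mixture model"
// that can score responsibilities (E-step) and re-estimate itself from them (M-step).
trait EMModel[Model, Datum] {
  def expectation(model: Model, data: Seq[Datum]): Seq[Array[Double]]  // responsibilities
  def maximization(data: Seq[Datum], responsibilities: Seq[Array[Double]]): Model
  def logLikelihood(model: Model, data: Seq[Datum]): Double
}

class ExpectationMaximization[Model, Datum](ops: EMModel[Model, Datum],
                                            maxIterations: Int = 100,
                                            tol: Double = 1e-3) {
  def run(initial: Model, data: Seq[Datum]): Model = {
    var model = initial
    var previous = ops.logLikelihood(model, data)
    var iter = 0
    var converged = false
    while (iter < maxIterations && !converged) {
      val resp = ops.expectation(model, data)   // E-step
      model = ops.maximization(data, resp)      // M-step
      val current = ops.logLikelihood(model, data)
      converged = math.abs(current - previous) < tol
      previous = current
      iter += 1
    }
    model
  }
}
{code}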
[jira] [Commented] (SPARK-5401) Executor ID should be set before MetricsSystem is created
[ https://issues.apache.org/jira/browse/SPARK-5401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14291007#comment-14291007 ] Apache Spark commented on SPARK-5401: - User 'ryan-williams' has created a pull request for this issue: https://github.com/apache/spark/pull/4194 Executor ID should be set before MetricsSystem is created - Key: SPARK-5401 URL: https://issues.apache.org/jira/browse/SPARK-5401 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0 Reporter: Ryan Williams MetricsSystem construction [attempts to namespace metrics from each executor using that executor's ID|https://github.com/apache/spark/blob/0d1e67ee9b29b51bccfc8a319afe9f9b4581afc8/core/src/main/scala/org/apache/spark/metrics/MetricsSystem.scala#L131]. The ID is [currently set at Executor construction time|https://github.com/apache/spark/blob/0d1e67ee9b29b51bccfc8a319afe9f9b4581afc8/core/src/main/scala/org/apache/spark/executor/Executor.scala#L76-L79] (uncoincidentally, just before the {{ExecutorSource}} is registered), but this is after the {{MetricsSystem}} has been initialized (which [happens during {{SparkEnv}} construction|https://github.com/apache/spark/blob/0d1e67ee9b29b51bccfc8a319afe9f9b4581afc8/core/src/main/scala/org/apache/spark/SparkEnv.scala#L323-L332], which itself happens during {{ExecutorBackend}} construction, *before* {{Executor}} construction). I noticed this problem because I wasn't seeing any JVM metrics from my executors in a Graphite dashboard I've set up; turns out all the executors (and the driver) were namespacing their metrics under driver, and Graphite responds to such a situation by only taking the last value it receives for each metric within a configurable time window (e.g. 10s). I was seeing per-executor metrics, properly namespaced with each executor's ID, from {{ExecutorSource}}, which as I mentioned above is registered after the executor ID is set. I have a one-line fix for this that I will submit shortly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5402) Log executor ID at executor-construction time
[ https://issues.apache.org/jira/browse/SPARK-5402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14291010#comment-14291010 ] Apache Spark commented on SPARK-5402: - User 'ryan-williams' has created a pull request for this issue: https://github.com/apache/spark/pull/4195 Log executor ID at executor-construction time - Key: SPARK-5402 URL: https://issues.apache.org/jira/browse/SPARK-5402 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.2.0 Reporter: Ryan Williams Priority: Minor One stumbling block I've hit while debugging Spark-on-YARN jobs is that {{yarn logs}} presents each executor's stderr/stdout by container name, but I often need to find the logs for a specific executor ID; the executor ID isn't printed anywhere convenient in each executor's logs, afaict. I added a simple {{logInfo}} to {{Executor.scala}} locally and it's been useful, so I'd like to merge it upstream. PR forthcoming. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5401) Executor ID should be set before MetricsSystem is created
[ https://issues.apache.org/jira/browse/SPARK-5401?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan Williams updated SPARK-5401: - Description: MetricsSystem construction [attempts to namespace metrics from each executor using that executor's ID|https://github.com/apache/spark/blob/0d1e67ee9b29b51bccfc8a319afe9f9b4581afc8/core/src/main/scala/org/apache/spark/metrics/MetricsSystem.scala#L131]. The ID is [currently set at Executor construction time|https://github.com/apache/spark/blob/0d1e67ee9b29b51bccfc8a319afe9f9b4581afc8/core/src/main/scala/org/apache/spark/executor/Executor.scala#L76-L79] (uncoincidentally, just before the {{ExecutorSource}} is registered), but this is after the {{MetricsSystem}} has been initialized (which [happens during {{SparkEnv}} construction|https://github.com/apache/spark/blob/0d1e67ee9b29b51bccfc8a319afe9f9b4581afc8/core/src/main/scala/org/apache/spark/SparkEnv.scala#L323-L332], which itself happens during {{ExecutorBackend}} construction, *before* {{Executor}} construction). I noticed this problem because I wasn't seeing any JVM metrics from my executors in a Graphite dashboard I've set up; turns out all the executors (and the driver) were namespacing their metrics under driver, and Graphite responds to such a situation by only taking the last value it receives for each metric within a configurable time window (e.g. 10s). I was seeing per-executor metrics, properly namespaced with each executor's ID, from {{ExecutorSource}}, which as I mentioned above is registered after the executor ID is set. I have a one-line fix for this that I will submit shortly. was: MetricsSystem construction [attempts to namespace metrics from each executor using that executor's ID|https://github.com/apache/spark/blob/0d1e67ee9b29b51bccfc8a319afe9f9b4581afc8/core/src/main/scala/org/apache/spark/metrics/MetricsSystem.scala#L131]. The ID is [currently set at Executor construction time|https://github.com/apache/spark/blob/0d1e67ee9b29b51bccfc8a319afe9f9b4581afc8/core/src/main/scala/org/apache/spark/executor/Executor.scala#L76-L79] (uncoincidentally, just before the `ExecutorSource` is registered), but this is after the `MetricsSystem` has been initialized (which [happens during `SparkEnv` construction|https://github.com/apache/spark/blob/0d1e67ee9b29b51bccfc8a319afe9f9b4581afc8/core/src/main/scala/org/apache/spark/SparkEnv.scala#L323-L332], which itself happens during `ExecutorBackend` construction, *before* `Executor` construction). I noticed this problem because I wasn't seeing any JVM metrics from my executors in a Graphite dashboard I've set up; turns out all the executors (and the driver) were namespacing their metrics under driver, and Graphite responds to such a situation by only taking the last value it receives for each metric within a configurable time window (e.g. 10s). I was seeing per-executor metrics, properly namespaced with each executor's ID, from `ExecutorSource`, which as I mentioned above is registered after the executor ID is set. I have a one-line fix for this that I will submit shortly. 
Executor ID should be set before MetricsSystem is created - Key: SPARK-5401 URL: https://issues.apache.org/jira/browse/SPARK-5401 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0 Reporter: Ryan Williams MetricsSystem construction [attempts to namespace metrics from each executor using that executor's ID|https://github.com/apache/spark/blob/0d1e67ee9b29b51bccfc8a319afe9f9b4581afc8/core/src/main/scala/org/apache/spark/metrics/MetricsSystem.scala#L131]. The ID is [currently set at Executor construction time|https://github.com/apache/spark/blob/0d1e67ee9b29b51bccfc8a319afe9f9b4581afc8/core/src/main/scala/org/apache/spark/executor/Executor.scala#L76-L79] (uncoincidentally, just before the {{ExecutorSource}} is registered), but this is after the {{MetricsSystem}} has been initialized (which [happens during {{SparkEnv}} construction|https://github.com/apache/spark/blob/0d1e67ee9b29b51bccfc8a319afe9f9b4581afc8/core/src/main/scala/org/apache/spark/SparkEnv.scala#L323-L332], which itself happens during {{ExecutorBackend}} construction, *before* {{Executor}} construction). I noticed this problem because I wasn't seeing any JVM metrics from my executors in a Graphite dashboard I've set up; turns out all the executors (and the driver) were namespacing their metrics under driver, and Graphite responds to such a situation by only taking the last value it receives for each metric within a configurable time window (e.g. 10s). I was seeing per-executor metrics, properly namespaced with each executor's ID, from {{ExecutorSource}}, which as I mentioned above is registered after the executor ID is set. I
[jira] [Created] (SPARK-5402) Log executor ID at executor-construction time
Ryan Williams created SPARK-5402: Summary: Log executor ID at executor-construction time Key: SPARK-5402 URL: https://issues.apache.org/jira/browse/SPARK-5402 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.2.0 Reporter: Ryan Williams Priority: Minor One stumbling block I've hit while debugging Spark-on-YARN jobs is that {{yarn logs}} presents each executor's stderr/stdout by container name, but I often need to find the logs for a specific executor ID; the executor ID isn't printed anywhere convenient in each executor's logs, afaict. I added a simple {{logInfo}} to {{Executor.scala}} locally and it's been useful, so I'd like to merge it upstream. PR forthcoming. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
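Editor's note: the change is just a construction-time log line tying the executor ID to the container whose stdout/stderr {{yarn logs}} shows; something along these lines (the exact wording in the PR may differ):
{code}
import org.slf4j.LoggerFactory

// Illustrative only: log the executor ID as soon as the executor is constructed,
// so it appears near the top of that container's log.
class ExecutorIdLogExample(executorId: String, executorHostname: String) {
  private val log = LoggerFactory.getLogger(getClass)
  log.info(s"Starting executor ID $executorId on host $executorHostname")
}
{code}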
[jira] [Resolved] (SPARK-5235) Determine serializability of SQLContext
[ https://issues.apache.org/jira/browse/SPARK-5235?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-5235. -- Resolution: Fixed Fix Version/s: 1.3.0 This was merged in https://github.com/apache/spark/commit/2fd7f72b6b0b24bec12331c7bbbcf6bfc265d2ec Determine serializability of SQLContext --- Key: SPARK-5235 URL: https://issues.apache.org/jira/browse/SPARK-5235 Project: Spark Issue Type: Sub-task Reporter: Alex Baretta Fix For: 1.3.0 The SQLConf field in SQLContext is neither Serializable nor transient. Here's the stack trace I get when running SQL queries against a Parquet file. {code} Exception in thread Thread-43 org.apache.spark.SparkException: Job aborted due to stage failure: Task not serializable: java.io.NotSerializableException: org.apache.spark.sql.SQLConf at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1195) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1184) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1183) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1183) at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitMissingTasks(DAGScheduler.scala:843) at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:779) at org.apache.spark.scheduler.DAGScheduler.handleJobSubmitted(DAGScheduler.scala:763) at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1364) at akka.actor.Actor$class.aroundReceive(Actor.scala:465) at org.apache.spark.scheduler.DAGSchedulerEventProcessActor.aroundReceive(DAGScheduler.scala:1356) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516) at akka.actor.ActorCell.invoke(ActorCell.scala:487) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238) at akka.dispatch.Mailbox.run(Mailbox.scala:220) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
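Editor's note: for context on the choice being deferred to 1.3, the two generic options for a field like {{SQLConf}} are to make the held object {{Serializable}} or to mark the field {{@transient}} and rebuild it per JVM. An illustrative sketch of both patterns, not the actual SQLContext code:
{code}
// Illustrative only -- two generic ways to keep a member from breaking closure serialization.
class Conf extends Serializable {                 // 1) make the held object Serializable
  var settings: Map[String, String] = Map.empty
}

class Context extends Serializable {
  val conf = new Conf
  // 2) mark non-serializable state @transient and rebuild it lazily in each JVM
  @transient lazy val scratch = new StringBuilder
}
{code}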
[jira] [Commented] (SPARK-4452) Shuffle data structures can starve others on the same thread for memory
[ https://issues.apache.org/jira/browse/SPARK-4452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14290874#comment-14290874 ] Sean Owen commented on SPARK-4452: -- Can this JIRA be resolved now that its children are resolved, or is the more to this one? Shuffle data structures can starve others on the same thread for memory Key: SPARK-4452 URL: https://issues.apache.org/jira/browse/SPARK-4452 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.0 Reporter: Tianshuo Deng Assignee: Tianshuo Deng Priority: Critical When an Aggregator is used with ExternalSorter in a task, spark will create many small files and could cause too many files open error during merging. Currently, ShuffleMemoryManager does not work well when there are 2 spillable objects in a thread, which are ExternalSorter and ExternalAppendOnlyMap(used by Aggregator) in this case. Here is an example: Due to the usage of mapside aggregation, ExternalAppendOnlyMap is created first to read the RDD. It may ask as much memory as it can, which is totalMem/numberOfThreads. Then later on when ExternalSorter is created in the same thread, the ShuffleMemoryManager could refuse to allocate more memory to it, since the memory is already given to the previous requested object(ExternalAppendOnlyMap). That causes the ExternalSorter keeps spilling small files(due to the lack of memory) I'm currently working on a PR to address these two issues. It will include following changes: 1. The ShuffleMemoryManager should not only track the memory usage for each thread, but also the object who holds the memory 2. The ShuffleMemoryManager should be able to trigger the spilling of a spillable object. In this way, if a new object in a thread is requesting memory, the old occupant could be evicted/spilled. Previously the spillable objects trigger spilling by themselves. So one may not trigger spilling even if another object in the same thread needs more memory. After this change The ShuffleMemoryManager could trigger the spilling of an object if it needs to. 3. Make the iterator of ExternalAppendOnlyMap spillable. Previously ExternalAppendOnlyMap returns an destructive iterator and can not be spilled after the iterator is returned. This should be changed so that even after the iterator is returned, the ShuffleMemoryManager can still spill it. Currently, I have a working branch in progress: https://github.com/tsdeng/spark/tree/enhance_memory_manager. Already made change 3 and have a prototype of change 1 and 2 to evict spillable from memory manager, still in progress. I will send a PR when it's done. Any feedback or thoughts on this change is highly appreciated ! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4452) Shuffle data structures can starve others on the same thread for memory
[ https://issues.apache.org/jira/browse/SPARK-4452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14290885#comment-14290885 ] Sandy Ryza commented on SPARK-4452: --- I think there's more to this one, the subtasks solved the most egregious issues, but shuffle data structures can still hog memory in detrimental ways described in some of the comments above. Shuffle data structures can starve others on the same thread for memory Key: SPARK-4452 URL: https://issues.apache.org/jira/browse/SPARK-4452 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.0 Reporter: Tianshuo Deng Assignee: Tianshuo Deng Priority: Critical When an Aggregator is used with ExternalSorter in a task, spark will create many small files and could cause too many files open error during merging. Currently, ShuffleMemoryManager does not work well when there are 2 spillable objects in a thread, which are ExternalSorter and ExternalAppendOnlyMap(used by Aggregator) in this case. Here is an example: Due to the usage of mapside aggregation, ExternalAppendOnlyMap is created first to read the RDD. It may ask as much memory as it can, which is totalMem/numberOfThreads. Then later on when ExternalSorter is created in the same thread, the ShuffleMemoryManager could refuse to allocate more memory to it, since the memory is already given to the previous requested object(ExternalAppendOnlyMap). That causes the ExternalSorter keeps spilling small files(due to the lack of memory) I'm currently working on a PR to address these two issues. It will include following changes: 1. The ShuffleMemoryManager should not only track the memory usage for each thread, but also the object who holds the memory 2. The ShuffleMemoryManager should be able to trigger the spilling of a spillable object. In this way, if a new object in a thread is requesting memory, the old occupant could be evicted/spilled. Previously the spillable objects trigger spilling by themselves. So one may not trigger spilling even if another object in the same thread needs more memory. After this change The ShuffleMemoryManager could trigger the spilling of an object if it needs to. 3. Make the iterator of ExternalAppendOnlyMap spillable. Previously ExternalAppendOnlyMap returns an destructive iterator and can not be spilled after the iterator is returned. This should be changed so that even after the iterator is returned, the ShuffleMemoryManager can still spill it. Currently, I have a working branch in progress: https://github.com/tsdeng/spark/tree/enhance_memory_manager. Already made change 3 and have a prototype of change 1 and 2 to evict spillable from memory manager, still in progress. I will send a PR when it's done. Any feedback or thoughts on this change is highly appreciated ! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3359) `sbt/sbt unidoc` doesn't work with Java 8
[ https://issues.apache.org/jira/browse/SPARK-3359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14290934#comment-14290934 ] Apache Spark commented on SPARK-3359: - User 'srowen' has created a pull request for this issue: https://github.com/apache/spark/pull/4193 `sbt/sbt unidoc` doesn't work with Java 8 - Key: SPARK-3359 URL: https://issues.apache.org/jira/browse/SPARK-3359 Project: Spark Issue Type: Bug Components: Documentation Affects Versions: 1.1.0 Reporter: Xiangrui Meng Priority: Minor It seems that Java 8 is stricter on JavaDoc. I got many error messages like {code} [error] /Users/meng/src/spark-mengxr/core/target/java/org/apache/hadoop/mapred/SparkHadoopMapRedUtil.java:2: error: modifier private not allowed here [error] private abstract interface SparkHadoopMapRedUtil { [error] ^ {code} This is minor because we can always use Java 6/7 to generate the doc. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4105) FAILED_TO_UNCOMPRESS(5) errors when fetching shuffle data with sort-based shuffle
[ https://issues.apache.org/jira/browse/SPARK-4105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14290961#comment-14290961 ] ding commented on SPARK-4105: - I hit this error when using pagerank(It cannot be consistent repro as I only hit once). I am not using the KryoSerializer but I am using the default serializer. The Spark code is get from chunk at 2015/1/19 which should be later than spark 1.2.0. 15/01/23 23:32:57 WARN scheduler.TaskSetManager: Lost task 347.0 in stage 9461.0 (TID 302687, sr213): FetchFailed(BlockManagerId(13, sr207, 49805), shuffleId=399, mapId=461, reduceId=347, message= org.apache.spark.shuffle.FetchFailedException: FAILED_TO_UNCOMPRESS(5) at org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$.org$apache$spark$shuffle$hash$BlockStoreShuffleFetcher$$unpackBlock$1(BlockStoreShuffleFetcher.scala:67) at org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$$anonfun$3.apply(BlockStoreShuffleFetcher.scala:83) at org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$$anonfun$3.apply(BlockStoreShuffleFetcher.scala:83) at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) at org.apache.spark.graphx.impl.VertexPartitionBaseOps.aggregateUsingIndex(VertexPartitionBaseOps.scala:207) at org.apache.spark.graphx.impl.VertexRDDImpl$$anonfun$5$$anonfun$apply$4.apply(VertexRDDImpl.scala:171) at org.apache.spark.graphx.impl.VertexRDDImpl$$anonfun$5$$anonfun$apply$4.apply(VertexRDDImpl.scala:171) at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) at org.apache.spark.graphx.impl.VertexRDDImpl$$anonfun$3.apply(VertexRDDImpl.scala:113) at org.apache.spark.graphx.impl.VertexRDDImpl$$anonfun$3.apply(VertexRDDImpl.scala:111) at org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:88) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:264) at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:65) at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) at org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:88) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:264) at org.apache.spark.rdd.RDD.iterator(RDD.scala:231) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:264) at org.apache.spark.rdd.RDD.iterator(RDD.scala:231) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) at org.apache.spark.scheduler.Task.run(Task.scala:64) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:192) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Caused by: java.io.IOException: FAILED_TO_UNCOMPRESS(5) at org.xerial.snappy.SnappyNative.throw_error(SnappyNative.java:84) at org.xerial.snappy.SnappyNative.rawUncompress(Native Method) at org.xerial.snappy.Snappy.rawUncompress(Snappy.java:444) at org.xerial.snappy.Snappy.uncompress(Snappy.java:480) at 
org.xerial.snappy.SnappyInputStream.readFully(SnappyInputStream.java:135) at org.xerial.snappy.SnappyInputStream.readHeader(SnappyInputStream.java:92) at org.xerial.snappy.SnappyInputStream.init(SnappyInputStream.java:58) at org.apache.spark.io.SnappyCompressionCodec.compressedInputStream(CompressionCodec.scala:143) at org.apache.spark.storage.BlockManager.wrapForCompression(BlockManager.scala:1165) at org.apache.spark.storage.ShuffleBlockFetcherIterator$$anonfun$4.apply(ShuffleBlockFetcherIterator.scala:300) at org.apache.spark.storage.ShuffleBlockFetcherIterator$$anonfun$4.apply(ShuffleBlockFetcherIterator.scala:299) at scala.util.Success$$anonfun$map$1.apply(Try.scala:206) at scala.util.Try$.apply(Try.scala:161) at scala.util.Success.map(Try.scala:206) at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:299) at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:53) FAILED_TO_UNCOMPRESS(5) errors when fetching shuffle data with sort-based shuffle - Key: SPARK-4105 URL:
[jira] [Commented] (SPARK-3489) support rdd.zip(rdd1, rdd2,...) with variable number of rdds as params
[ https://issues.apache.org/jira/browse/SPARK-3489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14290967#comment-14290967 ] Mohit Jaggi commented on SPARK-3489: pull request does exist here: https://github.com/apache/spark/pull/2429 use case example: https://github.com/AyasdiOpenSource/bigdf/blob/master/src/main/scala/com/ayasdi/bigdf/DFUtil.scala#L86 support rdd.zip(rdd1, rdd2,...) with variable number of rdds as params -- Key: SPARK-3489 URL: https://issues.apache.org/jira/browse/SPARK-3489 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.0.2 Reporter: Mohit Jaggi Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
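Editor's note: with the current API the same effect can be approximated for RDDs sharing one element type by folding pairwise {{zip}} calls; the linked PR takes a different route, but a small sketch shows the intended semantics:
{code}
import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD

object MultiZip {
  // Sketch of a variadic zip for RDDs of one element type: fold the extra RDDs in
  // with pairwise zip, accumulating each row as a Seq[T]. As with RDD.zip itself,
  // all RDDs must have the same number of partitions and elements per partition.
  def zipAll[T: ClassTag](first: RDD[T], rest: RDD[T]*): RDD[Seq[T]] =
    rest.foldLeft(first.map(Seq(_))) { (acc, next) =>
      acc.zip(next).map { case (row, elem) => row :+ elem }
    }
}

// Usage (assumes an existing SparkContext `sc`):
//   val a = sc.parallelize(1 to 4, 2); val b = a.map(_ * 10); val c = a.map(_ * 100)
//   MultiZip.zipAll(a, b, c).collect()   // Array(Seq(1,10,100), Seq(2,20,200), ...)
{code}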
[jira] [Resolved] (SPARK-4642) Documents about running-on-YARN needs update
[ https://issues.apache.org/jira/browse/SPARK-4642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-4642. -- Resolution: Fixed Fix Version/s: 1.2.1 1.1.2 1.3.0 Assignee: Masayoshi TSUZUKI Looks like this went in to master, and branch 1.2 / 1.1: https://github.com/apache/spark/commit/692f49378f7d384d5c9c5ab7451a1c1e66f91c50 Documents about running-on-YARN needs update Key: SPARK-4642 URL: https://issues.apache.org/jira/browse/SPARK-4642 Project: Spark Issue Type: Improvement Components: Documentation Affects Versions: 1.1.0 Reporter: Masayoshi TSUZUKI Assignee: Masayoshi TSUZUKI Priority: Minor Fix For: 1.3.0, 1.1.2, 1.2.1 Documents about running-on-YARN needs update There are some parameters missing in the document about running-on-YARN page. We need to add the descriptions about the following parameters: - spark.yarn.report.interval - spark.yarn.queue - spark.yarn.user.classpath.first - spark.yarn.scheduler.reporterThread.maxFailures And the description about this default parameter is not strictly accurate: - spark.yarn.submit.file.replication -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-5028) Add total received and processed records metrics to Streaming UI
[ https://issues.apache.org/jira/browse/SPARK-5028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-5028. -- Resolution: Fixed Fix Version/s: 1.3.0 Assignee: Saisai Shao Also one that was merged already: https://github.com/apache/spark/commit/fdc2aa4918fd4c510f04812b782cc0bfef9a2107 Add total received and processed records metrics to Streaming UI Key: SPARK-5028 URL: https://issues.apache.org/jira/browse/SPARK-5028 Project: Spark Issue Type: Improvement Components: Streaming Affects Versions: 1.2.0 Reporter: Saisai Shao Assignee: Saisai Shao Fix For: 1.3.0 Followed by SPARK-4537 to add total received records and total processed records in Streaming web ui. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-5235) Determine serializability of SQLContext
[ https://issues.apache.org/jira/browse/SPARK-5235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14290905#comment-14290905 ] Reynold Xin edited comment on SPARK-5235 at 1/25/15 1:00 AM: - Sean - this was not done. We merged a patch to make it serializable again, but for 1.3 we should decide whether we want it to be serializable for real. was (Author: rxin): Sean - this was not done. We merged a patch to make it serializable again, but for 1.3 we should decide whether we wanted to be serializable for real. Determine serializability of SQLContext --- Key: SPARK-5235 URL: https://issues.apache.org/jira/browse/SPARK-5235 Project: Spark Issue Type: Sub-task Reporter: Alex Baretta Fix For: 1.3.0 The SQLConf field in SQLContext is neither Serializable nor transient. Here's the stack trace I get when running SQL queries against a Parquet file. {code} Exception in thread Thread-43 org.apache.spark.SparkException: Job aborted due to stage failure: Task not serializable: java.io.NotSerializableException: org.apache.spark.sql.SQLConf at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1195) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1184) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1183) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1183) at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitMissingTasks(DAGScheduler.scala:843) at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:779) at org.apache.spark.scheduler.DAGScheduler.handleJobSubmitted(DAGScheduler.scala:763) at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1364) at akka.actor.Actor$class.aroundReceive(Actor.scala:465) at org.apache.spark.scheduler.DAGSchedulerEventProcessActor.aroundReceive(DAGScheduler.scala:1356) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516) at akka.actor.ActorCell.invoke(ActorCell.scala:487) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238) at akka.dispatch.Mailbox.run(Mailbox.scala:220) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
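For readers of this ticket, the underlying problem and the two usual ways around it can be sketched in a few lines of Scala. The {{MyConf}}/{{MyContext}} names below are made up and are not Spark's actual classes; which of the two options SQLContext should commit to is exactly the open question here.
{code}
// A hand-rolled illustration, not Spark code: a context that is pulled into a
// task closure fails to serialize if it holds a non-serializable, non-transient field.
class MyConf {                          // not Serializable, like SQLConf in the stack trace
  var shufflePartitions: Int = 200
}

class MyContext extends Serializable {
  // Option 1: mark the field @transient so it is dropped during serialization
  // (it must then be re-created or tolerated as null on the executor side).
  @transient private val conf: MyConf = new MyConf

  // Option 2 (alternative): make MyConf itself extend Serializable, so the
  // whole context can be shipped as-is.
}
{code}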
[jira] [Commented] (SPARK-3298) [SQL] registerAsTable / registerTempTable overwrites old tables
[ https://issues.apache.org/jira/browse/SPARK-3298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14290962#comment-14290962 ] Imran Rashid commented on SPARK-3298: - If {{allowOverwrite}} defaulted to {{true}}, wouldn't that be closer to keeping the existing behavior, but still allow someone to request the check if they wanted it? Maybe I don't properly understand the current behavior, but it seems like it will effectively uncache the existing table and create a new one (even if the uncaching is happening later by the context cleaner). [SQL] registerAsTable / registerTempTable overwrites old tables --- Key: SPARK-3298 URL: https://issues.apache.org/jira/browse/SPARK-3298 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.0.2 Reporter: Evan Chan Priority: Minor Labels: newbie At least in Spark 1.0.2, calling registerAsTable(a) when a had been registered before does not cause an error. However, there is no way to access the old table, even though it may be cached and taking up space. How about at least throwing an error? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
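A minimal sketch of the {{allowOverwrite}} idea being discussed, with the default of {{true}} Imran suggests so that today's behavior is preserved. The {{TempTableCatalog}} class and its fields below are hypothetical, not Spark's actual catalog code:
{code}
import scala.collection.mutable

class TempTableCatalog {
  private val tables = mutable.Map.empty[String, AnyRef]   // table name -> logical plan

  // Default allowOverwrite = true keeps the current silent-overwrite semantics;
  // passing false opts in to an error when the name is already registered.
  def registerTempTable(name: String, plan: AnyRef, allowOverwrite: Boolean = true): Unit = {
    if (!allowOverwrite && tables.contains(name)) {
      throw new IllegalArgumentException(s"Temporary table '$name' is already registered")
    }
    tables(name) = plan
  }
}
{code}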
[jira] [Resolved] (SPARK-4934) Connection key is hard to read
[ https://issues.apache.org/jira/browse/SPARK-4934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hong Shen resolved SPARK-4934. -- Resolution: Not a Problem Connection key is hard to read -- Key: SPARK-4934 URL: https://issues.apache.org/jira/browse/SPARK-4934 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.1.1 Reporter: Hong Shen When I run a big Spark job, the executors produce a lot of log messages like: 14/12/23 15:25:31 INFO network.ConnectionManager: key already cancelled ? sun.nio.ch.SelectionKeyImpl@52b0e278 java.nio.channels.CancelledKeyException at org.apache.spark.network.ConnectionManager.run(ConnectionManager.scala:310) at org.apache.spark.network.ConnectionManager$$anon$4.run(ConnectionManager.scala:139) It's hard to know which connection was cancelled. Maybe we can change it to logInfo("Connection already cancelled ? " + con.getRemoteAddress(), e) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-5038) Add explicit return type for all implicit functions
[ https://issues.apache.org/jira/browse/SPARK-5038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-5038. -- Resolution: Fixed Looks like this was merged in https://github.com/apache/spark/commit/c88a3d7fca20d36ee566d48e0cb91fe33a7a6d99 and https://github.com/apache/spark/commit/7749dd6c36a182478b20f4636734c8db0b7ddb00 Add explicit return type for all implicit functions --- Key: SPARK-5038 URL: https://issues.apache.org/jira/browse/SPARK-5038 Project: Spark Issue Type: Bug Reporter: Reynold Xin Assignee: Reynold Xin Priority: Critical Fix For: 1.3.0 As we learned in https://github.com/apache/spark/pull/3580, not explicitly typing implicit functions can lead to compiler bugs and potentially unexpected runtime behavior. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
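To illustrate the convention this ticket enforces (made-up names, not code from the linked commits): an implicit conversion with an inferred return type versus one with the return type declared explicitly.
{code}
import scala.language.implicitConversions

object ImplicitStyle {
  class RichInt(val self: Int) {
    def squared: Int = self * self
  }

  // Risky style: the public return type is whatever the compiler infers from
  // the body, so a change to the body can silently change the API and, per the
  // discussion in the linked PR, can trip compiler bugs.
  // implicit def toRichIntInferred(i: Int) = new RichInt(i)

  // Preferred style: the return type is an explicit part of the declaration.
  implicit def toRichInt(i: Int): RichInt = new RichInt(i)
}
{code}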
[jira] [Resolved] (SPARK-5074) Fix a non-deterministic test in org.apache.spark.scheduler.DAGSchedulerSuite
[ https://issues.apache.org/jira/browse/SPARK-5074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-5074. -- Resolution: Fixed Fix Version/s: 1.3.0 Assignee: Shixiong Zhu This was merged in https://github.com/apache/spark/commit/5c506cecb933b156b2f06a688ee08c4347bf0d47 Fix a non-deterministic test in org.apache.spark.scheduler.DAGSchedulerSuite Key: SPARK-5074 URL: https://issues.apache.org/jira/browse/SPARK-5074 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Shixiong Zhu Assignee: Shixiong Zhu Priority: Minor Labels: flaky-test Fix For: 1.3.0 fix the following non-deterministic test in org.apache.spark.scheduler.DAGSchedulerSuite {noformat} [info] DAGSchedulerSuite: [info] - [SPARK-3353] parent stage should have lower stage id *** FAILED *** (27 milliseconds) [info] 1 did not equal 2 (DAGSchedulerSuite.scala:242) [info] org.scalatest.exceptions.TestFailedException: [info] at org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:500) [info] at org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1555) [info] at org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:466) [info] at org.apache.spark.scheduler.DAGSchedulerSuite$$anonfun$2.apply$mcV$sp(DAGSchedulerSuite.scala:242) [info] at org.apache.spark.scheduler.DAGSchedulerSuite$$anonfun$2.apply(DAGSchedulerSuite.scala:239) [info] at org.apache.spark.scheduler.DAGSchedulerSuite$$anonfun$2.apply(DAGSchedulerSuite.scala:239) [info] at org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22) [info] at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) [info] at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) [info] at org.scalatest.Transformer.apply(Transformer.scala:22) [info] at org.scalatest.Transformer.apply(Transformer.scala:20) [info] at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166) [info] at org.scalatest.Suite$class.withFixture(Suite.scala:1122) [info] at org.scalatest.FunSuite.withFixture(FunSuite.scala:1555) [info] at org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163) [info] at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) [info] at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) [info] at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306) [info] at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175) [info] at org.apache.spark.scheduler.DAGSchedulerSuite.org$scalatest$BeforeAndAfter$$super$runTest(DAGSchedulerSuite.scala:60) {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-5131) A typo in configuration doc
[ https://issues.apache.org/jira/browse/SPARK-5131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-5131. -- Resolution: Fixed Fix Version/s: 1.3.0 Assignee: uncleGen The PR was resolved for master and 1.2: https://github.com/apache/spark/commit/39e333ec4350ddafe29ee0958c37eec07bec85df A typo in configuration doc --- Key: SPARK-5131 URL: https://issues.apache.org/jira/browse/SPARK-5131 Project: Spark Issue Type: Bug Reporter: uncleGen Assignee: uncleGen Priority: Minor Fix For: 1.3.0, 1.2.1 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-5235) Determine serializability of SQLContext
[ https://issues.apache.org/jira/browse/SPARK-5235?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin reopened SPARK-5235: Sean - this was not done. We merged a patch to make it serializable again, but for 1.3 we should decide whether we wanted to be serializable for real. Determine serializability of SQLContext --- Key: SPARK-5235 URL: https://issues.apache.org/jira/browse/SPARK-5235 Project: Spark Issue Type: Sub-task Reporter: Alex Baretta Fix For: 1.3.0 The SQLConf field in SQLContext is neither Serializable nor transient. Here's the stack trace I get when running SQL queries against a Parquet file. {code} Exception in thread Thread-43 org.apache.spark.SparkException: Job aborted due to stage failure: Task not serializable: java.io.NotSerializableException: org.apache.spark.sql.SQLConf at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1195) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1184) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1183) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1183) at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitMissingTasks(DAGScheduler.scala:843) at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:779) at org.apache.spark.scheduler.DAGScheduler.handleJobSubmitted(DAGScheduler.scala:763) at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1364) at akka.actor.Actor$class.aroundReceive(Actor.scala:465) at org.apache.spark.scheduler.DAGSchedulerEventProcessActor.aroundReceive(DAGScheduler.scala:1356) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516) at akka.actor.ActorCell.invoke(ActorCell.scala:487) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238) at akka.dispatch.Mailbox.run(Mailbox.scala:220) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5235) Determine serializability of SQLContext
[ https://issues.apache.org/jira/browse/SPARK-5235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14290921#comment-14290921 ] Sean Owen commented on SPARK-5235: -- Sounds good, keep it open. This particular change was merged, but this can track the broader question, yes. Determine serializability of SQLContext --- Key: SPARK-5235 URL: https://issues.apache.org/jira/browse/SPARK-5235 Project: Spark Issue Type: Sub-task Reporter: Alex Baretta Fix For: 1.3.0 The SQLConf field in SQLContext is neither Serializable nor transient. Here's the stack trace I get when running SQL queries against a Parquet file. {code} Exception in thread Thread-43 org.apache.spark.SparkException: Job aborted due to stage failure: Task not serializable: java.io.NotSerializableException: org.apache.spark.sql.SQLConf at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1195) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1184) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1183) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1183) at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitMissingTasks(DAGScheduler.scala:843) at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:779) at org.apache.spark.scheduler.DAGScheduler.handleJobSubmitted(DAGScheduler.scala:763) at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1364) at akka.actor.Actor$class.aroundReceive(Actor.scala:465) at org.apache.spark.scheduler.DAGSchedulerEventProcessActor.aroundReceive(DAGScheduler.scala:1356) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516) at akka.actor.ActorCell.invoke(ActorCell.scala:487) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238) at akka.dispatch.Mailbox.run(Mailbox.scala:220) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5400) Rename GaussianMixtureEM to GaussianMixture
Joseph K. Bradley created SPARK-5400: Summary: Rename GaussianMixtureEM to GaussianMixture Key: SPARK-5400 URL: https://issues.apache.org/jira/browse/SPARK-5400 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Priority: Minor GaussianMixtureEM is following the old naming convention of including the optimization algorithm name in the class title. We should probably rename it to GaussianMixture so that it can use other optimization algorithms in the future. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3185) SPARK launch on Hadoop 2 in EC2 throws Tachyon exception when Formatting JOURNAL_FOLDER
[ https://issues.apache.org/jira/browse/SPARK-3185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14290923#comment-14290923 ] Florian Verhein commented on SPARK-3185: Sure [~grzegorz-dubicki]. You need to build with the correct version profiles. See for example: https://github.com/florianverhein/spark-ec2/blob/packer/spark/init.sh https://github.com/florianverhein/spark-ec2/blob/packer/tachyon/init.sh Note that I'm using Hadoop 2.4.1 (which I install on the image). SPARK launch on Hadoop 2 in EC2 throws Tachyon exception when Formatting JOURNAL_FOLDER --- Key: SPARK-3185 URL: https://issues.apache.org/jira/browse/SPARK-3185 Project: Spark Issue Type: Bug Affects Versions: 1.0.2 Environment: Amazon Linux AMI [ec2-user@ip-172-30-1-145 ~]$ uname -a Linux ip-172-30-1-145 3.10.42-52.145.amzn1.x86_64 #1 SMP Tue Jun 10 23:46:43 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux https://aws.amazon.com/amazon-linux-ami/2014.03-release-notes/ The build I used (and MD5 verified): [ec2-user@ip-172-30-1-145 ~]$ wget http://supergsego.com/apache/spark/spark-1.0.2/spark-1.0.2-bin-hadoop2.tgz Reporter: Jeremy Chambers {code} org.apache.hadoop.ipc.RemoteException: Server IPC version 7 cannot communicate with client version 4 {code} When I launch SPARK 1.0.2 on Hadoop 2 in a new EC2 cluster, the above tachyon exception is thrown when Formatting JOURNAL_FOLDER. No exception occurs when I launch on Hadoop 1. Launch used: {code} ./spark-ec2 -k spark_cluster -i /home/ec2-user/kagi/spark_cluster.ppk --zone=us-east-1a --hadoop-major-version=2 --spot-price=0.0165 -s 3 launch sparkProd {code} {code} log snippet Formatting Tachyon Master @ ec2-54-80-49-244.compute-1.amazonaws.com Formatting JOURNAL_FOLDER: /root/tachyon/libexec/../journal/ Exception in thread main java.lang.RuntimeException: org.apache.hadoop.ipc.RemoteException: Server IPC version 7 cannot communicate with client version 4 at tachyon.util.CommonUtils.runtimeException(CommonUtils.java:246) at tachyon.UnderFileSystemHdfs.init(UnderFileSystemHdfs.java:73) at tachyon.UnderFileSystemHdfs.getClient(UnderFileSystemHdfs.java:53) at tachyon.UnderFileSystem.get(UnderFileSystem.java:53) at tachyon.Format.main(Format.java:54) Caused by: org.apache.hadoop.ipc.RemoteException: Server IPC version 7 cannot communicate with client version 4 at org.apache.hadoop.ipc.Client.call(Client.java:1070) at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:225) at com.sun.proxy.$Proxy1.getProtocolVersion(Unknown Source) at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:396) at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:379) at org.apache.hadoop.hdfs.DFSClient.createRPCNamenode(DFSClient.java:119) at org.apache.hadoop.hdfs.DFSClient.init(DFSClient.java:238) at org.apache.hadoop.hdfs.DFSClient.init(DFSClient.java:203) at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:89) at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1386) at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:66) at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1404) at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:254) at org.apache.hadoop.fs.Path.getFileSystem(Path.java:187) at tachyon.UnderFileSystemHdfs.init(UnderFileSystemHdfs.java:69) ... 
3 more Killed 0 processes Killed 0 processes ec2-54-167-219-159.compute-1.amazonaws.com: Killed 0 processes ec2-54-198-198-17.compute-1.amazonaws.com: Killed 0 processes ec2-54-166-36-0.compute-1.amazonaws.com: Killed 0 processes ---end snippet--- {code} *I don't have this problem when I launch without the --hadoop-major-version=2 (which defaults to Hadoop 1.x).* -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5400) Rename GaussianMixtureEM to GaussianMixture
[ https://issues.apache.org/jira/browse/SPARK-5400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14290924#comment-14290924 ] Joseph K. Bradley commented on SPARK-5400: -- [~mengxr] [~tgaloppo] What do you think? Rename GaussianMixtureEM to GaussianMixture --- Key: SPARK-5400 URL: https://issues.apache.org/jira/browse/SPARK-5400 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Priority: Minor GaussianMixtureEM is following the old naming convention of including the optimization algorithm name in the class title. We should probably rename it to GaussianMixture so that it can use other optimization algorithms in the future. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4267) Failing to launch jobs on Spark on YARN with Hadoop 2.5.0 or later
[ https://issues.apache.org/jira/browse/SPARK-4267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14290744#comment-14290744 ] Apache Spark commented on SPARK-4267: - User 'srowen' has created a pull request for this issue: https://github.com/apache/spark/pull/4188 Failing to launch jobs on Spark on YARN with Hadoop 2.5.0 or later -- Key: SPARK-4267 URL: https://issues.apache.org/jira/browse/SPARK-4267 Project: Spark Issue Type: Bug Reporter: Tsuyoshi OZAWA Currently we're trying Spark on YARN included in Hadoop 2.5.1. Hadoop 2.5 uses protobuf 2.5.0 so I compiled with protobuf 2.5.1 like this: {code} ./make-distribution.sh --name spark-1.1.1 --tgz -Pyarn -Dhadoop.version=2.5.1 -Dprotobuf.version=2.5.0 {code} Then Spark on YARN fails to launch jobs with NPE. {code} $ bin/spark-shell --master yarn-client scala sc.textFile(hdfs:///user/ozawa/wordcountInput20G).flatMap(line = line.split( )).map(word = (word, 1)).persist().reduceByKey((a, b) = a + b, 16).saveAsTextFile(hdfs:///user/ozawa/sparkWordcountOutNew2); java.lang.NullPointerException at org.apache.spark.SparkContext.defaultParallelism(SparkContext.scala:1284) at org.apache.spark.SparkContext.defaultMinPartitions(SparkContext.scala:1291) at org.apache.spark.SparkContext.textFile$default$2(SparkContext.scala:480) at $iwC$$iwC$$iwC$$iwC.init(console:13) at $iwC$$iwC$$iwC.init(console:18) at $iwC$$iwC.init(console:20) at $iwC.init(console:22) at init(console:24) at .init(console:28) at .clinit(console) at .init(console:7) at .clinit(console) at $print(console) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:789) at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1062) at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:615) at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:646) at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:610)
[jira] [Commented] (SPARK-4267) Failing to launch jobs on Spark on YARN with Hadoop 2.5.0 or later
[ https://issues.apache.org/jira/browse/SPARK-4267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14290742#comment-14290742 ] Sean Owen commented on SPARK-4267: -- The warning is from YARN, I believe, rather than Spark. Yeah maybe should be an error. Your info however points to the problem; I'm sure it's {{-Dnumbers="one two three"}}. {{Utils.splitCommandString}} strips quotes as it parses them, so it will turn it into {{-Dnumbers=one two three}}; the command then becomes {{java -Dnumbers=one two three ...}}, and this isn't valid. I suggest that {{Utils.splitCommandString}} not strip the quotes that it parses, so that the reconstructed command line is exactly like the original. It's just splitting, not interpreting the command. This also seems less surprising. PR coming to demonstrate. Failing to launch jobs on Spark on YARN with Hadoop 2.5.0 or later -- Key: SPARK-4267 URL: https://issues.apache.org/jira/browse/SPARK-4267 Project: Spark Issue Type: Bug Reporter: Tsuyoshi OZAWA Currently we're trying Spark on YARN included in Hadoop 2.5.1. Hadoop 2.5 uses protobuf 2.5.0 so I compiled with protobuf 2.5.1 like this: {code} ./make-distribution.sh --name spark-1.1.1 --tgz -Pyarn -Dhadoop.version=2.5.1 -Dprotobuf.version=2.5.0 {code} Then Spark on YARN fails to launch jobs with NPE. {code} $ bin/spark-shell --master yarn-client scala sc.textFile(hdfs:///user/ozawa/wordcountInput20G).flatMap(line = line.split( )).map(word = (word, 1)).persist().reduceByKey((a, b) = a + b, 16).saveAsTextFile(hdfs:///user/ozawa/sparkWordcountOutNew2); java.lang.NullPointerException at org.apache.spark.SparkContext.defaultParallelism(SparkContext.scala:1284) at org.apache.spark.SparkContext.defaultMinPartitions(SparkContext.scala:1291) at org.apache.spark.SparkContext.textFile$default$2(SparkContext.scala:480) at $iwC$$iwC$$iwC$$iwC.init(console:13) at $iwC$$iwC$$iwC.init(console:18) at $iwC$$iwC.init(console:20) at $iwC.init(console:22) at init(console:24) at .init(console:28) at .clinit(console) at .init(console:7) at .clinit(console) at $print(console) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:789) at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1062) at
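As a rough illustration of the behaviour Sean describes, here is a quote-preserving splitter in Scala. The {{splitPreservingQuotes}} name is hypothetical; this is not the actual {{Utils.splitCommandString}} nor the pending PR, just a sketch of the idea: split on unquoted whitespace but keep the quote characters, so that a value such as {{-Dnumbers="one two three"}} survives as a single intact token and rejoining the tokens preserves the original quoting.
{code}
import scala.collection.mutable.ArrayBuffer

object CommandSplit {
  // Split on unquoted whitespace, keeping quote characters in the output tokens.
  def splitPreservingQuotes(s: String): Seq[String] = {
    val tokens = ArrayBuffer.empty[String]
    val cur = new StringBuilder
    var quote: Option[Char] = None
    for (c <- s) {
      if (quote.contains(c)) { cur += c; quote = None }                       // closing quote, kept
      else if (quote.isEmpty && (c == '"' || c == '\'')) { cur += c; quote = Some(c) }
      else if (quote.isEmpty && c.isWhitespace) {
        if (cur.nonEmpty) { tokens += cur.toString; cur.clear() }             // token boundary
      } else cur += c                                                         // ordinary char, or whitespace inside quotes
    }
    if (cur.nonEmpty) tokens += cur.toString
    tokens.toSeq
  }
}

// splitPreservingQuotes("""java -Dnumbers="one two three" -Xmx1g""")
//   == Seq("java", "-Dnumbers=\"one two three\"", "-Xmx1g")
{code}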
[jira] [Resolved] (SPARK-2105) SparkUI doesn't remove active stages that failed
[ https://issues.apache.org/jira/browse/SPARK-2105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-2105. -- Resolution: Fixed Fix Version/s: 1.1.0 It appears this is considered fixed by that commit, for 1.1.0, and it has not been reproducible otherwise. SparkUI doesn't remove active stages that failed Key: SPARK-2105 URL: https://issues.apache.org/jira/browse/SPARK-2105 Project: Spark Issue Type: Bug Components: Spark Core, Web UI Affects Versions: 1.0.0 Reporter: Andrew Or Fix For: 1.1.0 If a stage fails because its tasks cannot be serialized, for instance, the failed stage remains in the Active Stages section forever. This is because the StageCompleted event is never posted. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5395) Large number of Python workers causing resource depletion
[ https://issues.apache.org/jira/browse/SPARK-5395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sven Krasser updated SPARK-5395: Description: During job execution a large number of Python worker accumulates eventually causing YARN to kill containers for being over their memory allocation (in the case below that is about 8G for executors plus 6G for overhead per container). In this instance, at the time of killing the container 97 pyspark.daemon processes had accumulated. {noformat} 2015-01-23 15:36:53,654 INFO [Reporter] yarn.YarnAllocationHandler (Logging.scala:logInfo(59)) - Container marked as failed: container_1421692415636_0052_01_30. Exit status: 143. Diagnostics: Container [pid=35211,containerID=container_1421692415636_0052_01_30] is running beyond physical memory limits. Current usage: 14.9 GB of 14.5 GB physical memory used; 41.3 GB of 72.5 GB virtual memory used. Killing container. Dump of the process-tree for container_1421692415636_0052_01_30 : |- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE |- 54101 36625 36625 35211 (python) 78 1 332730368 16834 python -m pyspark.daemon |- 52140 36625 36625 35211 (python) 58 1 332730368 16837 python -m pyspark.daemon |- 36625 35228 36625 35211 (python) 65 604 331685888 17694 python -m pyspark.daemon [...] {noformat} The configuration used uses 64 containers with 2 cores each. Full output here: https://gist.github.com/skrasser/e3e2ee8dede5ef6b082c Mailinglist discussion: https://www.mail-archive.com/user@spark.apache.org/msg20102.html was: During job execution a large number of Python worker accumulates eventually causing YARN to kill containers for being over their memory allocation (in the case below that is about 8G for executors plus 6G for overhead per container). In this instance, at the time of killing the container 97 pyspark.daemon processes had accumulated. {noformat} 2015-01-23 15:36:53,654 INFO [Reporter] yarn.YarnAllocationHandler (Logging.scala:logInfo(59)) - Container marked as failed: container_1421692415636_0052_01_30. Exit status: 143. Diagnostics: Container [pid=35211,containerID=container_1421692415636_0052_01_30] is running beyond physical memory limits. Current usage: 14.9 GB of 14.5 GB physical memory used; 41.3 GB of 72.5 GB virtual memory used. Killing container. Dump of the process-tree for container_1421692415636_0052_01_30 : |- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE |- 54101 36625 36625 35211 (python) 78 1 332730368 16834 python -m pyspark.daemon |- 52140 36625 36625 35211 (python) 58 1 332730368 16837 python -m pyspark.daemon |- 36625 35228 36625 35211 (python) 65 604 331685888 17694 python -m pyspark.daemon [...] {noformat} Full output here: https://gist.github.com/skrasser/e3e2ee8dede5ef6b082c Mailinglist discussion: https://www.mail-archive.com/user@spark.apache.org/msg20102.html Large number of Python workers causing resource depletion - Key: SPARK-5395 URL: https://issues.apache.org/jira/browse/SPARK-5395 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.2.0 Environment: AWS ElasticMapReduce Reporter: Sven Krasser During job execution a large number of Python worker accumulates eventually causing YARN to kill containers for being over their memory allocation (in the case below that is about 8G for executors plus 6G for overhead per container). 
In this instance, at the time of killing the container 97 pyspark.daemon processes had accumulated. {noformat} 2015-01-23 15:36:53,654 INFO [Reporter] yarn.YarnAllocationHandler (Logging.scala:logInfo(59)) - Container marked as failed: container_1421692415636_0052_01_30. Exit status: 143. Diagnostics: Container [pid=35211,containerID=container_1421692415636_0052_01_30] is running beyond physical memory limits. Current usage: 14.9 GB of 14.5 GB physical memory used; 41.3 GB of 72.5 GB virtual memory used. Killing container. Dump of the process-tree for container_1421692415636_0052_01_30 : |- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE |- 54101 36625 36625 35211 (python) 78 1 332730368 16834 python -m pyspark.daemon |- 52140 36625 36625 35211 (python) 58 1 332730368 16837 python -m pyspark.daemon |- 36625 35228 36625 35211 (python) 65 604 331685888 17694 python -m pyspark.daemon [...] {noformat} The configuration used uses 64 containers with 2 cores each. Full output here: https://gist.github.com/skrasser/e3e2ee8dede5ef6b082c Mailinglist discussion:
[jira] [Created] (SPARK-5395) Large number of Python workers causing resource depletion
Sven Krasser created SPARK-5395: --- Summary: Large number of Python workers causing resource depletion Key: SPARK-5395 URL: https://issues.apache.org/jira/browse/SPARK-5395 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.2.0 Environment: AWS ElasticMapReduce Reporter: Sven Krasser During job execution a large number of Python worker accumulates eventually causing YARN to kill containers for being over their memory allocation (in the case below that is about 8G for executors plus 6G for overhead per container). In this instance, at the time of killing the container 97 pyspark.daemon processes had accumulated. {noformat} 2015-01-23 15:36:53,654 INFO [Reporter] yarn.YarnAllocationHandler (Logging.scala:logInfo(59)) - Container marked as failed: container_1421692415636_0052_01_30. Exit status: 143. Diagnostics: Container [pid=35211,containerID=container_1421692415636_0052_01_30] is running beyond physical memory limits. Current usage: 14.9 GB of 14.5 GB physical memory used; 41.3 GB of 72.5 GB virtual memory used. Killing container. Dump of the process-tree for container_1421692415636_0052_01_30 : |- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE |- 54101 36625 36625 35211 (python) 78 1 332730368 16834 python -m pyspark.daemon |- 52140 36625 36625 35211 (python) 58 1 332730368 16837 python -m pyspark.daemon |- 36625 35228 36625 35211 (python) 65 604 331685888 17694 python -m pyspark.daemon [...] {noformat} Full output here: https://gist.github.com/skrasser/e3e2ee8dede5ef6b082c Mailinglist discussion: https://www.mail-archive.com/user@spark.apache.org/msg20102.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5226) Add DBSCAN Clustering Algorithm to MLlib
[ https://issues.apache.org/jira/browse/SPARK-5226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14290523#comment-14290523 ] Muhammad-Ali A'rabi commented on SPARK-5226: That's right. For very large data it won't be a good implementation. It is O(log n), actually. In the preprocessing phase we build a sorted map (or a similar structure), and given a radius we can retrieve all points within that distance in O(log n). If we use the first implementation, each region query has to calculate many distances, and some of them have surely been calculated before. We can have both ways implemented, and users can choose between them depending on their needs. We could also store vectors with their norms and use the norm as an upper bound, but I don't fully trust this method and would have to test it. Add DBSCAN Clustering Algorithm to MLlib Key: SPARK-5226 URL: https://issues.apache.org/jira/browse/SPARK-5226 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Muhammad-Ali A'rabi Priority: Minor Labels: DBSCAN MLlib is all k-means now, and I think we should add some new clustering algorithms to it. The first candidate, I think, is DBSCAN. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
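One way to read the "sorted map" and "vector with norm" ideas in the comment above is the classic pivot-pruning trick: sort points by their distance to a fixed pivot (e.g. the origin, i.e. the norm), then for a DBSCAN region query use binary search to find the band of candidates allowed by the triangle inequality and check exact distances only inside that band. The sketch below is purely illustrative (hypothetical {{RegionQuery}} object, not the implementation discussed in the ticket), and whether it really behaves like O(log n) depends on how many candidates fall into the band.
{code}
object RegionQuery {
  type Point = Array[Double]

  def dist(a: Point, b: Point): Double =
    math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)

  // Preprocessing: sort points by distance to a fixed pivot (here the origin,
  // so the key is simply each point's norm). Assumes `points` is non-empty.
  def index(points: Seq[Point]): Array[(Double, Point)] = {
    val origin = Array.fill(points.head.length)(0.0)
    points.map(p => (dist(p, origin), p)).sortBy(_._1).toArray
  }

  // Region query: by the triangle inequality, any point within eps of q has a
  // pivot distance in [d(q) - eps, d(q) + eps]. Binary search locates that band
  // in O(log n); only candidates in the band need exact distance checks.
  def regionQuery(idx: Array[(Double, Point)], q: Point, eps: Double): Seq[Point] = {
    val dq = dist(q, Array.fill(q.length)(0.0))
    var lo = 0
    var hi = idx.length
    while (lo < hi) {                       // lower bound of the band
      val mid = (lo + hi) / 2
      if (idx(mid)._1 < dq - eps) lo = mid + 1 else hi = mid
    }
    idx.drop(lo).takeWhile(_._1 <= dq + eps).map(_._2).filter(p => dist(p, q) <= eps)
  }
}
{code}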
[jira] [Updated] (SPARK-2285) Give various TaskEndReason subclass more descriptive names
[ https://issues.apache.org/jira/browse/SPARK-2285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-2285: --- Assignee: (was: Reynold Xin) Give various TaskEndReason subclass more descriptive names -- Key: SPARK-2285 URL: https://issues.apache.org/jira/browse/SPARK-2285 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.0.0 Reporter: Reynold Xin Priority: Minor It is just strange to have org.apache.spark.Success be a TaskEndReason. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-2285) Give various TaskEndReason subclass more descriptive names
[ https://issues.apache.org/jira/browse/SPARK-2285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-2285: --- Component/s: Spark Core Give various TaskEndReason subclass more descriptive names -- Key: SPARK-2285 URL: https://issues.apache.org/jira/browse/SPARK-2285 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.0.0 Reporter: Reynold Xin Priority: Minor It is just strange to have org.apache.spark.Success be a TaskEndReason. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2285) Give various TaskEndReason subclass more descriptive names
[ https://issues.apache.org/jira/browse/SPARK-2285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14290502#comment-14290502 ] Reynold Xin commented on SPARK-2285: Hey Sean - I was thinking TaskSuccess, TaskFailed, etc, would be much better than Success, since Success can mean a lot of things, without looking up the heritage. Give various TaskEndReason subclass more descriptive names -- Key: SPARK-2285 URL: https://issues.apache.org/jira/browse/SPARK-2285 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.0.0 Reporter: Reynold Xin Assignee: Reynold Xin Priority: Minor It is just strange to have org.apache.spark.Success be a TaskEndReason. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
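For readers of the thread, a sketch of what the proposed naming would look like. This is illustrative only; Spark's real {{TaskEndReason}} hierarchy has more members and carries more data than shown here.
{code}
// Today, plain `Success` is one of the TaskEndReason values, which reads oddly
// out of context. The proposal is to give the subclasses a Task- prefix.
sealed trait TaskEndReason
case object TaskSuccess extends TaskEndReason
case class TaskFailed(message: String) extends TaskEndReason
case object TaskKilled extends TaskEndReason
{code}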
[jira] [Created] (SPARK-5396) Syntax error in spark scripts on windows.
Vladimir Protsenko created SPARK-5396: - Summary: Syntax error in spark scripts on windows. Key: SPARK-5396 URL: https://issues.apache.org/jira/browse/SPARK-5396 Project: Spark Issue Type: Bug Components: Spark Shell Affects Versions: 1.2.0 Environment: Window 7 and Window 8.1. Reporter: Vladimir Protsenko I made the following steps: 1. downloaded and installed Scala 2.11.5 2. downloaded spark 1.2.0 by git clone git://github.com/apache/spark.git 3. run dev/change-version-to-2.11.sh and mvn -Dscala-2.11 -DskipTests clean package (in git bash) After installation tried to run spark-shell.cmd in cmd shell and it says there is a syntax error in file. The same with spark-shell2.cmd, spark-submit.cmd and spark-submit2.cmd. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-3471) Automatic resource manager for SparkContext in Scala?
[ https://issues.apache.org/jira/browse/SPARK-3471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-3471. -- Resolution: Not a Problem This is about adding some kind of try-with-resources equivalent for Scala? No, there isn't one. I know of the ARM library that provides this functionality: https://github.com/jsuereth/scala-arm In terms of what the Spark code has to do to enable resource management with {{SparkContext}}, there's nothing to do. It implements {{Closeable}} but even that is not necessary for this library to work. So it's something a user app could include if really desired. I don't think there is a change to Spark needed here. Automatic resource manager for SparkContext in Scala? - Key: SPARK-3471 URL: https://issues.apache.org/jira/browse/SPARK-3471 Project: Spark Issue Type: New Feature Components: Spark Core Affects Versions: 1.0.2 Reporter: Shay Rojansky Priority: Minor After discussion in SPARK-2972, it seems like a good idea to add automatic resource management semantics to SparkContext (i.e. with in Python (SPARK-3458), Closeable/AutoCloseable in Java (SPARK-3470)). I have no knowledge of Scala whatsoever, but a quick search seems to indicate that there isn't a standard mechanism for this - someone with real Scala knowledge should take a look and make a decision... -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
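For anyone who wants the try-with-resources feel without pulling in scala-arm, the loan pattern is a few lines of user code. A minimal sketch (the {{Using.withSparkContext}} helper below is hypothetical application code, not something Spark ships):
{code}
import org.apache.spark.{SparkConf, SparkContext}

object Using {
  // Minimal loan pattern: run f with a fresh context, always stopping it afterwards.
  def withSparkContext[A](conf: SparkConf)(f: SparkContext => A): A = {
    val sc = new SparkContext(conf)
    try f(sc) finally sc.stop()
  }
}

// Usage:
// Using.withSparkContext(new SparkConf().setAppName("example").setMaster("local[2]")) { sc =>
//   sc.parallelize(1 to 100).count()
// }
{code}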
[jira] [Updated] (SPARK-5383) support alias for udfs with multi output columns
[ https://issues.apache.org/jira/browse/SPARK-5383?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] wangfei updated SPARK-5383: --- Summary: support alias for udfs with multi output columns (was: Multi alias names support) support alias for udfs with multi output columns Key: SPARK-5383 URL: https://issues.apache.org/jira/browse/SPARK-5383 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.2.0 Reporter: wangfei now spark sql does not support multi alias names, The following sql failed in spark-sql: select key as (k1, k2), value as (v1, v2) from src limit 5 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5383) support alias for udfs with multi output columns
[ https://issues.apache.org/jira/browse/SPARK-5383?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] wangfei updated SPARK-5383: --- Description: when a udf output multi columns, now we can not use alias for them in spark-sql, see this flowing sql: select stack(1, key, value, key, value) as (a, b, c, d) from src limit 5; was: now spark sql does not support multi alias names, The following sql failed in spark-sql: select key as (k1, k2), value as (v1, v2) from src limit 5 support alias for udfs with multi output columns Key: SPARK-5383 URL: https://issues.apache.org/jira/browse/SPARK-5383 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.2.0 Reporter: wangfei when a udf output multi columns, now we can not use alias for them in spark-sql, see this flowing sql: select stack(1, key, value, key, value) as (a, b, c, d) from src limit 5; -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-3430) Introduce ValueIncrementableHashMapAccumulator to compute Histogram and other statistical metrics
[ https://issues.apache.org/jira/browse/SPARK-3430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-3430. -- Resolution: Won't Fix PR says this is WontFix Introduce ValueIncrementableHashMapAccumulator to compute Histogram and other statistical metrics - Key: SPARK-3430 URL: https://issues.apache.org/jira/browse/SPARK-3430 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Suraj Satishkumar Sheth Pull request : https://github.com/apache/spark/pull/2314 Currently, we don't have a hash map which can be used as an accumulator to produce a histogram or distribution. This class will provide a customized HashMap implementation whose values can be incremented, e.g. map += (a, 1), map += (a, 6) will lead to (a, 7). This can have various applications like computation of histograms, sampling strategy generation, statistical metric computation in MLlib, etc. Example usage: val map = sc.accumulableCollection(new ValueIncrementableHashMapAccumulator[Int]()) var countMap = sc.broadcast(map) data.foreach(record => { var valArray = record.split("\t") var valString = "" var i = 0 var tuple = (0, 1L) countMap.value += tuple for (valString <- valArray) { i = i + 1 try { valString.toDouble var tuple = (i, 1L) countMap.value += tuple } catch { case ioe: Exception => None } } }) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5396) Syntax error in spark scripts on windows.
[ https://issues.apache.org/jira/browse/SPARK-5396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vladimir Protsenko updated SPARK-5396: -- Attachment: windows8.1.png windows7.png Syntax error in spark scripts on windows. - Key: SPARK-5396 URL: https://issues.apache.org/jira/browse/SPARK-5396 Project: Spark Issue Type: Bug Components: Spark Shell Affects Versions: 1.2.0 Environment: Window 7 and Window 8.1. Reporter: Vladimir Protsenko Attachments: windows7.png, windows8.1.png I made the following steps: 1. downloaded and installed Scala 2.11.5 2. downloaded spark 1.2.0 by git clone git://github.com/apache/spark.git 3. run dev/change-version-to-2.11.sh and mvn -Dscala-2.11 -DskipTests clean package (in git bash) After installation tried to run spark-shell.cmd in cmd shell and it says there is a syntax error in file. The same with spark-shell2.cmd, spark-submit.cmd and spark-submit2.cmd. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5396) Syntax error in spark scripts on windows.
[ https://issues.apache.org/jira/browse/SPARK-5396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vladimir Protsenko updated SPARK-5396: -- Description: I made the following steps: 1. downloaded and installed Scala 2.11.5 2. downloaded spark 1.2.0 by git clone git://github.com/apache/spark.git 3. run dev/change-version-to-2.11.sh and mvn -Dscala-2.11 -DskipTests clean package (in git bash) After installation tried to run spark-shell.cmd in cmd shell and it says there is a syntax error in file. The same with spark-shell2.cmd, spark-submit.cmd and spark-submit2.cmd. !windows7.png! was: I made the following steps: 1. downloaded and installed Scala 2.11.5 2. downloaded spark 1.2.0 by git clone git://github.com/apache/spark.git 3. run dev/change-version-to-2.11.sh and mvn -Dscala-2.11 -DskipTests clean package (in git bash) After installation tried to run spark-shell.cmd in cmd shell and it says there is a syntax error in file. The same with spark-shell2.cmd, spark-submit.cmd and spark-submit2.cmd. Syntax error in spark scripts on windows. - Key: SPARK-5396 URL: https://issues.apache.org/jira/browse/SPARK-5396 Project: Spark Issue Type: Bug Components: Spark Shell Affects Versions: 1.2.0 Environment: Window 7 and Window 8.1. Reporter: Vladimir Protsenko Attachments: windows7.png, windows8.1.png I made the following steps: 1. downloaded and installed Scala 2.11.5 2. downloaded spark 1.2.0 by git clone git://github.com/apache/spark.git 3. run dev/change-version-to-2.11.sh and mvn -Dscala-2.11 -DskipTests clean package (in git bash) After installation tried to run spark-shell.cmd in cmd shell and it says there is a syntax error in file. The same with spark-shell2.cmd, spark-submit.cmd and spark-submit2.cmd. !windows7.png! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3489) support rdd.zip(rdd1, rdd2,...) with variable number of rdds as params
[ https://issues.apache.org/jira/browse/SPARK-3489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-3489: - Priority: Minor (was: Major) Target Version/s: (was: 1.2.0) This should be a pull request rather than diff pasted in comments. What's the use case for this vs two zips? support rdd.zip(rdd1, rdd2,...) with variable number of rdds as params -- Key: SPARK-3489 URL: https://issues.apache.org/jira/browse/SPARK-3489 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.0.2 Reporter: Mohit Jaggi Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-3621) Provide a way to broadcast an RDD (instead of just a variable made of the RDD) so that a job can access
[ https://issues.apache.org/jira/browse/SPARK-3621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-3621. -- Resolution: Not a Problem Given the discussion, this is best solved by reading data directly at the workers, rather than involving the driver, or it is already solvable by broadcasting values collected on the driver. It won't be possible to broadcast an RDD, in any event. Provide a way to broadcast an RDD (instead of just a variable made of the RDD) so that a job can access --- Key: SPARK-3621 URL: https://issues.apache.org/jira/browse/SPARK-3621 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.0.0, 1.1.0 Reporter: Xuefu Zhang In some cases, such as Hive's way of doing map-side join, it would be benefcial to allow client program to broadcast RDDs rather than just variables made of these RDDs. Broadcasting a variable made of RDDs requires all RDD data be collected to the driver and that the variable be shipped to the cluster after being made. It would be more performing if driver just broadcasts the RDDs and uses the corresponding data in jobs (such building hashmaps at executors). Tez has a broadcast edge which can ship data from previous stage to the next stage, which doesn't require driver side processing. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-3195) Can you add some statistics to do logistic regression better in mllib?
[ https://issues.apache.org/jira/browse/SPARK-3195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-3195. -- Resolution: Invalid Can you add some statistics to do logistic regression better in mllib? -- Key: SPARK-3195 URL: https://issues.apache.org/jira/browse/SPARK-3195 Project: Spark Issue Type: New Feature Components: MLlib Reporter: miumiu Priority: Minor Original Estimate: 1m Remaining Estimate: 1m Hi, in practical work with logistic regression models, tests of the regression coefficients and of the overall model fit are very important. Can you add some effective support for these aspects? For example, the likelihood ratio test or the Wald test is often used to test coefficients, and the Hosmer-Lemeshow test is used to evaluate model fit. I see that we already have ROC and Precision-Recall, but could you also provide the KS statistic, which is widely used for model evaluation? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-2442) Add a Hadoop Writable serializer
[ https://issues.apache.org/jira/browse/SPARK-2442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-2442. -- Resolution: Duplicate Add a Hadoop Writable serializer Key: SPARK-2442 URL: https://issues.apache.org/jira/browse/SPARK-2442 Project: Spark Issue Type: Bug Reporter: Hari Shreedharan Using data read from hadoop files in shuffles can cause exceptions with the following stacktrace: {code} java.io.NotSerializableException: org.apache.hadoop.io.BytesWritable at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1181) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1541) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1506) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1429) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1175) at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347) at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:42) at org.apache.spark.storage.DiskBlockObjectWriter.write(BlockObjectWriter.scala:179) at org.apache.spark.scheduler.ShuffleMapTask$$anonfun$runTask$1.apply(ShuffleMapTask.scala:161) at org.apache.spark.scheduler.ShuffleMapTask$$anonfun$runTask$1.apply(ShuffleMapTask.scala:158) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:158) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99) at org.apache.spark.scheduler.Task.run(Task.scala:51) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:183) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:679) {code} This though seems to go away if Kyro serializer is used. I am wondering if adding a Hadoop-writables friendly serializer makes sense as it is likely to perform better than Kyro without registration, since Writables don't implement Serializable - so the serialization might not be the most efficient. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5397) Assigning aliases to several return values of an UDF
Max created SPARK-5397: -- Summary: Assigning aliases to several return values of an UDF Key: SPARK-5397 URL: https://issues.apache.org/jira/browse/SPARK-5397 Project: Spark Issue Type: Bug Components: SQL Reporter: Max A query with the following syntax is not valid SQL in Spark because of the assignment of multiple aliases, so it does not seem possible to port existing HiveQL queries whose UDFs return multiple values to Spark SQL. Query SELECT my_function(param_one, param_two) AS (return_one, return_two, return_three) FROM my_table; Error Unsupported language features in query: SELECT my_function(param_one, param_two) AS (return_one, return_two, return_three) FROM my_table; TOK_QUERY TOK_FROM TOK_TABREF TOK_TABNAME my_table TOK_SELECT TOK_SELEXPR TOK_FUNCTION my_function TOK_TABLE_OR_COL param_one TOK_TABLE_OR_COL param_two return_one return_two return_three -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3852) Document spark.driver.extra* configs
[ https://issues.apache.org/jira/browse/SPARK-3852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14290597#comment-14290597 ] Apache Spark commented on SPARK-3852: - User 'srowen' has created a pull request for this issue: https://github.com/apache/spark/pull/4185 Document spark.driver.extra* configs Key: SPARK-3852 URL: https://issues.apache.org/jira/browse/SPARK-3852 Project: Spark Issue Type: Bug Components: Documentation Affects Versions: 1.1.0 Reporter: Andrew Or They are not documented... -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3859) Use consistent config names for duration (with units!)
[ https://issues.apache.org/jira/browse/SPARK-3859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14290600#comment-14290600 ] Sean Owen commented on SPARK-3859: -- I double-checked that all of the config properties that are expressed in time, like timeouts, durations, and rates, have their units documented in {{configuration.md}}. IMHO it's probably not worth adding 20 new properties and deprecating 20 and supporting both just to add the units to the property name. Use consistent config names for duration (with units!) -- Key: SPARK-3859 URL: https://issues.apache.org/jira/browse/SPARK-3859 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.1.0 Reporter: Andrew Or There are many configs in Spark that refer to some unit of time. However, at first glance it is unclear what these units are. We should find a consistent way to append the units to the end of these config names and deprecate the old ones in favor of the more consistent ones. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3875) Add TEMP DIRECTORY configuration
[ https://issues.apache.org/jira/browse/SPARK-3875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14290602#comment-14290602 ] Sean Owen commented on SPARK-3875: -- You can already set {{java.io.tmpdir}}, without making a new property, and that will control where all Java code puts its temp files. There is already {{spark.local.dir}}, which sounds like exactly what you're suggesting. This gets set to a big fast disk because it's where things like shuffle files go. Is the question here perhaps whether a few bits of code that don't use {{spark.local.dir}} should use it? Yes, it looks like {{java.io.tmpdir}} is still used by {{HttpBroadcast.scala}} and by dependency downloads. Add TEMP DIRECTORY configuration Key: SPARK-3875 URL: https://issues.apache.org/jira/browse/SPARK-3875 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.1.0 Reporter: Patrick Liu Currently, Spark uses java.io.tmpdir to find the /tmp/ directory. The /tmp/ directory is then used to 1. set up the HTTP file server, 2. hold the broadcast directory, and 3. fetch dependency files or jars on executors. The /tmp/ directory keeps growing, leaving less and less free space on the system disk. I think we could add a configuration, spark.tmp.dir, in conf/spark-env.sh or conf/spark-defaults.conf to set this particular directory - say, to a data disk. If spark.tmp.dir is not set, use the default java.io.tmpdir. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
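To make the existing knobs mentioned above concrete, a minimal sketch (the paths are placeholders; the same properties can equally go in conf/spark-defaults.conf):
{code:scala}
import org.apache.spark.SparkConf

// Point Spark's scratch space (shuffle files, etc.) at a large data disk instead of /tmp,
// and redirect java.io.tmpdir for the driver and executor JVMs as a whole.
val conf = new SparkConf()
  .set("spark.local.dir", "/data1/spark-tmp")
  .set("spark.driver.extraJavaOptions", "-Djava.io.tmpdir=/data1/tmp")
  .set("spark.executor.extraJavaOptions", "-Djava.io.tmpdir=/data1/tmp")
{code}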
[jira] [Commented] (SPARK-5383) support alias for udfs with multi output columns
[ https://issues.apache.org/jira/browse/SPARK-5383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14290630#comment-14290630 ] Apache Spark commented on SPARK-5383: - User 'scwf' has created a pull request for this issue: https://github.com/apache/spark/pull/4186 support alias for udfs with multi output columns Key: SPARK-5383 URL: https://issues.apache.org/jira/browse/SPARK-5383 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.2.0 Reporter: wangfei When a UDF outputs multiple columns, we currently cannot assign aliases to them in spark-sql; see the following SQL: select stack(1, key, value, key, value) as (a, b, c, d) from src limit 5; -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-4283) Spark source code does not correctly import into eclipse
[ https://issues.apache.org/jira/browse/SPARK-4283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-4283. -- Resolution: Won't Fix I suggest resolving this as WontFix since the Maven build is correct and supported, and we have instructions about how to successfully use Spark with Eclipse: https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark#ContributingtoSpark-Eclipse Spark source code does not correctly import into eclipse Key: SPARK-4283 URL: https://issues.apache.org/jira/browse/SPARK-4283 Project: Spark Issue Type: Bug Components: Build Reporter: Yang Yang Priority: Minor Attachments: spark_eclipse.diff When I import the Spark source into Eclipse, either by running mvn eclipse:eclipse and then importing existing general projects, or by importing existing Maven projects, Eclipse does not recognize the project as a Scala project. I am adding a new plugin so that the import works. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3439) Add Canopy Clustering Algorithm
[ https://issues.apache.org/jira/browse/SPARK-3439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14290634#comment-14290634 ] Muhammad-Ali A'rabi commented on SPARK-3439: Possible implementation: {code:scala}
import org.apache.spark.mllib.linalg._
import java.util.HashMap

val vas = Array(Array(1.0, 0, 0), Array(1.1, 0, 0), Array(0.9, 0, 0), Array(0, 1.0, 0), Array(0.01, 1.01, 0), Array(0, 0, 1.0), Array(0, 0, 1.1))
val vs = vas.map(Vectors.dense(_))
val t1 = 1.0
val t2 = 0.5
// starting canopy
val map = new HashMap[Vector, Vector] // map from data to clusters
val set = new HashMap[Vector, Boolean] // the set
for(v <- vs) set.put(v, true)
for(v <- vs) {
  if(set.get(v)) {
    val dists = vs.map{ x => (x, Vectors.sqdist(x, v)) }
    dists.foreach { case (x, d) =>
      if(d < t1) map.put(x, v)
      if(d < t2) set.put(x, false)
    }
  }
}
{code} The algorithm works with arrays and lists, but all of them could be converted to RDDs. Add Canopy Clustering Algorithm --- Key: SPARK-3439 URL: https://issues.apache.org/jira/browse/SPARK-3439 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Yu Ishikawa Assignee: Muhammad-Ali A'rabi Priority: Minor The canopy clustering algorithm is an unsupervised pre-clustering algorithm. It is often used as a preprocessing step for the K-means algorithm or the Hierarchical clustering algorithm. It is intended to speed up clustering operations on large data sets, where using another algorithm directly may be impractical due to the size of the data set. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
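As a rough illustration of the comment's closing point that the arrays could be converted to RDDs, the same idea could be expressed against an RDD roughly as follows. This is only an assumption about how the loop might translate, not a proposed MLlib API:
{code:scala}
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.rdd.RDD
import scala.collection.mutable.ArrayBuffer

// Sketch: repeatedly pick a remaining point as a canopy center, assign every
// point within t1 to that canopy, and drop points within t2 from further
// consideration as future centers.
def canopies(data: RDD[Vector], t1: Double, t2: Double): Seq[(Vector, Array[Vector])] = {
  val result = ArrayBuffer[(Vector, Array[Vector])]()
  var remaining = data.cache()
  while (remaining.count() > 0) {
    val center = remaining.first()
    val members = data.filter(v => Vectors.sqdist(v, center) < t1).collect()
    result += ((center, members))
    remaining = remaining.filter(v => Vectors.sqdist(v, center) >= t2).cache()
  }
  result.toSeq
}
{code}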
[jira] [Commented] (SPARK-3782) Direct use of log4j in AkkaUtils interferes with certain logging configurations
[ https://issues.apache.org/jira/browse/SPARK-3782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14290594#comment-14290594 ] Sean Owen commented on SPARK-3782: -- Aha, I think there's a good point here. Looks like this method was added to the log4j shim in slf4j 1.7.6: https://github.com/qos-ch/slf4j/commit/004b5d4879a079f3d6f610b7fe339a0fad7d4831 So it should be fine if you use log4j-over-slf4j 1.7.6+ in your app. Spark references slf4j 1.7.5 though. Although I don't think it will matter if you use a different version, we could update Spark's slf4j to 1.7.6 at least, to be really consistent. 1.7.10 is the latest in fact. Direct use of log4j in AkkaUtils interferes with certain logging configurations Key: SPARK-3782 URL: https://issues.apache.org/jira/browse/SPARK-3782 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Martin Gilday AkkaUtils is calling setLevel on Logger from log4j. This causes issues when using another implementation of SLF4J such as logback, as log4j-over-slf4j.jar's implementation of this class does not contain this method on Logger. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
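For application authors hitting this today, a sketch of the dependency setup implied above (the versions and the spark-core exclusions are assumptions; adjust to whatever is current for your build):
{code:scala}
// build.sbt (sketch): use the 1.7.6+ log4j shim, which has Logger.setLevel,
// and route logging through logback instead of log4j.
libraryDependencies ++= Seq(
  ("org.apache.spark" %% "spark-core" % "1.2.0")
    .exclude("log4j", "log4j")
    .exclude("org.slf4j", "slf4j-log4j12"),
  "org.slf4j" % "log4j-over-slf4j" % "1.7.10",
  "ch.qos.logback" % "logback-classic" % "1.1.2"
)
{code}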
[jira] [Commented] (SPARK-3782) Direct use of log4j in AkkaUtils interferes with certain logging configurations
[ https://issues.apache.org/jira/browse/SPARK-3782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14290595#comment-14290595 ] Apache Spark commented on SPARK-3782: - User 'srowen' has created a pull request for this issue: https://github.com/apache/spark/pull/4184 Direct use of log4j in AkkaUtils interferes with certain logging configurations Key: SPARK-3782 URL: https://issues.apache.org/jira/browse/SPARK-3782 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Martin Gilday AkkaUtils is calling setLevel on Logger from log4j. This causes issues when using another implementation of SLF4J such as logback, as log4j-over-slf4j.jar's implementation of this class does not contain this method on Logger. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-3148) Update global variables of HttpBroadcast so that multiple SparkContexts can coexist
[ https://issues.apache.org/jira/browse/SPARK-3148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-3148. -- Resolution: Won't Fix PR says this is WontFix Update global variables of HttpBroadcast so that multiple SparkContexts can coexist --- Key: SPARK-3148 URL: https://issues.apache.org/jira/browse/SPARK-3148 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: YanTang Zhai Priority: Minor Update global variables of HttpBroadcast so that multiple SparkContexts can coexist -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2348) In Windows having an environment variable named 'classpath' gives error
[ https://issues.apache.org/jira/browse/SPARK-2348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14290632#comment-14290632 ] Sean Owen commented on SPARK-2348: -- [~chiragtodarka] [~Xierqi] The resolution proposed here sounds like the one for SPARK-4161. It looks like a similar, parallel change in {{windows-utils.cmd}} might fix this? You can make a pull request on GitHub to propose the change rather than writing the diff here. In Windows having an environment variable named 'classpath' gives error --- Key: SPARK-2348 URL: https://issues.apache.org/jira/browse/SPARK-2348 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.0 Environment: Windows 7 Enterprise Reporter: Chirag Todarka Assignee: Chirag Todarka Priority: Critical Operating System: Windows 7 Enterprise. If an environment variable named 'classpath' is set, then starting 'spark-shell' gives the error below: mydir\spark\bin>spark-shell Failed to initialize compiler: object scala.runtime in compiler mirror not found. ** Note that as of 2.8 scala does not assume use of the java classpath. ** For the old behavior pass -usejavacp to scala, or if using a Settings ** object programatically, settings.usejavacp.value = true. 14/07/02 14:22:06 WARN SparkILoop$SparkILoopInterpreter: Warning: compiler accessed before init set up. Assuming no postInit code. Failed to initialize compiler: object scala.runtime in compiler mirror not found. ** Note that as of 2.8 scala does not assume use of the java classpath. ** For the old behavior pass -usejavacp to scala, or if using a Settings ** object programatically, settings.usejavacp.value = true. Exception in thread "main" java.lang.AssertionError: assertion failed: null at scala.Predef$.assert(Predef.scala:179) at org.apache.spark.repl.SparkIMain.initializeSynchronous(SparkIMain.scala:202) at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply$mcZ$sp(SparkILoop.scala:929) at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:884) at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:884) at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135) at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:884) at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:982) at org.apache.spark.repl.Main$.main(Main.scala:31) at org.apache.spark.repl.Main.main(Main.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source) at java.lang.reflect.Method.invoke(Unknown Source) at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:292) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:55) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5297) JavaStreamingContext.fileStream won't work because type info isn't propagated
[ https://issues.apache.org/jira/browse/SPARK-5297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-5297: - Summary: JavaStreamingContext.fileStream won't work because type info isn't propagated (was: File Streams do not work with custom key/values) JavaStreamingContext.fileStream won't work because type info isn't propagated - Key: SPARK-5297 URL: https://issues.apache.org/jira/browse/SPARK-5297 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.2.0 Reporter: Leonidas Fegaras Assignee: Saisai Shao Labels: backport-needed Fix For: 1.3.0 The following code: {code}
stream_context.<K,V,SequenceFileInputFormat<K,V>>fileStream(directory)
  .foreachRDD(new Function<JavaPairRDD<K,V>,Void>() {
    public Void call ( JavaPairRDD<K,V> rdd ) throws Exception {
      for ( Tuple2<K,V> x: rdd.collect() )
        System.out.println("# "+x._1+" "+x._2);
      return null;
    }
  });
stream_context.start();
stream_context.awaitTermination();
{code} for custom (serializable) classes K and V compiles fine but gives an error when I drop a new Hadoop sequence file in the directory: {quote} 15/01/17 09:13:59 ERROR scheduler.JobScheduler: Error generating jobs for time 1421507639000 ms java.lang.ClassCastException: java.lang.Object cannot be cast to org.apache.hadoop.mapreduce.InputFormat at org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:91) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:203) at org.apache.spark.streaming.dstream.FileInputDStream$$anonfun$3.apply(FileInputDStream.scala:236) at org.apache.spark.streaming.dstream.FileInputDStream$$anonfun$3.apply(FileInputDStream.scala:234) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:34) at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) at scala.collection.AbstractTraversable.map(Traversable.scala:105) at org.apache.spark.streaming.dstream.FileInputDStream.org$apache$spark$streaming$dstream$FileInputDStream$$filesToRDD(FileInputDStream.scala:234) at org.apache.spark.streaming.dstream.FileInputDStream.compute(FileInputDStream.scala:128) at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:296) at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:288) at scala.Option.orElse(Option.scala:257) {quote} The same classes K and V work fine for non-streaming Spark: {code} spark_context.newAPIHadoopFile(path,F.class,K.class,SequenceFileInputFormat.class,conf) {code} Also, streaming works fine for TextFileInputFormat. The issue is that class manifests are erased to Object in the Java file stream constructor, but those are relied on downstream when creating the Hadoop RDD that backs each batch of the file stream. 
https://github.com/apache/spark/blob/v1.2.0/streaming/src/main/scala/org/apache/spark/streaming/api/java/JavaStreamingContext.scala#L263 https://github.com/apache/spark/blob/v1.2.0/core/src/main/scala/org/apache/spark/SparkContext.scala#L753 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
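A sketch of the kind of fix this implies (an assumption about the shape of the change, not the actual patch): build real ClassTags from the Class objects a Java caller can pass, rather than letting them default to Object. The helper class below is hypothetical:
{code:scala}
import scala.reflect.ClassTag
import org.apache.hadoop.mapreduce.{InputFormat => NewInputFormat}
import org.apache.spark.streaming.StreamingContext

// Hypothetical helper: take Class arguments from a Java caller and turn them into
// ClassTags, so the Hadoop RDD behind each batch sees the actual K/V/F types.
class TypedFileStreams(ssc: StreamingContext) {
  def fileStream[K, V, F <: NewInputFormat[K, V]](
      directory: String,
      kClass: Class[K],
      vClass: Class[V],
      fClass: Class[F]) = {
    implicit val kt: ClassTag[K] = ClassTag(kClass)
    implicit val vt: ClassTag[V] = ClassTag(vClass)
    implicit val ft: ClassTag[F] = ClassTag(fClass)
    ssc.fileStream[K, V, F](directory)
  }
}
{code}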
[jira] [Comment Edited] (SPARK-3439) Add Canopy Clustering Algorithm
[ https://issues.apache.org/jira/browse/SPARK-3439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14290634#comment-14290634 ] Muhammad-Ali A'rabi edited comment on SPARK-3439 at 1/24/15 2:41 PM: - Possible implementation: {code:java}
import org.apache.spark.mllib.linalg._
import java.util.HashMap

val vas = Array(Array(1.0, 0, 0), Array(1.1, 0, 0), Array(0.9, 0, 0), Array(0, 1.0, 0), Array(0.01, 1.01, 0), Array(0, 0, 1.0), Array(0, 0, 1.1))
val vs = vas.map(Vectors.dense(_))
val t1 = 1.0
val t2 = 0.5
// starting canopy
val map = new HashMap[Vector, Vector] // map from data to clusters
val set = new HashMap[Vector, Boolean] // the set
for(v <- vs) set.put(v, true)
for(v <- vs) {
  if(set.get(v)) {
    val dists = vs.map{ x => (x, Vectors.sqdist(x, v)) }
    dists.foreach { case (x, d) =>
      if(d < t1) map.put(x, v)
      if(d < t2) set.put(x, false)
    }
  }
}
{code} The algorithm works with arrays and lists, but all of them could be converted to RDDs. was (Author: angellandros): Possible implementation: {code:scala}
import org.apache.spark.mllib.linalg._
import java.util.HashMap

val vas = Array(Array(1.0, 0, 0), Array(1.1, 0, 0), Array(0.9, 0, 0), Array(0, 1.0, 0), Array(0.01, 1.01, 0), Array(0, 0, 1.0), Array(0, 0, 1.1))
val vs = vas.map(Vectors.dense(_))
val t1 = 1.0
val t2 = 0.5
// starting canopy
val map = new HashMap[Vector, Vector] // map from data to clusters
val set = new HashMap[Vector, Boolean] // the set
for(v <- vs) set.put(v, true)
for(v <- vs) {
  if(set.get(v)) {
    val dists = vs.map{ x => (x, Vectors.sqdist(x, v)) }
    dists.foreach { case (x, d) =>
      if(d < t1) map.put(x, v)
      if(d < t2) set.put(x, false)
    }
  }
}
{code} The algorithm works with arrays and lists, but all of them could be converted to RDDs. Add Canopy Clustering Algorithm --- Key: SPARK-3439 URL: https://issues.apache.org/jira/browse/SPARK-3439 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Yu Ishikawa Assignee: Muhammad-Ali A'rabi Priority: Minor The canopy clustering algorithm is an unsupervised pre-clustering algorithm. It is often used as a preprocessing step for the K-means algorithm or the Hierarchical clustering algorithm. It is intended to speed up clustering operations on large data sets, where using another algorithm directly may be impractical due to the size of the data set. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3754) Spark Streaming fileSystem API is not callable from Java
[ https://issues.apache.org/jira/browse/SPARK-3754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14290637#comment-14290637 ] Sean Owen commented on SPARK-3754: -- Is this the same as the issue reported in https://issues.apache.org/jira/browse/SPARK-5297 for fileStream? Spark Streaming fileSystem API is not callable from Java Key: SPARK-3754 URL: https://issues.apache.org/jira/browse/SPARK-3754 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.0.0, 1.1.0 Reporter: holdenk Assignee: Holden Karau Priority: Critical The Spark Streaming Java API for fileSystem is not callable from Java. We should handle it along the lines of how it is handled in the Java Spark Context. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-4289) Creating an instance of Hadoop Job fails in the Spark shell when toString() is called on the instance.
[ https://issues.apache.org/jira/browse/SPARK-4289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-4289. -- Resolution: Not a Problem I suggest this is NotAProblem, at least not something I can see that Spark can address. I think that {{toString()}} failing is a minor Hadoop bug really. There's the {{:silent}} workaround. Creating an instance of Hadoop Job fails in the Spark shell when toString() is called on the instance. -- Key: SPARK-4289 URL: https://issues.apache.org/jira/browse/SPARK-4289 Project: Spark Issue Type: Bug Reporter: Corey J. Nolet This one is easy to reproduce: {code}val job = new Job(sc.hadoopConfiguration){code} I'm not sure offhand what the solution would be, as it's happening when the shell calls toString() on the instance of Job. The problem is that, because of the failure, the instance is never actually assigned to the job val. java.lang.IllegalStateException: Job in state DEFINE instead of RUNNING at org.apache.hadoop.mapreduce.Job.ensureState(Job.java:283) at org.apache.hadoop.mapreduce.Job.toString(Job.java:452) at scala.runtime.ScalaRunTime$.scala$runtime$ScalaRunTime$$inner$1(ScalaRunTime.scala:324) at scala.runtime.ScalaRunTime$.stringOf(ScalaRunTime.scala:329) at scala.runtime.ScalaRunTime$.replStringOf(ScalaRunTime.scala:337) at .<init>(<console>:10) at .<clinit>(<console>) at $print(<console>) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:789) at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1062) at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:615) at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:646) at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:610) at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:814) at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:859) at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:771) at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:616) at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:624) at org.apache.spark.repl.SparkILoop.loop(SparkILoop.scala:629) at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply$mcZ$sp(SparkILoop.scala:954) at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:902) at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:902) at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135) at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:902) at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:997) at org.apache.spark.repl.Main$.main(Main.scala:31) at org.apache.spark.repl.Main.main(Main.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:328) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) -- This message was sent by Atlassian JIRA 
(v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
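For reference, the {{:silent}} workaround mentioned above looks roughly like this in the shell ({{:silent}} toggles the REPL's automatic printing of results, so toString() is never attempted on the assigned value):
{code}
scala> import org.apache.hadoop.mapreduce.Job
scala> :silent
scala> val job = new Job(sc.hadoopConfiguration)
scala> // ... use job ...
scala> :silent
{code}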
[jira] [Commented] (SPARK-4368) Ceph integration?
[ https://issues.apache.org/jira/browse/SPARK-4368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14290644#comment-14290644 ] Sean Owen commented on SPARK-4368: -- I don't think Spark does anything in particular to support GlusterFS; the message you cite just says it works without any special support. I haven't heard Ceph come up. Are you suggesting there is some change that needs to be made to support it? If so, I think you should outline how big the change is. I think the suggestion recently has been that third-party integration projects belong outside the core project, though. Ceph integration? - Key: SPARK-4368 URL: https://issues.apache.org/jira/browse/SPARK-4368 Project: Spark Issue Type: Bug Components: Input/Output Reporter: Serge Smertin There is a use case of storing a large number of relatively small BLOB objects (2-20 MB), which requires some ugly workarounds in HDFS environments. There is a need to process those BLOBs close to the data themselves, which is why the MapReduce paradigm is a good fit, as it guarantees data locality. Ceph seems to be one of the systems that maintains both of these properties (small files and data locality) - http://lists.ceph.com/pipermail/ceph-users-ceph.com/2013-July/032119.html. I know already that Spark supports GlusterFS - http://mail-archives.apache.org/mod_mbox/spark-user/201404.mbox/%3ccf657f2b.5b3a1%25ven...@yarcdata.com%3E So I wonder: could there be an integration with this storage solution, and what would be the effort of doing that? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5309) Reduce Binary/String conversion overhead when reading/writing Parquet files
[ https://issues.apache.org/jira/browse/SPARK-5309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14290655#comment-14290655 ] Apache Spark commented on SPARK-5309: - User 'MickDavies' has created a pull request for this issue: https://github.com/apache/spark/pull/4187 Reduce Binary/String conversion overhead when reading/writing Parquet files --- Key: SPARK-5309 URL: https://issues.apache.org/jira/browse/SPARK-5309 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.2.0 Reporter: MIchael Davies Priority: Minor Converting between Parquet Binary and Java Strings can form a significant proportion of query times. For columns which have repeated String values (which is common), the same Binary is converted repeatedly. A simple change to cache the last converted String per column was shown to reduce query times by 25% when grouping on a data set of 66M rows on a column with many repeated Strings. A possible optimisation would be to hand responsibility for Binary encoding/decoding over to Parquet so that it could ensure that this was done only once per Binary value. The next step is to look at the Parquet code and discuss with that project, which I will do. More details are available in this discussion: http://apache-spark-developers-list.1001551.n3.nabble.com/Optimize-encoding-decoding-strings-when-using-Parquet-td10141.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
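To make the caching idea concrete, a minimal sketch (the class and method names are assumptions, not Spark's actual Parquet converter): remember the last byte array seen for a column and reuse the decoded String when the same value arrives again.
{code:scala}
import java.util.Arrays

// Hypothetical per-column converter: avoids re-decoding UTF-8 bytes when the
// same Binary value repeats, which is common for low-cardinality String columns.
class CachedStringConverter {
  private var lastBytes: Array[Byte] = null
  private var lastString: String = null

  def convert(bytes: Array[Byte]): String = {
    if (lastBytes != null && Arrays.equals(bytes, lastBytes)) {
      lastString
    } else {
      lastBytes = bytes.clone()            // copy, in case the caller reuses its buffer
      lastString = new String(bytes, "UTF-8")
      lastString
    }
  }
}
{code}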