[jira] [Commented] (SPARK-4430) Apache RAT Checks fail spuriously on test files
[ https://issues.apache.org/jira/browse/SPARK-4430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14290750#comment-14290750 ] Apache Spark commented on SPARK-4430: - User 'srowen' has created a pull request for this issue: https://github.com/apache/spark/pull/4189 Apache RAT Checks fail spuriously on test files --- Key: SPARK-4430 URL: https://issues.apache.org/jira/browse/SPARK-4430 Project: Spark Issue Type: Bug Components: Build Reporter: Ryan Williams Several of my recent runs of {{./dev/run-tests}} have failed quickly due to Apache RAT checks, e.g.: {code} $ ./dev/run-tests = Running Apache RAT checks = Could not find Apache license headers in the following files: !? /Users/ryan/c/spark/streaming/FailureSuite/b732c105-4fd3-4330-ba6d-a366b340c303/test/28 !? /Users/ryan/c/spark/streaming/FailureSuite/b732c105-4fd3-4330-ba6d-a366b340c303/test/29 !? /Users/ryan/c/spark/streaming/FailureSuite/b732c105-4fd3-4330-ba6d-a366b340c303/test/30 !? /Users/ryan/c/spark/streaming/FailureSuite/b98beebe-98b0-472a-b4a5-060bcd91e401/test/10 !? /Users/ryan/c/spark/streaming/FailureSuite/b98beebe-98b0-472a-b4a5-060bcd91e401/test/11 !? /Users/ryan/c/spark/streaming/FailureSuite/b98beebe-98b0-472a-b4a5-060bcd91e401/test/12 !? /Users/ryan/c/spark/streaming/FailureSuite/b98beebe-98b0-472a-b4a5-060bcd91e401/test/13 !? /Users/ryan/c/spark/streaming/FailureSuite/b98beebe-98b0-472a-b4a5-060bcd91e401/test/14 !? /Users/ryan/c/spark/streaming/FailureSuite/b98beebe-98b0-472a-b4a5-060bcd91e401/test/15 !? /Users/ryan/c/spark/streaming/FailureSuite/b98beebe-98b0-472a-b4a5-060bcd91e401/test/16 !? /Users/ryan/c/spark/streaming/FailureSuite/b98beebe-98b0-472a-b4a5-060bcd91e401/test/17 !? /Users/ryan/c/spark/streaming/FailureSuite/b98beebe-98b0-472a-b4a5-060bcd91e401/test/18 !? /Users/ryan/c/spark/streaming/FailureSuite/b98beebe-98b0-472a-b4a5-060bcd91e401/test/19 !? /Users/ryan/c/spark/streaming/FailureSuite/b98beebe-98b0-472a-b4a5-060bcd91e401/test/20 !? /Users/ryan/c/spark/streaming/FailureSuite/b98beebe-98b0-472a-b4a5-060bcd91e401/test/21 !? /Users/ryan/c/spark/streaming/FailureSuite/b98beebe-98b0-472a-b4a5-060bcd91e401/test/22 !? /Users/ryan/c/spark/streaming/FailureSuite/b98beebe-98b0-472a-b4a5-060bcd91e401/test/23 !? /Users/ryan/c/spark/streaming/FailureSuite/b98beebe-98b0-472a-b4a5-060bcd91e401/test/24 !? /Users/ryan/c/spark/streaming/FailureSuite/b98beebe-98b0-472a-b4a5-060bcd91e401/test/25 !? /Users/ryan/c/spark/streaming/FailureSuite/b98beebe-98b0-472a-b4a5-060bcd91e401/test/26 !? /Users/ryan/c/spark/streaming/FailureSuite/b98beebe-98b0-472a-b4a5-060bcd91e401/test/27 !? /Users/ryan/c/spark/streaming/FailureSuite/b98beebe-98b0-472a-b4a5-060bcd91e401/test/28 !? /Users/ryan/c/spark/streaming/FailureSuite/b98beebe-98b0-472a-b4a5-060bcd91e401/test/29 !? /Users/ryan/c/spark/streaming/FailureSuite/b98beebe-98b0-472a-b4a5-060bcd91e401/test/30 !? /Users/ryan/c/spark/streaming/FailureSuite/b98beebe-98b0-472a-b4a5-060bcd91e401/test/7 !? /Users/ryan/c/spark/streaming/FailureSuite/b98beebe-98b0-472a-b4a5-060bcd91e401/test/8 !? /Users/ryan/c/spark/streaming/FailureSuite/b98beebe-98b0-472a-b4a5-060bcd91e401/test/9 [error] Got a return code of 1 on line 114 of the run-tests script. {code} I think it's fair to say that these are not useful errors for {{run-tests}} to crash on. Ideally we could tell the linter which files we care about having it lint and which we don't. 
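Editor's note: the stray files above are scratch output left behind by the streaming {{FailureSuite}} under the source tree, which is why RAT picks them up. As an illustration of the general remedy (not necessarily what the linked pull request does), a test can keep such scratch files out of RAT's view by writing them under the JVM temp directory and deleting them afterwards:
{code}
import java.nio.file.{Files, Path}
import scala.util.control.NonFatal

object TempDirExample {
  // Create a scratch directory outside the source tree, use it, then delete it.
  def withTempDir[T](prefix: String)(body: Path => T): T = {
    val dir = Files.createTempDirectory(prefix) // lives under java.io.tmpdir, not the repo
    try {
      body(dir)
    } finally {
      // Best-effort recursive cleanup so no unlicensed files are left behind.
      try {
        import scala.collection.JavaConverters._
        Files.walk(dir).iterator().asScala.toSeq.reverse.foreach(Files.deleteIfExists(_))
      } catch { case NonFatal(_) => () }
    }
  }

  def main(args: Array[String]): Unit = {
    withTempDir("spark-failure-suite-") { dir =>
      val f = Files.createTempFile(dir, "test", ".dat")
      println(s"wrote scratch file: $f")
    }
  }
}
{code}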
[jira] [Commented] (SPARK-5393) Flood of util.RackResolver log messages after SPARK-1714
[ https://issues.apache.org/jira/browse/SPARK-5393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14290851#comment-14290851 ] Apache Spark commented on SPARK-5393: - User 'sryza' has created a pull request for this issue: https://github.com/apache/spark/pull/4192 Flood of util.RackResolver log messages after SPARK-1714 Key: SPARK-5393 URL: https://issues.apache.org/jira/browse/SPARK-5393 Project: Spark Issue Type: Bug Affects Versions: 1.3.0 Reporter: Sandy Ryza Assignee: Sandy Ryza Priority: Critical I thought I fixed this while working on the patch, but [~laserson] seems to have encountered it when running on master. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
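Editor's note: for anyone hit by this before a fix is merged, the flood comes from Hadoop's {{org.apache.hadoop.yarn.util.RackResolver}} logging a line per lookup at INFO. A minimal workaround sketch is to raise just that logger's level before the YARN allocator starts (the same effect can be had with {{log4j.logger.org.apache.hadoop.yarn.util.RackResolver=WARN}} in log4j.properties); the eventual patch may or may not do exactly this:
{code}
import org.apache.log4j.{Level, Logger}

object QuietRackResolver {
  // Raise only the chatty Hadoop RackResolver logger to WARN; everything else is untouched.
  def silence(): Unit = {
    Logger.getLogger("org.apache.hadoop.yarn.util.RackResolver").setLevel(Level.WARN)
  }
}
{code}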
[jira] [Commented] (SPARK-1714) Take advantage of AMRMClient APIs to simplify logic in YarnAllocationHandler
[ https://issues.apache.org/jira/browse/SPARK-1714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14290852#comment-14290852 ] Apache Spark commented on SPARK-1714: - User 'sryza' has created a pull request for this issue: https://github.com/apache/spark/pull/4192 Take advantage of AMRMClient APIs to simplify logic in YarnAllocationHandler Key: SPARK-1714 URL: https://issues.apache.org/jira/browse/SPARK-1714 Project: Spark Issue Type: Improvement Components: YARN Affects Versions: 1.2.0 Reporter: Sandy Ryza Assignee: Sandy Ryza Fix For: 1.3.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4964) Exactly-once semantics for Kafka
[ https://issues.apache.org/jira/browse/SPARK-4964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14290747#comment-14290747 ] Cody Koeninger commented on SPARK-4964: --- Design doc at https://docs.google.com/a/databricks.com/document/d/1IuvZhg9cOueTf1mq4qwc1fhPb5FVcaRLcyjrtG4XU1k/edit?usp=sharing Exactly-once semantics for Kafka Key: SPARK-4964 URL: https://issues.apache.org/jira/browse/SPARK-4964 Project: Spark Issue Type: Improvement Components: Streaming Reporter: Cody Koeninger for background, see http://apache-spark-developers-list.1001551.n3.nabble.com/Which-committers-care-about-Kafka-td9827.html Requirements: - allow client code to implement exactly-once end-to-end semantics for Kafka messages, in cases where their output storage is either idempotent or transactional - allow client code access to Kafka offsets, rather than automatically committing them - do not assume Zookeeper as a repository for offsets (for the transactional case, offsets need to be stored in the same store as the data) - allow failure recovery without lost or duplicated messages, even in cases where a checkpoint cannot be restored (for instance, because code must be updated) Design: The basic idea is to make an rdd where each partition corresponds to a given Kafka topic, partition, starting offset, and ending offset. That allows for deterministic replay of data from Kafka (as long as there is enough log retention). Client code is responsible for committing offsets, either transactionally to the same store that data is being written to, or in the case of idempotent data, after data has been written. PR of a sample implementation for both the batch and dstream case is forthcoming. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
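Editor's note: to make the partition-per-offset-range idea concrete, here is an illustrative sketch; the names ({{OffsetRange}} and so on) are hypothetical and are not the API of the forthcoming PR:
{code}
// Sketch of the idea (names hypothetical): each partition of the batch is a fixed
// Kafka offset range, so the same records can be re-read deterministically on retry.
case class OffsetRange(topic: String, partition: Int, fromOffset: Long, untilOffset: Long) {
  def count: Long = untilOffset - fromOffset
}

object OffsetRangeExample {
  def main(args: Array[String]): Unit = {
    // One RDD partition per (topic, partition, offset range); the client commits
    // untilOffset to its own store only after the output for the range is written.
    val ranges = Seq(
      OffsetRange("events", partition = 0, fromOffset = 100L, untilOffset = 200L),
      OffsetRange("events", partition = 1, fromOffset = 250L, untilOffset = 400L))
    ranges.foreach(r => println(s"$r -> ${r.count} messages"))
  }
}
{code}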
[jira] [Commented] (SPARK-2285) Give various TaskEndReason subclass more descriptive names
[ https://issues.apache.org/jira/browse/SPARK-2285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14290785#comment-14290785 ] Apache Spark commented on SPARK-2285: - User 'srowen' has created a pull request for this issue: https://github.com/apache/spark/pull/4191 Give various TaskEndReason subclass more descriptive names -- Key: SPARK-2285 URL: https://issues.apache.org/jira/browse/SPARK-2285 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.0.0 Reporter: Reynold Xin Priority: Minor It is just strange to have org.apache.spark.Success be a TaskEndReason. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-4831) Current directory always on classpath with spark-submit
[ https://issues.apache.org/jira/browse/SPARK-4831?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-4831. -- Resolution: Fixed Fix Version/s: 1.3.0 Assignee: Daniel Darabos Looks like this was merged in https://github.com/apache/spark/commit/7cb3f54793124c527d62906c565aba2c3544e422 Current directory always on classpath with spark-submit --- Key: SPARK-4831 URL: https://issues.apache.org/jira/browse/SPARK-4831 Project: Spark Issue Type: Bug Components: Deploy Affects Versions: 1.1.1, 1.2.0 Reporter: Daniel Darabos Assignee: Daniel Darabos Priority: Minor Fix For: 1.3.0 We had a situation where we were launching an application with spark-submit, and a file (play.plugins) was on the classpath twice, causing problems (trying to register plugins twice). Upon investigating how it got on the classpath twice, we found that it was present in one of our jars, and also in the current working directory. But the one in the current working directory should not be on the classpath. We never asked spark-submit to put the current directory on the classpath. I think this is caused by a line in [compute-classpath.sh|https://github.com/apache/spark/blob/v1.2.0-rc2/bin/compute-classpath.sh#L28]: {code} CLASSPATH=$SPARK_CLASSPATH:$SPARK_SUBMIT_CLASSPATH {code} Now if SPARK_CLASSPATH is empty, the empty string is added to the classpath, which means the current working directory. We tried setting SPARK_CLASSPATH to a bogus value, but that is [not allowed|https://github.com/apache/spark/blob/v1.2.0-rc2/core/src/main/scala/org/apache/spark/SparkConf.scala#L312]. What is the right solution? Only add SPARK_CLASSPATH if it's non-empty? I can send a pull request for that I think. Thanks! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4147) Reduce log4j dependency
[ https://issues.apache.org/jira/browse/SPARK-4147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14290752#comment-14290752 ] Apache Spark commented on SPARK-4147: - User 'srowen' has created a pull request for this issue: https://github.com/apache/spark/pull/4190 Reduce log4j dependency --- Key: SPARK-4147 URL: https://issues.apache.org/jira/browse/SPARK-4147 Project: Spark Issue Type: Wish Components: Spark Core Affects Versions: 1.1.0 Reporter: Tobias Pfeiffer spark-core has a hard dependency on log4j, which shouldn't be necessary since slf4j is used. I tried to exclude slf4j-log4j12 and log4j dependencies in my sbt file. Excluding org.slf4j.slf4j-log4j12 works fine if logback is on the classpath. However, removing the log4j dependency fails because in https://github.com/apache/spark/blob/v1.1.0/core/src/main/scala/org/apache/spark/Logging.scala#L121 a static method of org.apache.log4j.LogManager is accessed *even if* log4j is not in use. I guess removing all dependencies on log4j may be a bigger task, but it would be a great help if the access to LogManager would be done only if log4j use was detected before. (This is a 2-line change.) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
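Editor's note: the guard being asked for amounts to checking which backend slf4j is bound to before touching {{org.apache.log4j.LogManager}}. A rough sketch of such a check (the actual patch may differ):
{code}
import org.slf4j.LoggerFactory

object Log4jGuardSketch {
  // True only when slf4j is actually bound to log4j (slf4j-log4j12 on the classpath).
  def usingLog4j: Boolean =
    LoggerFactory.getILoggerFactory.getClass.getName == "org.slf4j.impl.Log4jLoggerFactory"

  // Touch org.apache.log4j.LogManager only behind the guard, so users of logback
  // (or any other backend) never exercise the hard log4j dependency at this point.
  def maybeInitializeLog4j(): Unit = {
    if (usingLog4j && !org.apache.log4j.LogManager.getRootLogger.getAllAppenders.hasMoreElements()) {
      // ...install a default console appender here...
    }
  }
}
{code}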
[jira] [Updated] (SPARK-2280) Java Scala reference docs should describe function reference behavior.
[ https://issues.apache.org/jira/browse/SPARK-2280?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-2280: - Priority: Minor (was: Major) Assignee: Sean Owen I'd like to work on this, but would a change to add a bunch of {{@tparam}} in all the RDD classes be welcome, or too much merge noise? It's not hard to describe all of these params. Java Scala reference docs should describe function reference behavior. Key: SPARK-2280 URL: https://issues.apache.org/jira/browse/SPARK-2280 Project: Spark Issue Type: Improvement Components: Documentation Affects Versions: 1.0.0 Reporter: Hans Uhlig Assignee: Sean Owen Priority: Minor Example: {{<K> JavaPairRDD<K,Iterable<T>> groupBy(Function<T,K> f)}} Return an RDD of grouped elements. Each group consists of a key and a sequence of elements mapping to that key. T and K are not described and there is no explanation of what the function's inputs and outputs should be and how GroupBy uses this information. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
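Editor's note: a short usage example shows what the undocumented type parameters mean: {{T}} is the RDD's element type, {{K}} is whatever the supplied function returns, and every element mapping to the same key ends up in one group:
{code}
import org.apache.spark.{SparkConf, SparkContext}

object GroupByTypesExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("groupBy-types").setMaster("local[2]"))
    // T = String (the RDD's element type), K = Int (what the function returns):
    // groupBy(f: T => K) yields RDD[(K, Iterable[T])].
    val words = sc.parallelize(Seq("spark", "scala", "sbt", "yarn"))
    val byLength = words.groupBy(w => w.length)   // RDD[(Int, Iterable[String])]
    byLength.collect().foreach { case (len, ws) => println(s"$len -> ${ws.mkString(", ")}") }
    sc.stop()
  }
}
{code}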
[jira] [Resolved] (SPARK-1960) EOFException when file size 0 exists when use sc.sequenceFile[K,V](path)
[ https://issues.apache.org/jira/browse/SPARK-1960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-1960. -- Resolution: Not a Problem An empty {{SequenceFile}} will still contain some header info. For example when I write an empty one (configured to contain {{LongWritable}}) I get roughly: {code} SEQ^F!org.apache.hadoop.io.LongWritable!org.apache.hadoop.io.LongWritable^A^@*org.apache.hadoop.io.compress.DefaultCodec^@^@^@^@ï9cp84º74K=æÅ3!92^A^F {code} So a zero-length file is indeed a malformed {{SequenceFile}}, and I don't think this is a bug. An error is correct. Reopen if I misunderstand. EOFException when file size 0 exists when use sc.sequenceFile[K,V](path) -- Key: SPARK-1960 URL: https://issues.apache.org/jira/browse/SPARK-1960 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.0 Reporter: Eunsu Yun java.io.EOFException is thrown when using sc.sequenceFile[K,V] if there is a file whose size is 0. I also tested sc.textFile() in the same condition and it does not throw EOFException. val text = sc.sequenceFile[Long, String]("data-gz/*.dat.gz") val result = text.filter(filterValid) result.saveAsTextFile("data-out/") -- java.io.EOFException at java.io.DataInputStream.readFully(DataInputStream.java:197) at java.io.DataInputStream.readFully(DataInputStream.java:169) at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1845) at org.apache.hadoop.io.SequenceFile$Reader.initialize(SequenceFile.java:1810) at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1759) at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1773) at org.apache.hadoop.mapred.SequenceFileRecordReader.init(SequenceFileRecordReader.java:49) at org.apache.hadoop.mapred.SequenceFileInputFormat.getRecordReader(SequenceFileInputFormat.java:64) at org.apache.spark.rdd.HadoopRDD$$anon$1.init(HadoopRDD.scala:156) at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:149) at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:64) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241) at org.apache.spark.rdd.RDD.iterator(RDD.scala:232) at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241) at org.apache.spark.rdd.RDD.iterator(RDD.scala:232) at org.apache.spark.rdd.FilteredRDD.compute(FilteredRDD.scala:33) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241) at org.apache.spark.rdd.RDD.iterator(RDD.scala:232) .. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
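Editor's note: a workaround for this situation is to expand the glob yourself and pass only non-empty files to {{sc.sequenceFile}}. A sketch using the Hadoop FileSystem API (the helper name is made up):
{code}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.SparkContext

object NonEmptySequenceFiles {
  // Expand a glob and keep only files with a non-zero length, then hand the
  // remaining paths to sc.sequenceFile as a comma-separated list.
  def nonEmptyPaths(sc: SparkContext, glob: String): Seq[String] = {
    val conf = new Configuration(sc.hadoopConfiguration)
    val path = new Path(glob)
    val fs = path.getFileSystem(conf)
    val statuses = Option(fs.globStatus(path)).map(_.toSeq).getOrElse(Seq.empty)
    statuses.filter(s => s.isFile && s.getLen > 0).map(_.getPath.toString)
  }
}

// Usage (assumes an existing SparkContext `sc`):
//   val paths = NonEmptySequenceFiles.nonEmptyPaths(sc, "data-gz/*.dat.gz")
//   val text  = sc.sequenceFile[Long, String](paths.mkString(","))
{code}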
[jira] [Commented] (SPARK-5388) Provide a stable application submission gateway
[ https://issues.apache.org/jira/browse/SPARK-5388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14290847#comment-14290847 ] Dale Richardson commented on SPARK-5388: Hi Andrew, I think the idea is well worth considering. In response to the requirement of making it easier for client-master communications to pass through restrictive firewalling, have you considered just using Akka's REST gateway (http://doc.akka.io/docs/akka/1.3.1/scala/http.html)? I also have a question: is there an intention for other entities (such as job servers) to communicate with the master at all? If so then the proposed gateway is semantically defined at a fairly low level (just RPC over JSON/HTTP). This is fine if the interface is not going to be exposed to anybody who is not a spark developer with detailed knowledge of spark internals. Did you use the term “REST” to simply mean RPC over JSON/HTTP? Creating a REST interface is more than an HTTP RPC gateway. If the interface is going to be exposed to 3rd parties (such as developers of Job servers and web notebooks etc.) then there is a benefit to simplifying some of the exposed application semantics, and exposing an API that is more integrated with HTTP’s protocol semantics which most people are already familiar with - this is what a true REST interface does, and if you are defining an endpoint for others to use, it is a very powerful concept that allows other people to quickly grasp how to properly use the exposed interface. A rough sketch of a more “REST”ed version of the API would be: *Submit_driver_request* HTTP POST JSON body of request http://host:port/SparkMaster?SubmitDriver Responds with standard HTTP Response including allocated DRIVER_ID if driver submission ok, http error codes with spark specific error if not. *Get status of DRIVER* HTTP GET http://host:port/SparkMaster/Drivers/DRIVER_ID Responds with JSON body containing information on driver execution. If no record of driver_id, then http error code 404 (Not Found) is returned. *Kill Driver request* HTTP DELETE http://host:port/SparkMaster/Drivers/DRIVER_ID Responds with JSON body containing information on driver kill request, or http error code if an error occurs. I would be happy to prototype something like this up to test the concept out for you if you are looking for something more than just RPC over JSON/HTTP. Provide a stable application submission gateway --- Key: SPARK-5388 URL: https://issues.apache.org/jira/browse/SPARK-5388 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0 Reporter: Andrew Or Assignee: Andrew Or Priority: Blocker Attachments: Stable Spark Standalone Submission.pdf The existing submission gateway in standalone mode is not compatible across Spark versions. If you have a newer version of Spark submitting to an older version of the standalone Master, it is currently not guaranteed to work. The goal is to provide a stable REST interface to replace this channel. The first cut implementation will target standalone cluster mode because there are very few messages exchanged. The design, however, will be general enough to eventually support this for other cluster managers too. Note that this is not necessarily required in YARN because we already use YARN's stable interface to submit applications there. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
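Editor's note: to illustrate how thin a client for the endpoints sketched above would be, here is a hypothetical Scala snippet; the URL layout and JSON shape are the commenter's proposal, not an existing Spark API:
{code}
import java.io.{BufferedReader, InputStreamReader, OutputStreamWriter}
import java.net.{HttpURLConnection, URL}

object RestSubmitSketch {
  // Illustrative client for the proposed endpoints; GET/DELETE on
  // /SparkMaster/Drivers/DRIVER_ID would follow the same pattern.
  def submitDriver(master: String, requestJson: String): String = {
    val conn = new URL(s"$master/SparkMaster?SubmitDriver").openConnection()
      .asInstanceOf[HttpURLConnection]
    conn.setRequestMethod("POST")
    conn.setRequestProperty("Content-Type", "application/json")
    conn.setDoOutput(true)
    val out = new OutputStreamWriter(conn.getOutputStream, "UTF-8")
    try out.write(requestJson) finally out.close()
    val in = new BufferedReader(new InputStreamReader(conn.getInputStream, "UTF-8"))
    try Iterator.continually(in.readLine()).takeWhile(_ != null).mkString("\n")
    finally in.close()
  }
}
{code}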
[jira] [Resolved] (SPARK-4697) System properties should override environment variables
[ https://issues.apache.org/jira/browse/SPARK-4697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-4697. -- Resolution: Fixed Fix Version/s: 1.3.0 This looks like it was fixed in https://github.com/apache/spark/commit/9dea64e53ad8df8a3160c0f4010811af1e73dd6f System properties should override environment variables --- Key: SPARK-4697 URL: https://issues.apache.org/jira/browse/SPARK-4697 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.0.0 Reporter: WangTaoTheTonic Assignee: WangTaoTheTonic Fix For: 1.3.0 I found some arguments in yarn module take environment variables before system properties while the latter override the former in core module. This should be changed in org.apache.spark.deploy.yarn.ClientArguments and org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
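Editor's note: the precedence being asked for is simply: system property if set, otherwise environment variable, otherwise a default. A generic sketch of that ordering (the key names are illustrative):
{code}
object ConfPrecedence {
  // System property first, environment variable second, then a default.
  // Mirrors the precedence this issue asks the YARN argument parsing to follow.
  def resolve(propKey: String, envKey: String, default: String): String =
    sys.props.get(propKey)
      .orElse(sys.env.get(envKey))
      .getOrElse(default)

  def main(args: Array[String]): Unit = {
    val queue = resolve("spark.yarn.queue", "SPARK_YARN_QUEUE", "default")
    println(s"queue = $queue")
  }
}
{code}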
[jira] [Resolved] (SPARK-1029) spark Window shell script errors regarding shell script location reference
[ https://issues.apache.org/jira/browse/SPARK-1029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-1029. -- Resolution: Fixed Fix Version/s: 1.0.0 Looks like this was fixed in https://github.com/apache/spark/commit/4e510b0b0c8a69cfe0ee037b37661caf9bf1d057 for 1.0.0 spark Window shell script errors regarding shell script location reference -- Key: SPARK-1029 URL: https://issues.apache.org/jira/browse/SPARK-1029 Project: Spark Issue Type: Bug Components: Windows Affects Versions: 0.9.0 Reporter: Qiuzhuang Lian Priority: Minor Fix For: 1.0.0 When launching spark-shell.cmd in Windows 7, I got the following errors: E:\projects\amplab\incubator-spark>bin\spark-shell.cmd 'E:\projects\amplab\incubator-spark\bin\..\sbin\spark-class2.cmd' is not recognized as an internal or external command, operable program or batch file. E:\projects\amplab\incubator-spark>bin\spark-shell.cmd 'E:\projects\amplab\incubator-spark\bin\..\sbin\compute-classpath.cmd' is not recognized as an internal or external command, operable program or batch file. Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/spark/repl/Main I am attaching my patches. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4147) Reduce log4j dependency
[ https://issues.apache.org/jira/browse/SPARK-4147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-4147: - Summary: Reduce log4j dependency (was: Remove log4j dependency) Reduce log4j dependency --- Key: SPARK-4147 URL: https://issues.apache.org/jira/browse/SPARK-4147 Project: Spark Issue Type: Wish Components: Spark Core Affects Versions: 1.1.0 Reporter: Tobias Pfeiffer spark-core has a hard dependency on log4j, which shouldn't be necessary since slf4j is used. I tried to exclude slf4j-log4j12 and log4j dependencies in my sbt file. Excluding org.slf4j.slf4j-log4j12 works fine if logback is on the classpath. However, removing the log4j dependency fails because in https://github.com/apache/spark/blob/v1.1.0/core/src/main/scala/org/apache/spark/Logging.scala#L121 a static method of org.apache.log4j.LogManager is accessed *even if* log4j is not in use. I guess removing all dependencies on log4j may be a bigger task, but it would be a great help if the access to LogManager would be done only if log4j use was detected before. (This is a 2-line change.) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4147) Remove log4j dependency
[ https://issues.apache.org/jira/browse/SPARK-4147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14290751#comment-14290751 ] Sean Owen commented on SPARK-4147: -- [~tgpfeiffer] Yeah that's a good change, since all code hits {{Logging}} quickly. It is certainly not the only direct use of log4j, but maybe this actually makes the issue go away for some subset of use cases. I'll make a PR. [~nemccarthy] I don't think it forces log4j on callers, since you can reroute calls to log4j to slf4j. Yes it's extra plumbing. There's not another way to control log levels though, since there is no API for it in slf4j. Remove log4j dependency --- Key: SPARK-4147 URL: https://issues.apache.org/jira/browse/SPARK-4147 Project: Spark Issue Type: Wish Components: Spark Core Affects Versions: 1.1.0 Reporter: Tobias Pfeiffer spark-core has a hard dependency on log4j, which shouldn't be necessary since slf4j is used. I tried to exclude slf4j-log4j12 and log4j dependencies in my sbt file. Excluding org.slf4j.slf4j-log4j12 works fine if logback is on the classpath. However, removing the log4j dependency fails because in https://github.com/apache/spark/blob/v1.1.0/core/src/main/scala/org/apache/spark/Logging.scala#L121 a static method of org.apache.log4j.LogManager is accessed *even if* log4j is not in use. I guess removing all dependencies on log4j may be a bigger task, but it would be a great help if the access to LogManager would be done only if log4j use was detected before. (This is a 2-line change.) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-4491) Using sbt assembly with spark as dep requires Phd in sbt
[ https://issues.apache.org/jira/browse/SPARK-4491?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-4491. -- Resolution: Won't Fix I don't see a reason here that indicates the Spark build should change its behavior. I do agree that this build and its deps are hairy, and harmonizing dependencies within Spark and with an app is often a deep rabbit hole. Using sbt assembly with spark as dep requires Phd in sbt Key: SPARK-4491 URL: https://issues.apache.org/jira/browse/SPARK-4491 Project: Spark Issue Type: Question Reporter: sam I get the dreaded deduplicate error from sbt. I resolved the issue (I think, I managed to run the SimpleApp example) here http://stackoverflow.com/a/27018691/1586965 My question is, is this wise? What is wrong with changing the `deduplicate` bit to `first`. Why isn't it this by default? If this isn't the way to make it work, please could someone provide an explanation of the correct way with .sbt examples. Having googled, every example I see is different because it changes depending on what deps the person has ... surely there has to be an automagic way of doing it (if my way isn't it)? One final point, SBT seems to be blaming Spark for causing the problem in their documentation: https://github.com/sbt/sbt-assembly is this fair? Is Spark doing something wrong in the way they build their jars? Or should SBT be renamed to CBT (Complicated Build Tool that will make you need Cognitive Behavioural Therapy after use). NOTE: Satire JFF, really I love both SBT Spark :) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
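Editor's note: the usual way to resolve these conflicts without globally switching everything to {{first}} is to mark Spark as a {{provided}} dependency (so it is not bundled into the assembly at all) and give sbt-assembly an explicit merge strategy for whatever overlap remains. A sketch of a {{build.sbt}} fragment, assuming a recent sbt-assembly plugin (key names vary across plugin versions):
{code}
// build.sbt fragment (sketch). Marking Spark "provided" keeps it out of the fat jar,
// which avoids most deduplicate errors; the merge strategy handles remaining overlap.
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.2.0" % "provided"

assemblyMergeStrategy in assembly := {
  case PathList("META-INF", xs @ _*) => MergeStrategy.discard
  case "play.plugins"                => MergeStrategy.concat   // concatenable resource file
  case x =>
    val oldStrategy = (assemblyMergeStrategy in assembly).value
    oldStrategy(x)
}
{code}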
[jira] [Created] (SPARK-5398) Support the eu-central-1 region for spark-ec2
Nicholas Chammas created SPARK-5398: --- Summary: Support the eu-central-1 region for spark-ec2 Key: SPARK-5398 URL: https://issues.apache.org/jira/browse/SPARK-5398 Project: Spark Issue Type: Improvement Components: EC2 Reporter: Nicholas Chammas Priority: Minor {{spark-ec2}} [doesn't currently support|https://github.com/mesos/spark-ec2/tree/branch-1.3/ami-list] the {{eu-central-1}} region. You can see the [full list of EC2 regions here|http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-regions-availability-zones.html]. {{eu-central-1}} is the only one missing as of Jan 2015. ({{cn-north-1}}, for some reason, is not listed there.) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5399) tree Losses strings should match loss names
Joseph K. Bradley created SPARK-5399: Summary: tree Losses strings should match loss names Key: SPARK-5399 URL: https://issues.apache.org/jira/browse/SPARK-5399 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.2.0, 1.2.1 Reporter: Joseph K. Bradley Priority: Minor tree.loss.Losses.fromString expects certain String names for losses. These do not match the names of the loss classes but should. I believe these strings were the original names of the losses, and we forgot to correct the strings when we renamed the losses. Currently: {code} case "leastSquaresError" => SquaredError case "leastAbsoluteError" => AbsoluteError case "logLoss" => LogLoss {code} Proposed: {code} case "SquaredError" => SquaredError case "AbsoluteError" => AbsoluteError case "LogLoss" => LogLoss {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
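Editor's note: if backward compatibility matters, the rename could also accept the old strings as aliases. A self-contained sketch of that option (whether Spark keeps the old names is a separate decision, and the types below are defined locally for illustration):
{code}
object LossesSketch {
  sealed trait Loss
  case object SquaredError extends Loss
  case object AbsoluteError extends Loss
  case object LogLoss extends Loss

  // New names preferred, old names still recognized.
  def fromString(name: String): Loss = name match {
    case "SquaredError"  | "leastSquaresError"  => SquaredError
    case "AbsoluteError" | "leastAbsoluteError" => AbsoluteError
    case "LogLoss"       | "logLoss"            => LogLoss
    case _ => throw new IllegalArgumentException(s"Did not recognize loss name: $name")
  }
}
{code}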
[jira] [Updated] (SPARK-3359) `sbt/sbt unidoc` doesn't work with Java 8
[ https://issues.apache.org/jira/browse/SPARK-3359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-3359: - Assignee: (was: Sean Owen) I spent more time on this tonight, mostly looking at the {{genjavadoc}} code, and I don't think this can be made to work, not without both touching up hundreds of scaladoc comments, and overhauling {{genjavadoc}}. The rough translation it does works for javadoc 7, but not nearly for the stricter javadoc 8. It wouldn't be a matter of small fixes. Realistically I'd suggest using javadoc 7, or, altering the doc generation to produce javadoc and scaladoc separately rather than try to get unidoc to work. In the meantime I can submit a PR with a number of small fixes that at least resolve more javadoc 8 errors. `sbt/sbt unidoc` doesn't work with Java 8 - Key: SPARK-3359 URL: https://issues.apache.org/jira/browse/SPARK-3359 Project: Spark Issue Type: Bug Components: Documentation Affects Versions: 1.1.0 Reporter: Xiangrui Meng Priority: Minor It seems that Java 8 is stricter on JavaDoc. I got many error messages like {code} [error] /Users/meng/src/spark-mengxr/core/target/java/org/apache/hadoop/mapred/SparkHadoopMapRedUtil.java:2: error: modifier private not allowed here [error] private abstract interface SparkHadoopMapRedUtil { [error] ^ {code} This is minor because we can always use Java 6/7 to generate the doc. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5400) Rename GaussianMixtureEM to GaussianMixture
[ https://issues.apache.org/jira/browse/SPARK-5400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14290971#comment-14290971 ] Travis Galoppo commented on SPARK-5400: --- Hmm. This has me thinking in a different direction. We could generalize the expectation-maximization algorithm to work with any mixture model supporting a set of necessary likelihood compute/update methods... then we could ask for, e.g., new ExpectationMaximization[GaussianMixtureModel]. This would de-couple the model and the algorithm, and could open the door for the implementation to be applied to (for instance) tomographic image reconstruction (which seems like a great fit for Spark given the volume of data involved). Rename GaussianMixtureEM to GaussianMixture --- Key: SPARK-5400 URL: https://issues.apache.org/jira/browse/SPARK-5400 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Priority: Minor GaussianMixtureEM is following the old naming convention of including the optimization algorithm name in the class title. We should probably rename it to GaussianMixture so that it can use other optimization algorithms in the future. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
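Editor's note: a rough sketch of the decoupling described above: the EM driver only needs an E-step, an M-step, and a log-likelihood from the model type. All names below are hypothetical, not an MLlib API:
{code}
// Rough sketch of the idea: the EM loop depends only on an abstract "mixture model"
// that can score responsibilities (E-step) and re-estimate itself from them (M-step).
trait EMModel[Model, Datum] {
  def expectation(model: Model, data: Seq[Datum]): Seq[Array[Double]]  // responsibilities
  def maximization(data: Seq[Datum], responsibilities: Seq[Array[Double]]): Model
  def logLikelihood(model: Model, data: Seq[Datum]): Double
}

class ExpectationMaximization[Model, Datum](ops: EMModel[Model, Datum],
                                            maxIterations: Int = 100,
                                            tol: Double = 1e-3) {
  def run(initial: Model, data: Seq[Datum]): Model = {
    var model = initial
    var previous = ops.logLikelihood(model, data)
    var iter = 0
    var converged = false
    while (iter < maxIterations && !converged) {
      val resp = ops.expectation(model, data)   // E-step
      model = ops.maximization(data, resp)      // M-step
      val current = ops.logLikelihood(model, data)
      converged = math.abs(current - previous) < tol
      previous = current
      iter += 1
    }
    model
  }
}
{code}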
[jira] [Commented] (SPARK-5401) Executor ID should be set before MetricsSystem is created
[ https://issues.apache.org/jira/browse/SPARK-5401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14291007#comment-14291007 ] Apache Spark commented on SPARK-5401: - User 'ryan-williams' has created a pull request for this issue: https://github.com/apache/spark/pull/4194 Executor ID should be set before MetricsSystem is created - Key: SPARK-5401 URL: https://issues.apache.org/jira/browse/SPARK-5401 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0 Reporter: Ryan Williams MetricsSystem construction [attempts to namespace metrics from each executor using that executor's ID|https://github.com/apache/spark/blob/0d1e67ee9b29b51bccfc8a319afe9f9b4581afc8/core/src/main/scala/org/apache/spark/metrics/MetricsSystem.scala#L131]. The ID is [currently set at Executor construction time|https://github.com/apache/spark/blob/0d1e67ee9b29b51bccfc8a319afe9f9b4581afc8/core/src/main/scala/org/apache/spark/executor/Executor.scala#L76-L79] (uncoincidentally, just before the {{ExecutorSource}} is registered), but this is after the {{MetricsSystem}} has been initialized (which [happens during {{SparkEnv}} construction|https://github.com/apache/spark/blob/0d1e67ee9b29b51bccfc8a319afe9f9b4581afc8/core/src/main/scala/org/apache/spark/SparkEnv.scala#L323-L332], which itself happens during {{ExecutorBackend}} construction, *before* {{Executor}} construction). I noticed this problem because I wasn't seeing any JVM metrics from my executors in a Graphite dashboard I've set up; turns out all the executors (and the driver) were namespacing their metrics under driver, and Graphite responds to such a situation by only taking the last value it receives for each metric within a configurable time window (e.g. 10s). I was seeing per-executor metrics, properly namespaced with each executor's ID, from {{ExecutorSource}}, which as I mentioned above is registered after the executor ID is set. I have a one-line fix for this that I will submit shortly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5402) Log executor ID at executor-construction time
[ https://issues.apache.org/jira/browse/SPARK-5402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14291010#comment-14291010 ] Apache Spark commented on SPARK-5402: - User 'ryan-williams' has created a pull request for this issue: https://github.com/apache/spark/pull/4195 Log executor ID at executor-construction time - Key: SPARK-5402 URL: https://issues.apache.org/jira/browse/SPARK-5402 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.2.0 Reporter: Ryan Williams Priority: Minor One stumbling block I've hit while debugging Spark-on-YARN jobs is that {{yarn logs}} presents each executor's stderr/stdout by container name, but I often need to find the logs for a specific executor ID; the executor ID isn't printed anywhere convenient in each executor's logs, afaict. I added a simple {{logInfo}} to {{Executor.scala}} locally and it's been useful, so I'd like to merge it upstream. PR forthcoming. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5401) Executor ID should be set before MetricsSystem is created
[ https://issues.apache.org/jira/browse/SPARK-5401?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan Williams updated SPARK-5401: - Description: MetricsSystem construction [attempts to namespace metrics from each executor using that executor's ID|https://github.com/apache/spark/blob/0d1e67ee9b29b51bccfc8a319afe9f9b4581afc8/core/src/main/scala/org/apache/spark/metrics/MetricsSystem.scala#L131]. The ID is [currently set at Executor construction time|https://github.com/apache/spark/blob/0d1e67ee9b29b51bccfc8a319afe9f9b4581afc8/core/src/main/scala/org/apache/spark/executor/Executor.scala#L76-L79] (uncoincidentally, just before the {{ExecutorSource}} is registered), but this is after the {{MetricsSystem}} has been initialized (which [happens during {{SparkEnv}} construction|https://github.com/apache/spark/blob/0d1e67ee9b29b51bccfc8a319afe9f9b4581afc8/core/src/main/scala/org/apache/spark/SparkEnv.scala#L323-L332], which itself happens during {{ExecutorBackend}} construction, *before* {{Executor}} construction). I noticed this problem because I wasn't seeing any JVM metrics from my executors in a Graphite dashboard I've set up; turns out all the executors (and the driver) were namespacing their metrics under driver, and Graphite responds to such a situation by only taking the last value it receives for each metric within a configurable time window (e.g. 10s). I was seeing per-executor metrics, properly namespaced with each executor's ID, from {{ExecutorSource}}, which as I mentioned above is registered after the executor ID is set. I have a one-line fix for this that I will submit shortly. was: MetricsSystem construction [attempts to namespace metrics from each executor using that executor's ID|https://github.com/apache/spark/blob/0d1e67ee9b29b51bccfc8a319afe9f9b4581afc8/core/src/main/scala/org/apache/spark/metrics/MetricsSystem.scala#L131]. The ID is [currently set at Executor construction time|https://github.com/apache/spark/blob/0d1e67ee9b29b51bccfc8a319afe9f9b4581afc8/core/src/main/scala/org/apache/spark/executor/Executor.scala#L76-L79] (uncoincidentally, just before the `ExecutorSource` is registered), but this is after the `MetricsSystem` has been initialized (which [happens during `SparkEnv` construction|https://github.com/apache/spark/blob/0d1e67ee9b29b51bccfc8a319afe9f9b4581afc8/core/src/main/scala/org/apache/spark/SparkEnv.scala#L323-L332], which itself happens during `ExecutorBackend` construction, *before* `Executor` construction). I noticed this problem because I wasn't seeing any JVM metrics from my executors in a Graphite dashboard I've set up; turns out all the executors (and the driver) were namespacing their metrics under driver, and Graphite responds to such a situation by only taking the last value it receives for each metric within a configurable time window (e.g. 10s). I was seeing per-executor metrics, properly namespaced with each executor's ID, from `ExecutorSource`, which as I mentioned above is registered after the executor ID is set. I have a one-line fix for this that I will submit shortly. 
Executor ID should be set before MetricsSystem is created - Key: SPARK-5401 URL: https://issues.apache.org/jira/browse/SPARK-5401 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0 Reporter: Ryan Williams MetricsSystem construction [attempts to namespace metrics from each executor using that executor's ID|https://github.com/apache/spark/blob/0d1e67ee9b29b51bccfc8a319afe9f9b4581afc8/core/src/main/scala/org/apache/spark/metrics/MetricsSystem.scala#L131]. The ID is [currently set at Executor construction time|https://github.com/apache/spark/blob/0d1e67ee9b29b51bccfc8a319afe9f9b4581afc8/core/src/main/scala/org/apache/spark/executor/Executor.scala#L76-L79] (uncoincidentally, just before the {{ExecutorSource}} is registered), but this is after the {{MetricsSystem}} has been initialized (which [happens during {{SparkEnv}} construction|https://github.com/apache/spark/blob/0d1e67ee9b29b51bccfc8a319afe9f9b4581afc8/core/src/main/scala/org/apache/spark/SparkEnv.scala#L323-L332], which itself happens during {{ExecutorBackend}} construction, *before* {{Executor}} construction). I noticed this problem because I wasn't seeing any JVM metrics from my executors in a Graphite dashboard I've set up; turns out all the executors (and the driver) were namespacing their metrics under driver, and Graphite responds to such a situation by only taking the last value it receives for each metric within a configurable time window (e.g. 10s). I was seeing per-executor metrics, properly namespaced with each executor's ID, from {{ExecutorSource}}, which as I mentioned above is registered after the executor ID is set. I
[jira] [Created] (SPARK-5402) Log executor ID at executor-construction time
Ryan Williams created SPARK-5402: Summary: Log executor ID at executor-construction time Key: SPARK-5402 URL: https://issues.apache.org/jira/browse/SPARK-5402 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.2.0 Reporter: Ryan Williams Priority: Minor One stumbling block I've hit while debugging Spark-on-YARN jobs is that {{yarn logs}} presents each executor's stderr/stdout by container name, but I often need to find the logs for a specific executor ID; the executor ID isn't printed anywhere convenient in each executor's logs, afaict. I added a simple {{logInfo}} to {{Executor.scala}} locally and it's been useful, so I'd like to merge it upstream. PR forthcoming. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
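Editor's note: the change is just a construction-time log line tying the executor ID to the container whose stdout/stderr {{yarn logs}} shows; something along these lines (the exact wording in the PR may differ):
{code}
import org.slf4j.LoggerFactory

// Illustrative only: log the executor ID as soon as the executor is constructed,
// so it appears near the top of that container's log.
class ExecutorIdLogExample(executorId: String, executorHostname: String) {
  private val log = LoggerFactory.getLogger(getClass)
  log.info(s"Starting executor ID $executorId on host $executorHostname")
}
{code}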
[jira] [Resolved] (SPARK-5235) Determine serializability of SQLContext
[ https://issues.apache.org/jira/browse/SPARK-5235?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-5235. -- Resolution: Fixed Fix Version/s: 1.3.0 This was merged in https://github.com/apache/spark/commit/2fd7f72b6b0b24bec12331c7bbbcf6bfc265d2ec Determine serializability of SQLContext --- Key: SPARK-5235 URL: https://issues.apache.org/jira/browse/SPARK-5235 Project: Spark Issue Type: Sub-task Reporter: Alex Baretta Fix For: 1.3.0 The SQLConf field in SQLContext is neither Serializable nor transient. Here's the stack trace I get when running SQL queries against a Parquet file. {code} Exception in thread Thread-43 org.apache.spark.SparkException: Job aborted due to stage failure: Task not serializable: java.io.NotSerializableException: org.apache.spark.sql.SQLConf at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1195) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1184) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1183) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1183) at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitMissingTasks(DAGScheduler.scala:843) at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:779) at org.apache.spark.scheduler.DAGScheduler.handleJobSubmitted(DAGScheduler.scala:763) at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1364) at akka.actor.Actor$class.aroundReceive(Actor.scala:465) at org.apache.spark.scheduler.DAGSchedulerEventProcessActor.aroundReceive(DAGScheduler.scala:1356) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516) at akka.actor.ActorCell.invoke(ActorCell.scala:487) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238) at akka.dispatch.Mailbox.run(Mailbox.scala:220) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
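Editor's note: for context on the choice being deferred to 1.3, the two generic options for a field like {{SQLConf}} are to make the held object {{Serializable}} or to mark the field {{@transient}} and rebuild it per JVM. An illustrative sketch of both patterns, not the actual SQLContext code:
{code}
// Illustrative only -- two generic ways to keep a member from breaking closure serialization.
class Conf extends Serializable {                 // 1) make the held object Serializable
  var settings: Map[String, String] = Map.empty
}

class Context extends Serializable {
  val conf = new Conf
  // 2) mark non-serializable state @transient and rebuild it lazily in each JVM
  @transient lazy val scratch = new StringBuilder
}
{code}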
[jira] [Commented] (SPARK-4452) Shuffle data structures can starve others on the same thread for memory
[ https://issues.apache.org/jira/browse/SPARK-4452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14290874#comment-14290874 ] Sean Owen commented on SPARK-4452: -- Can this JIRA be resolved now that its children are resolved, or is the more to this one? Shuffle data structures can starve others on the same thread for memory Key: SPARK-4452 URL: https://issues.apache.org/jira/browse/SPARK-4452 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.0 Reporter: Tianshuo Deng Assignee: Tianshuo Deng Priority: Critical When an Aggregator is used with ExternalSorter in a task, spark will create many small files and could cause too many files open error during merging. Currently, ShuffleMemoryManager does not work well when there are 2 spillable objects in a thread, which are ExternalSorter and ExternalAppendOnlyMap(used by Aggregator) in this case. Here is an example: Due to the usage of mapside aggregation, ExternalAppendOnlyMap is created first to read the RDD. It may ask as much memory as it can, which is totalMem/numberOfThreads. Then later on when ExternalSorter is created in the same thread, the ShuffleMemoryManager could refuse to allocate more memory to it, since the memory is already given to the previous requested object(ExternalAppendOnlyMap). That causes the ExternalSorter keeps spilling small files(due to the lack of memory) I'm currently working on a PR to address these two issues. It will include following changes: 1. The ShuffleMemoryManager should not only track the memory usage for each thread, but also the object who holds the memory 2. The ShuffleMemoryManager should be able to trigger the spilling of a spillable object. In this way, if a new object in a thread is requesting memory, the old occupant could be evicted/spilled. Previously the spillable objects trigger spilling by themselves. So one may not trigger spilling even if another object in the same thread needs more memory. After this change The ShuffleMemoryManager could trigger the spilling of an object if it needs to. 3. Make the iterator of ExternalAppendOnlyMap spillable. Previously ExternalAppendOnlyMap returns an destructive iterator and can not be spilled after the iterator is returned. This should be changed so that even after the iterator is returned, the ShuffleMemoryManager can still spill it. Currently, I have a working branch in progress: https://github.com/tsdeng/spark/tree/enhance_memory_manager. Already made change 3 and have a prototype of change 1 and 2 to evict spillable from memory manager, still in progress. I will send a PR when it's done. Any feedback or thoughts on this change is highly appreciated ! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4452) Shuffle data structures can starve others on the same thread for memory
[ https://issues.apache.org/jira/browse/SPARK-4452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14290885#comment-14290885 ] Sandy Ryza commented on SPARK-4452: --- I think there's more to this one, the subtasks solved the most egregious issues, but shuffle data structures can still hog memory in detrimental ways described in some of the comments above. Shuffle data structures can starve others on the same thread for memory Key: SPARK-4452 URL: https://issues.apache.org/jira/browse/SPARK-4452 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.0 Reporter: Tianshuo Deng Assignee: Tianshuo Deng Priority: Critical When an Aggregator is used with ExternalSorter in a task, spark will create many small files and could cause too many files open error during merging. Currently, ShuffleMemoryManager does not work well when there are 2 spillable objects in a thread, which are ExternalSorter and ExternalAppendOnlyMap(used by Aggregator) in this case. Here is an example: Due to the usage of mapside aggregation, ExternalAppendOnlyMap is created first to read the RDD. It may ask as much memory as it can, which is totalMem/numberOfThreads. Then later on when ExternalSorter is created in the same thread, the ShuffleMemoryManager could refuse to allocate more memory to it, since the memory is already given to the previous requested object(ExternalAppendOnlyMap). That causes the ExternalSorter keeps spilling small files(due to the lack of memory) I'm currently working on a PR to address these two issues. It will include following changes: 1. The ShuffleMemoryManager should not only track the memory usage for each thread, but also the object who holds the memory 2. The ShuffleMemoryManager should be able to trigger the spilling of a spillable object. In this way, if a new object in a thread is requesting memory, the old occupant could be evicted/spilled. Previously the spillable objects trigger spilling by themselves. So one may not trigger spilling even if another object in the same thread needs more memory. After this change The ShuffleMemoryManager could trigger the spilling of an object if it needs to. 3. Make the iterator of ExternalAppendOnlyMap spillable. Previously ExternalAppendOnlyMap returns an destructive iterator and can not be spilled after the iterator is returned. This should be changed so that even after the iterator is returned, the ShuffleMemoryManager can still spill it. Currently, I have a working branch in progress: https://github.com/tsdeng/spark/tree/enhance_memory_manager. Already made change 3 and have a prototype of change 1 and 2 to evict spillable from memory manager, still in progress. I will send a PR when it's done. Any feedback or thoughts on this change is highly appreciated ! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3359) `sbt/sbt unidoc` doesn't work with Java 8
[ https://issues.apache.org/jira/browse/SPARK-3359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14290934#comment-14290934 ] Apache Spark commented on SPARK-3359: - User 'srowen' has created a pull request for this issue: https://github.com/apache/spark/pull/4193 `sbt/sbt unidoc` doesn't work with Java 8 - Key: SPARK-3359 URL: https://issues.apache.org/jira/browse/SPARK-3359 Project: Spark Issue Type: Bug Components: Documentation Affects Versions: 1.1.0 Reporter: Xiangrui Meng Priority: Minor It seems that Java 8 is stricter on JavaDoc. I got many error messages like {code} [error] /Users/meng/src/spark-mengxr/core/target/java/org/apache/hadoop/mapred/SparkHadoopMapRedUtil.java:2: error: modifier private not allowed here [error] private abstract interface SparkHadoopMapRedUtil { [error] ^ {code} This is minor because we can always use Java 6/7 to generate the doc. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4105) FAILED_TO_UNCOMPRESS(5) errors when fetching shuffle data with sort-based shuffle
[ https://issues.apache.org/jira/browse/SPARK-4105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14290961#comment-14290961 ] ding commented on SPARK-4105: - I hit this error when using pagerank(It cannot be consistent repro as I only hit once). I am not using the KryoSerializer but I am using the default serializer. The Spark code is get from chunk at 2015/1/19 which should be later than spark 1.2.0. 15/01/23 23:32:57 WARN scheduler.TaskSetManager: Lost task 347.0 in stage 9461.0 (TID 302687, sr213): FetchFailed(BlockManagerId(13, sr207, 49805), shuffleId=399, mapId=461, reduceId=347, message= org.apache.spark.shuffle.FetchFailedException: FAILED_TO_UNCOMPRESS(5) at org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$.org$apache$spark$shuffle$hash$BlockStoreShuffleFetcher$$unpackBlock$1(BlockStoreShuffleFetcher.scala:67) at org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$$anonfun$3.apply(BlockStoreShuffleFetcher.scala:83) at org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$$anonfun$3.apply(BlockStoreShuffleFetcher.scala:83) at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) at org.apache.spark.graphx.impl.VertexPartitionBaseOps.aggregateUsingIndex(VertexPartitionBaseOps.scala:207) at org.apache.spark.graphx.impl.VertexRDDImpl$$anonfun$5$$anonfun$apply$4.apply(VertexRDDImpl.scala:171) at org.apache.spark.graphx.impl.VertexRDDImpl$$anonfun$5$$anonfun$apply$4.apply(VertexRDDImpl.scala:171) at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) at org.apache.spark.graphx.impl.VertexRDDImpl$$anonfun$3.apply(VertexRDDImpl.scala:113) at org.apache.spark.graphx.impl.VertexRDDImpl$$anonfun$3.apply(VertexRDDImpl.scala:111) at org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:88) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:264) at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:65) at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) at org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:88) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:264) at org.apache.spark.rdd.RDD.iterator(RDD.scala:231) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:264) at org.apache.spark.rdd.RDD.iterator(RDD.scala:231) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) at org.apache.spark.scheduler.Task.run(Task.scala:64) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:192) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Caused by: java.io.IOException: FAILED_TO_UNCOMPRESS(5) at org.xerial.snappy.SnappyNative.throw_error(SnappyNative.java:84) at org.xerial.snappy.SnappyNative.rawUncompress(Native Method) at org.xerial.snappy.Snappy.rawUncompress(Snappy.java:444) at org.xerial.snappy.Snappy.uncompress(Snappy.java:480) at 
org.xerial.snappy.SnappyInputStream.readFully(SnappyInputStream.java:135) at org.xerial.snappy.SnappyInputStream.readHeader(SnappyInputStream.java:92) at org.xerial.snappy.SnappyInputStream.init(SnappyInputStream.java:58) at org.apache.spark.io.SnappyCompressionCodec.compressedInputStream(CompressionCodec.scala:143) at org.apache.spark.storage.BlockManager.wrapForCompression(BlockManager.scala:1165) at org.apache.spark.storage.ShuffleBlockFetcherIterator$$anonfun$4.apply(ShuffleBlockFetcherIterator.scala:300) at org.apache.spark.storage.ShuffleBlockFetcherIterator$$anonfun$4.apply(ShuffleBlockFetcherIterator.scala:299) at scala.util.Success$$anonfun$map$1.apply(Try.scala:206) at scala.util.Try$.apply(Try.scala:161) at scala.util.Success.map(Try.scala:206) at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:299) at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:53) FAILED_TO_UNCOMPRESS(5) errors when fetching shuffle data with sort-based shuffle - Key: SPARK-4105 URL:
[jira] [Commented] (SPARK-3489) support rdd.zip(rdd1, rdd2,...) with variable number of rdds as params
[ https://issues.apache.org/jira/browse/SPARK-3489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14290967#comment-14290967 ] Mohit Jaggi commented on SPARK-3489: pull request does exist here: https://github.com/apache/spark/pull/2429 use case example: https://github.com/AyasdiOpenSource/bigdf/blob/master/src/main/scala/com/ayasdi/bigdf/DFUtil.scala#L86 support rdd.zip(rdd1, rdd2,...) with variable number of rdds as params -- Key: SPARK-3489 URL: https://issues.apache.org/jira/browse/SPARK-3489 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.0.2 Reporter: Mohit Jaggi Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
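Editor's note: with the current API the same effect can be approximated for RDDs sharing one element type by folding pairwise {{zip}} calls; the linked PR takes a different route, but a small sketch shows the intended semantics:
{code}
import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD

object MultiZip {
  // Sketch of a variadic zip for RDDs of one element type: fold the extra RDDs in
  // with pairwise zip, accumulating each row as a Seq[T]. As with RDD.zip itself,
  // all RDDs must have the same number of partitions and elements per partition.
  def zipAll[T: ClassTag](first: RDD[T], rest: RDD[T]*): RDD[Seq[T]] =
    rest.foldLeft(first.map(Seq(_))) { (acc, next) =>
      acc.zip(next).map { case (row, elem) => row :+ elem }
    }
}

// Usage (assumes an existing SparkContext `sc`):
//   val a = sc.parallelize(1 to 4, 2); val b = a.map(_ * 10); val c = a.map(_ * 100)
//   MultiZip.zipAll(a, b, c).collect()   // Array(Seq(1,10,100), Seq(2,20,200), ...)
{code}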
[jira] [Resolved] (SPARK-4642) Documents about running-on-YARN needs update
[ https://issues.apache.org/jira/browse/SPARK-4642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-4642. -- Resolution: Fixed Fix Version/s: 1.2.1 1.1.2 1.3.0 Assignee: Masayoshi TSUZUKI Looks like this went in to master, and branch 1.2 / 1.1: https://github.com/apache/spark/commit/692f49378f7d384d5c9c5ab7451a1c1e66f91c50 Documents about running-on-YARN needs update Key: SPARK-4642 URL: https://issues.apache.org/jira/browse/SPARK-4642 Project: Spark Issue Type: Improvement Components: Documentation Affects Versions: 1.1.0 Reporter: Masayoshi TSUZUKI Assignee: Masayoshi TSUZUKI Priority: Minor Fix For: 1.3.0, 1.1.2, 1.2.1 Documents about running-on-YARN needs update There are some parameters missing in the document about running-on-YARN page. We need to add the descriptions about the following parameters: - spark.yarn.report.interval - spark.yarn.queue - spark.yarn.user.classpath.first - spark.yarn.scheduler.reporterThread.maxFailures And the description about this default parameter is not strictly accurate: - spark.yarn.submit.file.replication -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-5028) Add total received and processed records metrics to Streaming UI
[ https://issues.apache.org/jira/browse/SPARK-5028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-5028. -- Resolution: Fixed Fix Version/s: 1.3.0 Assignee: Saisai Shao Also one that was merged already: https://github.com/apache/spark/commit/fdc2aa4918fd4c510f04812b782cc0bfef9a2107 Add total received and processed records metrics to Streaming UI Key: SPARK-5028 URL: https://issues.apache.org/jira/browse/SPARK-5028 Project: Spark Issue Type: Improvement Components: Streaming Affects Versions: 1.2.0 Reporter: Saisai Shao Assignee: Saisai Shao Fix For: 1.3.0 Followed by SPARK-4537 to add total received records and total processed records in Streaming web ui. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-5235) Determine serializability of SQLContext
[ https://issues.apache.org/jira/browse/SPARK-5235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14290905#comment-14290905 ] Reynold Xin edited comment on SPARK-5235 at 1/25/15 1:00 AM: - Sean - this was not done. We merged a patch to make it serializable again, but for 1.3 we should decide whether we want it to be serializable for real. was (Author: rxin): Sean - this was not done. We merged a patch to make it serializable again, but for 1.3 we should decide whether we wanted to be serializable for real. Determine serializability of SQLContext --- Key: SPARK-5235 URL: https://issues.apache.org/jira/browse/SPARK-5235 Project: Spark Issue Type: Sub-task Reporter: Alex Baretta Fix For: 1.3.0 The SQLConf field in SQLContext is neither Serializable nor transient. Here's the stack trace I get when running SQL queries against a Parquet file. {code} Exception in thread Thread-43 org.apache.spark.SparkException: Job aborted due to stage failure: Task not serializable: java.io.NotSerializableException: org.apache.spark.sql.SQLConf at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1195) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1184) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1183) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1183) at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitMissingTasks(DAGScheduler.scala:843) at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:779) at org.apache.spark.scheduler.DAGScheduler.handleJobSubmitted(DAGScheduler.scala:763) at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1364) at akka.actor.Actor$class.aroundReceive(Actor.scala:465) at org.apache.spark.scheduler.DAGSchedulerEventProcessActor.aroundReceive(DAGScheduler.scala:1356) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516) at akka.actor.ActorCell.invoke(ActorCell.scala:487) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238) at akka.dispatch.Mailbox.run(Mailbox.scala:220) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
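For readers of this ticket, the underlying problem and the two usual ways around it can be sketched in a few lines of Scala. The {{MyConf}}/{{MyContext}} names below are made up and are not Spark's actual classes; which of the two options SQLContext should commit to is exactly the open question here.
{code}
// A hand-rolled illustration, not Spark code: a context that is pulled into a
// task closure fails to serialize if it holds a non-serializable, non-transient field.
class MyConf {                          // not Serializable, like SQLConf in the stack trace
  var shufflePartitions: Int = 200
}

class MyContext extends Serializable {
  // Option 1: mark the field @transient so it is dropped during serialization
  // (it must then be re-created or tolerated as null on the executor side).
  @transient private val conf: MyConf = new MyConf

  // Option 2 (alternative): make MyConf itself extend Serializable, so the
  // whole context can be shipped as-is.
}
{code}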
[jira] [Commented] (SPARK-3298) [SQL] registerAsTable / registerTempTable overwrites old tables
[ https://issues.apache.org/jira/browse/SPARK-3298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14290962#comment-14290962 ] Imran Rashid commented on SPARK-3298: - If {{allowOverwrite}} defaulted to {{true}}, wouldn't that be closer to keeping the existing behavior, but still allow someone to request the check if they wanted it? Maybe I don't properly understand the current behavior, but it seems like it will effectively uncache the existing table and create a new one (even if the uncaching is happening later by the context cleaner). [SQL] registerAsTable / registerTempTable overwrites old tables --- Key: SPARK-3298 URL: https://issues.apache.org/jira/browse/SPARK-3298 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.0.2 Reporter: Evan Chan Priority: Minor Labels: newbie At least in Spark 1.0.2, calling registerAsTable(a) when a had been registered before does not cause an error. However, there is no way to access the old table, even though it may be cached and taking up space. How about at least throwing an error? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
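A minimal sketch of the {{allowOverwrite}} idea being discussed, with the default of {{true}} Imran suggests so that today's behavior is preserved. The {{TempTableCatalog}} class and its fields below are hypothetical, not Spark's actual catalog code:
{code}
import scala.collection.mutable

class TempTableCatalog {
  private val tables = mutable.Map.empty[String, AnyRef]   // table name -> logical plan

  // Default allowOverwrite = true keeps the current silent-overwrite semantics;
  // passing false opts in to an error when the name is already registered.
  def registerTempTable(name: String, plan: AnyRef, allowOverwrite: Boolean = true): Unit = {
    if (!allowOverwrite && tables.contains(name)) {
      throw new IllegalArgumentException(s"Temporary table '$name' is already registered")
    }
    tables(name) = plan
  }
}
{code}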
[jira] [Resolved] (SPARK-4934) Connection key is hard to read
[ https://issues.apache.org/jira/browse/SPARK-4934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hong Shen resolved SPARK-4934. -- Resolution: Not a Problem Connection key is hard to read -- Key: SPARK-4934 URL: https://issues.apache.org/jira/browse/SPARK-4934 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.1.1 Reporter: Hong Shen When I run a big Spark job, the executors produce a lot of log messages like: 14/12/23 15:25:31 INFO network.ConnectionManager: key already cancelled ? sun.nio.ch.SelectionKeyImpl@52b0e278 java.nio.channels.CancelledKeyException at org.apache.spark.network.ConnectionManager.run(ConnectionManager.scala:310) at org.apache.spark.network.ConnectionManager$$anon$4.run(ConnectionManager.scala:139) It's hard to know which connection was cancelled. Maybe we can change it to logInfo("Connection already cancelled ? " + con.getRemoteAddress(), e) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-5038) Add explicit return type for all implicit functions
[ https://issues.apache.org/jira/browse/SPARK-5038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-5038. -- Resolution: Fixed Looks like this was merged in https://github.com/apache/spark/commit/c88a3d7fca20d36ee566d48e0cb91fe33a7a6d99 and https://github.com/apache/spark/commit/7749dd6c36a182478b20f4636734c8db0b7ddb00 Add explicit return type for all implicit functions --- Key: SPARK-5038 URL: https://issues.apache.org/jira/browse/SPARK-5038 Project: Spark Issue Type: Bug Reporter: Reynold Xin Assignee: Reynold Xin Priority: Critical Fix For: 1.3.0 As we learned in https://github.com/apache/spark/pull/3580, not explicitly typing implicit functions can lead to compiler bugs and potentially unexpected runtime behavior. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
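To illustrate the convention this ticket enforces (made-up names, not code from the linked commits): an implicit conversion with an inferred return type versus one with the return type declared explicitly.
{code}
import scala.language.implicitConversions

object ImplicitStyle {
  class RichInt(val self: Int) {
    def squared: Int = self * self
  }

  // Risky style: the public return type is whatever the compiler infers from
  // the body, so a change to the body can silently change the API and, per the
  // discussion in the linked PR, can trip compiler bugs.
  // implicit def toRichIntInferred(i: Int) = new RichInt(i)

  // Preferred style: the return type is an explicit part of the declaration.
  implicit def toRichInt(i: Int): RichInt = new RichInt(i)
}
{code}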
[jira] [Resolved] (SPARK-5074) Fix a non-deterministic test in org.apache.spark.scheduler.DAGSchedulerSuite
[ https://issues.apache.org/jira/browse/SPARK-5074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-5074. -- Resolution: Fixed Fix Version/s: 1.3.0 Assignee: Shixiong Zhu This was merged in https://github.com/apache/spark/commit/5c506cecb933b156b2f06a688ee08c4347bf0d47 Fix a non-deterministic test in org.apache.spark.scheduler.DAGSchedulerSuite Key: SPARK-5074 URL: https://issues.apache.org/jira/browse/SPARK-5074 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Shixiong Zhu Assignee: Shixiong Zhu Priority: Minor Labels: flaky-test Fix For: 1.3.0 fix the following non-deterministic test in org.apache.spark.scheduler.DAGSchedulerSuite {noformat} [info] DAGSchedulerSuite: [info] - [SPARK-3353] parent stage should have lower stage id *** FAILED *** (27 milliseconds) [info] 1 did not equal 2 (DAGSchedulerSuite.scala:242) [info] org.scalatest.exceptions.TestFailedException: [info] at org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:500) [info] at org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1555) [info] at org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:466) [info] at org.apache.spark.scheduler.DAGSchedulerSuite$$anonfun$2.apply$mcV$sp(DAGSchedulerSuite.scala:242) [info] at org.apache.spark.scheduler.DAGSchedulerSuite$$anonfun$2.apply(DAGSchedulerSuite.scala:239) [info] at org.apache.spark.scheduler.DAGSchedulerSuite$$anonfun$2.apply(DAGSchedulerSuite.scala:239) [info] at org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22) [info] at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) [info] at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) [info] at org.scalatest.Transformer.apply(Transformer.scala:22) [info] at org.scalatest.Transformer.apply(Transformer.scala:20) [info] at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166) [info] at org.scalatest.Suite$class.withFixture(Suite.scala:1122) [info] at org.scalatest.FunSuite.withFixture(FunSuite.scala:1555) [info] at org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163) [info] at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) [info] at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) [info] at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306) [info] at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175) [info] at org.apache.spark.scheduler.DAGSchedulerSuite.org$scalatest$BeforeAndAfter$$super$runTest(DAGSchedulerSuite.scala:60) {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-5131) A typo in configuration doc
[ https://issues.apache.org/jira/browse/SPARK-5131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-5131. -- Resolution: Fixed Fix Version/s: 1.3.0 Assignee: uncleGen The PR was resolved for master and 1.2: https://github.com/apache/spark/commit/39e333ec4350ddafe29ee0958c37eec07bec85df A typo in configuration doc --- Key: SPARK-5131 URL: https://issues.apache.org/jira/browse/SPARK-5131 Project: Spark Issue Type: Bug Reporter: uncleGen Assignee: uncleGen Priority: Minor Fix For: 1.3.0, 1.2.1 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-5235) Determine serializability of SQLContext
[ https://issues.apache.org/jira/browse/SPARK-5235?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin reopened SPARK-5235: Sean - this was not done. We merged a patch to make it serializable again, but for 1.3 we should decide whether we wanted to be serializable for real. Determine serializability of SQLContext --- Key: SPARK-5235 URL: https://issues.apache.org/jira/browse/SPARK-5235 Project: Spark Issue Type: Sub-task Reporter: Alex Baretta Fix For: 1.3.0 The SQLConf field in SQLContext is neither Serializable nor transient. Here's the stack trace I get when running SQL queries against a Parquet file. {code} Exception in thread Thread-43 org.apache.spark.SparkException: Job aborted due to stage failure: Task not serializable: java.io.NotSerializableException: org.apache.spark.sql.SQLConf at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1195) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1184) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1183) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1183) at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitMissingTasks(DAGScheduler.scala:843) at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:779) at org.apache.spark.scheduler.DAGScheduler.handleJobSubmitted(DAGScheduler.scala:763) at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1364) at akka.actor.Actor$class.aroundReceive(Actor.scala:465) at org.apache.spark.scheduler.DAGSchedulerEventProcessActor.aroundReceive(DAGScheduler.scala:1356) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516) at akka.actor.ActorCell.invoke(ActorCell.scala:487) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238) at akka.dispatch.Mailbox.run(Mailbox.scala:220) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5235) Determine serializability of SQLContext
[ https://issues.apache.org/jira/browse/SPARK-5235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14290921#comment-14290921 ] Sean Owen commented on SPARK-5235: -- Sounds good, keep it open. This particular change was merged, but this can track the broader question, yes. Determine serializability of SQLContext --- Key: SPARK-5235 URL: https://issues.apache.org/jira/browse/SPARK-5235 Project: Spark Issue Type: Sub-task Reporter: Alex Baretta Fix For: 1.3.0 The SQLConf field in SQLContext is neither Serializable nor transient. Here's the stack trace I get when running SQL queries against a Parquet file. {code} Exception in thread Thread-43 org.apache.spark.SparkException: Job aborted due to stage failure: Task not serializable: java.io.NotSerializableException: org.apache.spark.sql.SQLConf at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1195) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1184) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1183) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1183) at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitMissingTasks(DAGScheduler.scala:843) at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:779) at org.apache.spark.scheduler.DAGScheduler.handleJobSubmitted(DAGScheduler.scala:763) at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1364) at akka.actor.Actor$class.aroundReceive(Actor.scala:465) at org.apache.spark.scheduler.DAGSchedulerEventProcessActor.aroundReceive(DAGScheduler.scala:1356) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516) at akka.actor.ActorCell.invoke(ActorCell.scala:487) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238) at akka.dispatch.Mailbox.run(Mailbox.scala:220) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5400) Rename GaussianMixtureEM to GaussianMixture
Joseph K. Bradley created SPARK-5400: Summary: Rename GaussianMixtureEM to GaussianMixture Key: SPARK-5400 URL: https://issues.apache.org/jira/browse/SPARK-5400 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Priority: Minor GaussianMixtureEM is following the old naming convention of including the optimization algorithm name in the class title. We should probably rename it to GaussianMixture so that it can use other optimization algorithms in the future. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3185) SPARK launch on Hadoop 2 in EC2 throws Tachyon exception when Formatting JOURNAL_FOLDER
[ https://issues.apache.org/jira/browse/SPARK-3185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14290923#comment-14290923 ] Florian Verhein commented on SPARK-3185: Sure [~grzegorz-dubicki]. You need to build with the correct version profiles. See for example: https://github.com/florianverhein/spark-ec2/blob/packer/spark/init.sh https://github.com/florianverhein/spark-ec2/blob/packer/tachyon/init.sh Note that I'm using Hadoop 2.4.1 (which I install on the image). SPARK launch on Hadoop 2 in EC2 throws Tachyon exception when Formatting JOURNAL_FOLDER --- Key: SPARK-3185 URL: https://issues.apache.org/jira/browse/SPARK-3185 Project: Spark Issue Type: Bug Affects Versions: 1.0.2 Environment: Amazon Linux AMI [ec2-user@ip-172-30-1-145 ~]$ uname -a Linux ip-172-30-1-145 3.10.42-52.145.amzn1.x86_64 #1 SMP Tue Jun 10 23:46:43 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux https://aws.amazon.com/amazon-linux-ami/2014.03-release-notes/ The build I used (and MD5 verified): [ec2-user@ip-172-30-1-145 ~]$ wget http://supergsego.com/apache/spark/spark-1.0.2/spark-1.0.2-bin-hadoop2.tgz Reporter: Jeremy Chambers {code} org.apache.hadoop.ipc.RemoteException: Server IPC version 7 cannot communicate with client version 4 {code} When I launch SPARK 1.0.2 on Hadoop 2 in a new EC2 cluster, the above tachyon exception is thrown when Formatting JOURNAL_FOLDER. No exception occurs when I launch on Hadoop 1. Launch used: {code} ./spark-ec2 -k spark_cluster -i /home/ec2-user/kagi/spark_cluster.ppk --zone=us-east-1a --hadoop-major-version=2 --spot-price=0.0165 -s 3 launch sparkProd {code} {code} log snippet Formatting Tachyon Master @ ec2-54-80-49-244.compute-1.amazonaws.com Formatting JOURNAL_FOLDER: /root/tachyon/libexec/../journal/ Exception in thread main java.lang.RuntimeException: org.apache.hadoop.ipc.RemoteException: Server IPC version 7 cannot communicate with client version 4 at tachyon.util.CommonUtils.runtimeException(CommonUtils.java:246) at tachyon.UnderFileSystemHdfs.init(UnderFileSystemHdfs.java:73) at tachyon.UnderFileSystemHdfs.getClient(UnderFileSystemHdfs.java:53) at tachyon.UnderFileSystem.get(UnderFileSystem.java:53) at tachyon.Format.main(Format.java:54) Caused by: org.apache.hadoop.ipc.RemoteException: Server IPC version 7 cannot communicate with client version 4 at org.apache.hadoop.ipc.Client.call(Client.java:1070) at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:225) at com.sun.proxy.$Proxy1.getProtocolVersion(Unknown Source) at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:396) at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:379) at org.apache.hadoop.hdfs.DFSClient.createRPCNamenode(DFSClient.java:119) at org.apache.hadoop.hdfs.DFSClient.init(DFSClient.java:238) at org.apache.hadoop.hdfs.DFSClient.init(DFSClient.java:203) at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:89) at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1386) at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:66) at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1404) at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:254) at org.apache.hadoop.fs.Path.getFileSystem(Path.java:187) at tachyon.UnderFileSystemHdfs.init(UnderFileSystemHdfs.java:69) ... 
3 more Killed 0 processes Killed 0 processes ec2-54-167-219-159.compute-1.amazonaws.com: Killed 0 processes ec2-54-198-198-17.compute-1.amazonaws.com: Killed 0 processes ec2-54-166-36-0.compute-1.amazonaws.com: Killed 0 processes ---end snippet--- {code} *I don't have this problem when I launch without the --hadoop-major-version=2 (which defaults to Hadoop 1.x).* -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5400) Rename GaussianMixtureEM to GaussianMixture
[ https://issues.apache.org/jira/browse/SPARK-5400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14290924#comment-14290924 ] Joseph K. Bradley commented on SPARK-5400: -- [~mengxr] [~tgaloppo] What do you think? Rename GaussianMixtureEM to GaussianMixture --- Key: SPARK-5400 URL: https://issues.apache.org/jira/browse/SPARK-5400 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Priority: Minor GaussianMixtureEM is following the old naming convention of including the optimization algorithm name in the class title. We should probably rename it to GaussianMixture so that it can use other optimization algorithms in the future. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4267) Failing to launch jobs on Spark on YARN with Hadoop 2.5.0 or later
[ https://issues.apache.org/jira/browse/SPARK-4267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14290744#comment-14290744 ] Apache Spark commented on SPARK-4267: - User 'srowen' has created a pull request for this issue: https://github.com/apache/spark/pull/4188 Failing to launch jobs on Spark on YARN with Hadoop 2.5.0 or later -- Key: SPARK-4267 URL: https://issues.apache.org/jira/browse/SPARK-4267 Project: Spark Issue Type: Bug Reporter: Tsuyoshi OZAWA Currently we're trying Spark on YARN included in Hadoop 2.5.1. Hadoop 2.5 uses protobuf 2.5.0 so I compiled with protobuf 2.5.1 like this: {code} ./make-distribution.sh --name spark-1.1.1 --tgz -Pyarn -Dhadoop.version=2.5.1 -Dprotobuf.version=2.5.0 {code} Then Spark on YARN fails to launch jobs with NPE. {code} $ bin/spark-shell --master yarn-client scala sc.textFile(hdfs:///user/ozawa/wordcountInput20G).flatMap(line = line.split( )).map(word = (word, 1)).persist().reduceByKey((a, b) = a + b, 16).saveAsTextFile(hdfs:///user/ozawa/sparkWordcountOutNew2); java.lang.NullPointerException at org.apache.spark.SparkContext.defaultParallelism(SparkContext.scala:1284) at org.apache.spark.SparkContext.defaultMinPartitions(SparkContext.scala:1291) at org.apache.spark.SparkContext.textFile$default$2(SparkContext.scala:480) at $iwC$$iwC$$iwC$$iwC.init(console:13) at $iwC$$iwC$$iwC.init(console:18) at $iwC$$iwC.init(console:20) at $iwC.init(console:22) at init(console:24) at .init(console:28) at .clinit(console) at .init(console:7) at .clinit(console) at $print(console) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:789) at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1062) at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:615) at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:646) at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:610)
[jira] [Commented] (SPARK-4267) Failing to launch jobs on Spark on YARN with Hadoop 2.5.0 or later
[ https://issues.apache.org/jira/browse/SPARK-4267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14290742#comment-14290742 ] Sean Owen commented on SPARK-4267: -- The warning is from YARN, I believe, rather than Spark. Yeah maybe should be an error. Your info however points to the problem; I'm sure it's {{-Dnumbers="one two three"}}. {{Utils.splitCommandString}} strips quotes as it parses them, so it will turn it into {{-Dnumbers=one two three}}; the command then becomes {{java -Dnumbers=one two three ...}}, and this isn't valid. I suggest that {{Utils.splitCommandString}} not strip the quotes that it parses, so that the reconstructed command line is exactly like the original. It's just splitting, not interpreting the command. This also seems less surprising. PR coming to demonstrate. Failing to launch jobs on Spark on YARN with Hadoop 2.5.0 or later -- Key: SPARK-4267 URL: https://issues.apache.org/jira/browse/SPARK-4267 Project: Spark Issue Type: Bug Reporter: Tsuyoshi OZAWA Currently we're trying Spark on YARN included in Hadoop 2.5.1. Hadoop 2.5 uses protobuf 2.5.0 so I compiled with protobuf 2.5.1 like this: {code} ./make-distribution.sh --name spark-1.1.1 --tgz -Pyarn -Dhadoop.version=2.5.1 -Dprotobuf.version=2.5.0 {code} Then Spark on YARN fails to launch jobs with NPE. {code} $ bin/spark-shell --master yarn-client scala sc.textFile(hdfs:///user/ozawa/wordcountInput20G).flatMap(line = line.split( )).map(word = (word, 1)).persist().reduceByKey((a, b) = a + b, 16).saveAsTextFile(hdfs:///user/ozawa/sparkWordcountOutNew2); java.lang.NullPointerException at org.apache.spark.SparkContext.defaultParallelism(SparkContext.scala:1284) at org.apache.spark.SparkContext.defaultMinPartitions(SparkContext.scala:1291) at org.apache.spark.SparkContext.textFile$default$2(SparkContext.scala:480) at $iwC$$iwC$$iwC$$iwC.init(console:13) at $iwC$$iwC$$iwC.init(console:18) at $iwC$$iwC.init(console:20) at $iwC.init(console:22) at init(console:24) at .init(console:28) at .clinit(console) at .init(console:7) at .clinit(console) at $print(console) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:789) at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1062) at
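As a rough illustration of the behaviour Sean describes, here is a quote-preserving splitter in Scala. The {{splitPreservingQuotes}} name is hypothetical; this is not the actual {{Utils.splitCommandString}} nor the pending PR, just a sketch of the idea: split on unquoted whitespace but keep the quote characters, so that a value such as {{-Dnumbers="one two three"}} survives as a single intact token and rejoining the tokens preserves the original quoting.
{code}
import scala.collection.mutable.ArrayBuffer

object CommandSplit {
  // Split on unquoted whitespace, keeping quote characters in the output tokens.
  def splitPreservingQuotes(s: String): Seq[String] = {
    val tokens = ArrayBuffer.empty[String]
    val cur = new StringBuilder
    var quote: Option[Char] = None
    for (c <- s) {
      if (quote.contains(c)) { cur += c; quote = None }                       // closing quote, kept
      else if (quote.isEmpty && (c == '"' || c == '\'')) { cur += c; quote = Some(c) }
      else if (quote.isEmpty && c.isWhitespace) {
        if (cur.nonEmpty) { tokens += cur.toString; cur.clear() }             // token boundary
      } else cur += c                                                         // ordinary char, or whitespace inside quotes
    }
    if (cur.nonEmpty) tokens += cur.toString
    tokens.toSeq
  }
}

// splitPreservingQuotes("""java -Dnumbers="one two three" -Xmx1g""")
//   == Seq("java", "-Dnumbers=\"one two three\"", "-Xmx1g")
{code}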
[jira] [Resolved] (SPARK-2105) SparkUI doesn't remove active stages that failed
[ https://issues.apache.org/jira/browse/SPARK-2105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-2105. -- Resolution: Fixed Fix Version/s: 1.1.0 It appears this is considered fixed by that commit, for 1.1.0, and it has not been reproducible otherwise. SparkUI doesn't remove active stages that failed Key: SPARK-2105 URL: https://issues.apache.org/jira/browse/SPARK-2105 Project: Spark Issue Type: Bug Components: Spark Core, Web UI Affects Versions: 1.0.0 Reporter: Andrew Or Fix For: 1.1.0 If a stage fails because its tasks cannot be serialized, for instance, the failed stage remains in the Active Stages section forever. This is because the StageCompleted event is never posted. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5395) Large number of Python workers causing resource depletion
[ https://issues.apache.org/jira/browse/SPARK-5395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sven Krasser updated SPARK-5395: Description: During job execution a large number of Python worker accumulates eventually causing YARN to kill containers for being over their memory allocation (in the case below that is about 8G for executors plus 6G for overhead per container). In this instance, at the time of killing the container 97 pyspark.daemon processes had accumulated. {noformat} 2015-01-23 15:36:53,654 INFO [Reporter] yarn.YarnAllocationHandler (Logging.scala:logInfo(59)) - Container marked as failed: container_1421692415636_0052_01_30. Exit status: 143. Diagnostics: Container [pid=35211,containerID=container_1421692415636_0052_01_30] is running beyond physical memory limits. Current usage: 14.9 GB of 14.5 GB physical memory used; 41.3 GB of 72.5 GB virtual memory used. Killing container. Dump of the process-tree for container_1421692415636_0052_01_30 : |- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE |- 54101 36625 36625 35211 (python) 78 1 332730368 16834 python -m pyspark.daemon |- 52140 36625 36625 35211 (python) 58 1 332730368 16837 python -m pyspark.daemon |- 36625 35228 36625 35211 (python) 65 604 331685888 17694 python -m pyspark.daemon [...] {noformat} The configuration used uses 64 containers with 2 cores each. Full output here: https://gist.github.com/skrasser/e3e2ee8dede5ef6b082c Mailinglist discussion: https://www.mail-archive.com/user@spark.apache.org/msg20102.html was: During job execution a large number of Python worker accumulates eventually causing YARN to kill containers for being over their memory allocation (in the case below that is about 8G for executors plus 6G for overhead per container). In this instance, at the time of killing the container 97 pyspark.daemon processes had accumulated. {noformat} 2015-01-23 15:36:53,654 INFO [Reporter] yarn.YarnAllocationHandler (Logging.scala:logInfo(59)) - Container marked as failed: container_1421692415636_0052_01_30. Exit status: 143. Diagnostics: Container [pid=35211,containerID=container_1421692415636_0052_01_30] is running beyond physical memory limits. Current usage: 14.9 GB of 14.5 GB physical memory used; 41.3 GB of 72.5 GB virtual memory used. Killing container. Dump of the process-tree for container_1421692415636_0052_01_30 : |- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE |- 54101 36625 36625 35211 (python) 78 1 332730368 16834 python -m pyspark.daemon |- 52140 36625 36625 35211 (python) 58 1 332730368 16837 python -m pyspark.daemon |- 36625 35228 36625 35211 (python) 65 604 331685888 17694 python -m pyspark.daemon [...] {noformat} Full output here: https://gist.github.com/skrasser/e3e2ee8dede5ef6b082c Mailinglist discussion: https://www.mail-archive.com/user@spark.apache.org/msg20102.html Large number of Python workers causing resource depletion - Key: SPARK-5395 URL: https://issues.apache.org/jira/browse/SPARK-5395 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.2.0 Environment: AWS ElasticMapReduce Reporter: Sven Krasser During job execution a large number of Python worker accumulates eventually causing YARN to kill containers for being over their memory allocation (in the case below that is about 8G for executors plus 6G for overhead per container). 
In this instance, at the time of killing the container 97 pyspark.daemon processes had accumulated. {noformat} 2015-01-23 15:36:53,654 INFO [Reporter] yarn.YarnAllocationHandler (Logging.scala:logInfo(59)) - Container marked as failed: container_1421692415636_0052_01_30. Exit status: 143. Diagnostics: Container [pid=35211,containerID=container_1421692415636_0052_01_30] is running beyond physical memory limits. Current usage: 14.9 GB of 14.5 GB physical memory used; 41.3 GB of 72.5 GB virtual memory used. Killing container. Dump of the process-tree for container_1421692415636_0052_01_30 : |- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE |- 54101 36625 36625 35211 (python) 78 1 332730368 16834 python -m pyspark.daemon |- 52140 36625 36625 35211 (python) 58 1 332730368 16837 python -m pyspark.daemon |- 36625 35228 36625 35211 (python) 65 604 331685888 17694 python -m pyspark.daemon [...] {noformat} The configuration used uses 64 containers with 2 cores each. Full output here: https://gist.github.com/skrasser/e3e2ee8dede5ef6b082c Mailinglist discussion:
[jira] [Created] (SPARK-5395) Large number of Python workers causing resource depletion
Sven Krasser created SPARK-5395: --- Summary: Large number of Python workers causing resource depletion Key: SPARK-5395 URL: https://issues.apache.org/jira/browse/SPARK-5395 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.2.0 Environment: AWS ElasticMapReduce Reporter: Sven Krasser During job execution a large number of Python worker accumulates eventually causing YARN to kill containers for being over their memory allocation (in the case below that is about 8G for executors plus 6G for overhead per container). In this instance, at the time of killing the container 97 pyspark.daemon processes had accumulated. {noformat} 2015-01-23 15:36:53,654 INFO [Reporter] yarn.YarnAllocationHandler (Logging.scala:logInfo(59)) - Container marked as failed: container_1421692415636_0052_01_30. Exit status: 143. Diagnostics: Container [pid=35211,containerID=container_1421692415636_0052_01_30] is running beyond physical memory limits. Current usage: 14.9 GB of 14.5 GB physical memory used; 41.3 GB of 72.5 GB virtual memory used. Killing container. Dump of the process-tree for container_1421692415636_0052_01_30 : |- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE |- 54101 36625 36625 35211 (python) 78 1 332730368 16834 python -m pyspark.daemon |- 52140 36625 36625 35211 (python) 58 1 332730368 16837 python -m pyspark.daemon |- 36625 35228 36625 35211 (python) 65 604 331685888 17694 python -m pyspark.daemon [...] {noformat} Full output here: https://gist.github.com/skrasser/e3e2ee8dede5ef6b082c Mailinglist discussion: https://www.mail-archive.com/user@spark.apache.org/msg20102.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5226) Add DBSCAN Clustering Algorithm to MLlib
[ https://issues.apache.org/jira/browse/SPARK-5226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14290523#comment-14290523 ] Muhammad-Ali A'rabi commented on SPARK-5226: That's right. For very large data it won't be a good implementation. It is O(log n), actually. In the preprocessing phase we build a sorted map (or a similar structure), and given a radius we can retrieve all points within that distance in O(log n). If we use the first implementation, each region query has to calculate many distances, and some of them have surely been calculated before. We can have both ways implemented, and users can choose between them depending on their needs. We could also store vectors with their norms and use the norm as an upper bound, but I don't fully trust this method and would have to test it. Add DBSCAN Clustering Algorithm to MLlib Key: SPARK-5226 URL: https://issues.apache.org/jira/browse/SPARK-5226 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Muhammad-Ali A'rabi Priority: Minor Labels: DBSCAN MLlib is all k-means now, and I think we should add some new clustering algorithms to it. The first candidate, I think, is DBSCAN. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
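One way to read the "sorted map" and "vector with norm" ideas in the comment above is the classic pivot-pruning trick: sort points by their distance to a fixed pivot (e.g. the origin, i.e. the norm), then for a DBSCAN region query use binary search to find the band of candidates allowed by the triangle inequality and check exact distances only inside that band. The sketch below is purely illustrative (hypothetical {{RegionQuery}} object, not the implementation discussed in the ticket), and whether it really behaves like O(log n) depends on how many candidates fall into the band.
{code}
object RegionQuery {
  type Point = Array[Double]

  def dist(a: Point, b: Point): Double =
    math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)

  // Preprocessing: sort points by distance to a fixed pivot (here the origin,
  // so the key is simply each point's norm). Assumes `points` is non-empty.
  def index(points: Seq[Point]): Array[(Double, Point)] = {
    val origin = Array.fill(points.head.length)(0.0)
    points.map(p => (dist(p, origin), p)).sortBy(_._1).toArray
  }

  // Region query: by the triangle inequality, any point within eps of q has a
  // pivot distance in [d(q) - eps, d(q) + eps]. Binary search locates that band
  // in O(log n); only candidates in the band need exact distance checks.
  def regionQuery(idx: Array[(Double, Point)], q: Point, eps: Double): Seq[Point] = {
    val dq = dist(q, Array.fill(q.length)(0.0))
    var lo = 0
    var hi = idx.length
    while (lo < hi) {                       // lower bound of the band
      val mid = (lo + hi) / 2
      if (idx(mid)._1 < dq - eps) lo = mid + 1 else hi = mid
    }
    idx.drop(lo).takeWhile(_._1 <= dq + eps).map(_._2).filter(p => dist(p, q) <= eps)
  }
}
{code}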
[jira] [Updated] (SPARK-2285) Give various TaskEndReason subclass more descriptive names
[ https://issues.apache.org/jira/browse/SPARK-2285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-2285: --- Assignee: (was: Reynold Xin) Give various TaskEndReason subclass more descriptive names -- Key: SPARK-2285 URL: https://issues.apache.org/jira/browse/SPARK-2285 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.0.0 Reporter: Reynold Xin Priority: Minor It is just strange to have org.apache.spark.Success be a TaskEndReason. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-2285) Give various TaskEndReason subclass more descriptive names
[ https://issues.apache.org/jira/browse/SPARK-2285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-2285: --- Component/s: Spark Core Give various TaskEndReason subclass more descriptive names -- Key: SPARK-2285 URL: https://issues.apache.org/jira/browse/SPARK-2285 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.0.0 Reporter: Reynold Xin Priority: Minor It is just strange to have org.apache.spark.Success be a TaskEndReason. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2285) Give various TaskEndReason subclass more descriptive names
[ https://issues.apache.org/jira/browse/SPARK-2285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14290502#comment-14290502 ] Reynold Xin commented on SPARK-2285: Hey Sean - I was thinking TaskSuccess, TaskFailed, etc, would be much better than Success, since Success can mean a lot of things, without looking up the heritage. Give various TaskEndReason subclass more descriptive names -- Key: SPARK-2285 URL: https://issues.apache.org/jira/browse/SPARK-2285 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.0.0 Reporter: Reynold Xin Assignee: Reynold Xin Priority: Minor It is just strange to have org.apache.spark.Success be a TaskEndReason. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
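For readers of the thread, a sketch of what the proposed naming would look like. This is illustrative only; Spark's real {{TaskEndReason}} hierarchy has more members and carries more data than shown here.
{code}
// Today, plain `Success` is one of the TaskEndReason values, which reads oddly
// out of context. The proposal is to give the subclasses a Task- prefix.
sealed trait TaskEndReason
case object TaskSuccess extends TaskEndReason
case class TaskFailed(message: String) extends TaskEndReason
case object TaskKilled extends TaskEndReason
{code}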
[jira] [Created] (SPARK-5396) Syntax error in spark scripts on windows.
Vladimir Protsenko created SPARK-5396: - Summary: Syntax error in spark scripts on windows. Key: SPARK-5396 URL: https://issues.apache.org/jira/browse/SPARK-5396 Project: Spark Issue Type: Bug Components: Spark Shell Affects Versions: 1.2.0 Environment: Window 7 and Window 8.1. Reporter: Vladimir Protsenko I made the following steps: 1. downloaded and installed Scala 2.11.5 2. downloaded spark 1.2.0 by git clone git://github.com/apache/spark.git 3. run dev/change-version-to-2.11.sh and mvn -Dscala-2.11 -DskipTests clean package (in git bash) After installation tried to run spark-shell.cmd in cmd shell and it says there is a syntax error in file. The same with spark-shell2.cmd, spark-submit.cmd and spark-submit2.cmd. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-3471) Automatic resource manager for SparkContext in Scala?
[ https://issues.apache.org/jira/browse/SPARK-3471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-3471. -- Resolution: Not a Problem This is about adding some kind of try-with-resources equivalent for Scala? No, there isn't one. I know of the ARM library that provides this functionality: https://github.com/jsuereth/scala-arm In terms of what the Spark code has to do to enable resource management with {{SparkContext}}, there's nothing to do. It implements {{Closeable}} but even that is not necessary for this library to work. So it's something a user app could include if really desired. I don't think there is a change to Spark needed here. Automatic resource manager for SparkContext in Scala? - Key: SPARK-3471 URL: https://issues.apache.org/jira/browse/SPARK-3471 Project: Spark Issue Type: New Feature Components: Spark Core Affects Versions: 1.0.2 Reporter: Shay Rojansky Priority: Minor After discussion in SPARK-2972, it seems like a good idea to add automatic resource management semantics to SparkContext (i.e. with in Python (SPARK-3458), Closeable/AutoCloseable in Java (SPARK-3470)). I have no knowledge of Scala whatsoever, but a quick search seems to indicate that there isn't a standard mechanism for this - someone with real Scala knowledge should take a look and make a decision... -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
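For anyone who wants the try-with-resources feel without pulling in scala-arm, the loan pattern is a few lines of user code. A minimal sketch (the {{Using.withSparkContext}} helper below is hypothetical application code, not something Spark ships):
{code}
import org.apache.spark.{SparkConf, SparkContext}

object Using {
  // Minimal loan pattern: run f with a fresh context, always stopping it afterwards.
  def withSparkContext[A](conf: SparkConf)(f: SparkContext => A): A = {
    val sc = new SparkContext(conf)
    try f(sc) finally sc.stop()
  }
}

// Usage:
// Using.withSparkContext(new SparkConf().setAppName("example").setMaster("local[2]")) { sc =>
//   sc.parallelize(1 to 100).count()
// }
{code}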
[jira] [Updated] (SPARK-5383) support alias for udfs with multi output columns
[ https://issues.apache.org/jira/browse/SPARK-5383?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] wangfei updated SPARK-5383: --- Summary: support alias for udfs with multi output columns (was: Multi alias names support) support alias for udfs with multi output columns Key: SPARK-5383 URL: https://issues.apache.org/jira/browse/SPARK-5383 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.2.0 Reporter: wangfei now spark sql does not support multi alias names, The following sql failed in spark-sql: select key as (k1, k2), value as (v1, v2) from src limit 5 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5383) support alias for udfs with multi output columns
[ https://issues.apache.org/jira/browse/SPARK-5383?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] wangfei updated SPARK-5383: --- Description: when a udf output multi columns, now we can not use alias for them in spark-sql, see this flowing sql: select stack(1, key, value, key, value) as (a, b, c, d) from src limit 5; was: now spark sql does not support multi alias names, The following sql failed in spark-sql: select key as (k1, k2), value as (v1, v2) from src limit 5 support alias for udfs with multi output columns Key: SPARK-5383 URL: https://issues.apache.org/jira/browse/SPARK-5383 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.2.0 Reporter: wangfei when a udf output multi columns, now we can not use alias for them in spark-sql, see this flowing sql: select stack(1, key, value, key, value) as (a, b, c, d) from src limit 5; -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-3430) Introduce ValueIncrementableHashMapAccumulator to compute Histogram and other statistical metrics
[ https://issues.apache.org/jira/browse/SPARK-3430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-3430. -- Resolution: Won't Fix PR says this is WontFix Introduce ValueIncrementableHashMapAccumulator to compute Histogram and other statistical metrics - Key: SPARK-3430 URL: https://issues.apache.org/jira/browse/SPARK-3430 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Suraj Satishkumar Sheth Pull request : https://github.com/apache/spark/pull/2314 Currently, we don't have a hash map which can be used as an accumulator to produce a histogram or distribution. This class will provide a customized HashMap implementation whose values can be incremented, e.g. map += (a, 1), map += (a, 6) will lead to (a, 7). This can have various applications like computation of histograms, sampling strategy generation, statistical metric computation in MLlib, etc. Example usage: val map = sc.accumulableCollection(new ValueIncrementableHashMapAccumulator[Int]()) var countMap = sc.broadcast(map) data.foreach(record => { var valArray = record.split("\t") var valString = "" var i = 0 var tuple = (0, 1L) countMap.value += tuple for (valString <- valArray) { i = i + 1 try { valString.toDouble var tuple = (i, 1L) countMap.value += tuple } catch { case ioe: Exception => None } } }) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5396) Syntax error in spark scripts on windows.
[ https://issues.apache.org/jira/browse/SPARK-5396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vladimir Protsenko updated SPARK-5396: -- Attachment: windows8.1.png windows7.png Syntax error in spark scripts on windows. - Key: SPARK-5396 URL: https://issues.apache.org/jira/browse/SPARK-5396 Project: Spark Issue Type: Bug Components: Spark Shell Affects Versions: 1.2.0 Environment: Window 7 and Window 8.1. Reporter: Vladimir Protsenko Attachments: windows7.png, windows8.1.png I made the following steps: 1. downloaded and installed Scala 2.11.5 2. downloaded spark 1.2.0 by git clone git://github.com/apache/spark.git 3. run dev/change-version-to-2.11.sh and mvn -Dscala-2.11 -DskipTests clean package (in git bash) After installation tried to run spark-shell.cmd in cmd shell and it says there is a syntax error in file. The same with spark-shell2.cmd, spark-submit.cmd and spark-submit2.cmd. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5396) Syntax error in spark scripts on windows.
[ https://issues.apache.org/jira/browse/SPARK-5396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vladimir Protsenko updated SPARK-5396: -- Description: I made the following steps: 1. downloaded and installed Scala 2.11.5 2. downloaded spark 1.2.0 by git clone git://github.com/apache/spark.git 3. run dev/change-version-to-2.11.sh and mvn -Dscala-2.11 -DskipTests clean package (in git bash) After installation tried to run spark-shell.cmd in cmd shell and it says there is a syntax error in file. The same with spark-shell2.cmd, spark-submit.cmd and spark-submit2.cmd. !windows7.png! was: I made the following steps: 1. downloaded and installed Scala 2.11.5 2. downloaded spark 1.2.0 by git clone git://github.com/apache/spark.git 3. run dev/change-version-to-2.11.sh and mvn -Dscala-2.11 -DskipTests clean package (in git bash) After installation tried to run spark-shell.cmd in cmd shell and it says there is a syntax error in file. The same with spark-shell2.cmd, spark-submit.cmd and spark-submit2.cmd. Syntax error in spark scripts on windows. - Key: SPARK-5396 URL: https://issues.apache.org/jira/browse/SPARK-5396 Project: Spark Issue Type: Bug Components: Spark Shell Affects Versions: 1.2.0 Environment: Window 7 and Window 8.1. Reporter: Vladimir Protsenko Attachments: windows7.png, windows8.1.png I made the following steps: 1. downloaded and installed Scala 2.11.5 2. downloaded spark 1.2.0 by git clone git://github.com/apache/spark.git 3. run dev/change-version-to-2.11.sh and mvn -Dscala-2.11 -DskipTests clean package (in git bash) After installation tried to run spark-shell.cmd in cmd shell and it says there is a syntax error in file. The same with spark-shell2.cmd, spark-submit.cmd and spark-submit2.cmd. !windows7.png! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3489) support rdd.zip(rdd1, rdd2,...) with variable number of rdds as params
[ https://issues.apache.org/jira/browse/SPARK-3489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-3489: - Priority: Minor (was: Major) Target Version/s: (was: 1.2.0) This should be a pull request rather than diff pasted in comments. What's the use case for this vs two zips? support rdd.zip(rdd1, rdd2,...) with variable number of rdds as params -- Key: SPARK-3489 URL: https://issues.apache.org/jira/browse/SPARK-3489 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.0.2 Reporter: Mohit Jaggi Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-3621) Provide a way to broadcast an RDD (instead of just a variable made of the RDD) so that a job can access
[ https://issues.apache.org/jira/browse/SPARK-3621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-3621. -- Resolution: Not a Problem Given the discussion, this is best solved by reading data directly at the workers, rather than involving the driver, or it is already solvable by broadcasting values collected on the driver. It won't be possible to broadcast an RDD, in any event. Provide a way to broadcast an RDD (instead of just a variable made of the RDD) so that a job can access --- Key: SPARK-3621 URL: https://issues.apache.org/jira/browse/SPARK-3621 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.0.0, 1.1.0 Reporter: Xuefu Zhang In some cases, such as Hive's way of doing map-side join, it would be benefcial to allow client program to broadcast RDDs rather than just variables made of these RDDs. Broadcasting a variable made of RDDs requires all RDD data be collected to the driver and that the variable be shipped to the cluster after being made. It would be more performing if driver just broadcasts the RDDs and uses the corresponding data in jobs (such building hashmaps at executors). Tez has a broadcast edge which can ship data from previous stage to the next stage, which doesn't require driver side processing. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-3195) Can you add some statistics to do logistic regression better in mllib?
[ https://issues.apache.org/jira/browse/SPARK-3195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-3195. -- Resolution: Invalid Can you add some statistics to do logistic regression better in mllib? -- Key: SPARK-3195 URL: https://issues.apache.org/jira/browse/SPARK-3195 Project: Spark Issue Type: New Feature Components: MLlib Reporter: miumiu Priority: Minor Original Estimate: 1m Remaining Estimate: 1m Hi, in practical work with logistic regression models, tests of the regression coefficients and of the overall model fit are very important. Can you add some effective support for these aspects? For example, the likelihood ratio test or the Wald test is often used to test coefficients, and the Hosmer-Lemeshow test is used to evaluate model fit. I see that we already have ROC and Precision-Recall, but could you also provide the KS statistic, which is widely used for model evaluation? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-2442) Add a Hadoop Writable serializer
[ https://issues.apache.org/jira/browse/SPARK-2442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-2442. -- Resolution: Duplicate Add a Hadoop Writable serializer Key: SPARK-2442 URL: https://issues.apache.org/jira/browse/SPARK-2442 Project: Spark Issue Type: Bug Reporter: Hari Shreedharan Using data read from hadoop files in shuffles can cause exceptions with the following stacktrace: {code} java.io.NotSerializableException: org.apache.hadoop.io.BytesWritable at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1181) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1541) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1506) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1429) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1175) at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347) at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:42) at org.apache.spark.storage.DiskBlockObjectWriter.write(BlockObjectWriter.scala:179) at org.apache.spark.scheduler.ShuffleMapTask$$anonfun$runTask$1.apply(ShuffleMapTask.scala:161) at org.apache.spark.scheduler.ShuffleMapTask$$anonfun$runTask$1.apply(ShuffleMapTask.scala:158) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:158) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99) at org.apache.spark.scheduler.Task.run(Task.scala:51) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:183) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:679) {code} This though seems to go away if Kyro serializer is used. I am wondering if adding a Hadoop-writables friendly serializer makes sense as it is likely to perform better than Kyro without registration, since Writables don't implement Serializable - so the serialization might not be the most efficient. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5397) Assigning aliases to several return values of an UDF
Max created SPARK-5397: -- Summary: Assigning aliases to several return values of an UDF Key: SPARK-5397 URL: https://issues.apache.org/jira/browse/SPARK-5397 Project: Spark Issue Type: Bug Components: SQL Reporter: Max A query with the following syntax is not valid SQL in Spark because of the assignment of multiple aliases, so it does not seem possible to port existing HiveQL queries whose UDFs return multiple values to Spark SQL. Query SELECT my_function(param_one, param_two) AS (return_one, return_two, return_three) FROM my_table; Error Unsupported language features in query: SELECT my_function(param_one, param_two) AS (return_one, return_two, return_three) FROM my_table; TOK_QUERY TOK_FROM TOK_TABREF TOK_TABNAME my_table TOK_SELECT TOK_SELEXPR TOK_FUNCTION my_function TOK_TABLE_OR_COL param_one TOK_TABLE_OR_COL param_two return_one return_two return_three -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3852) Document spark.driver.extra* configs
[ https://issues.apache.org/jira/browse/SPARK-3852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14290597#comment-14290597 ] Apache Spark commented on SPARK-3852: - User 'srowen' has created a pull request for this issue: https://github.com/apache/spark/pull/4185 Document spark.driver.extra* configs Key: SPARK-3852 URL: https://issues.apache.org/jira/browse/SPARK-3852 Project: Spark Issue Type: Bug Components: Documentation Affects Versions: 1.1.0 Reporter: Andrew Or They are not documented... -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3859) Use consistent config names for duration (with units!)
[ https://issues.apache.org/jira/browse/SPARK-3859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14290600#comment-14290600 ] Sean Owen commented on SPARK-3859: -- I double-checked that all of the config properties that are expressed in time, like timeouts, durations, and rates, have their units documented in {{configuration.md}}. IMHO it's probably not worth adding 20 new properties and deprecating 20 and supporting both just to add the units to the property name. Use consistent config names for duration (with units!) -- Key: SPARK-3859 URL: https://issues.apache.org/jira/browse/SPARK-3859 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.1.0 Reporter: Andrew Or There are many configs in Spark that refer to some unit of time. However, at first glance it is unclear what these units are. We should find a consistent way to append the units to the end of these config names and deprecate the old ones in favor of the more consistent ones. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3875) Add TEMP DIRECTORY configuration
[ https://issues.apache.org/jira/browse/SPARK-3875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14290602#comment-14290602 ] Sean Owen commented on SPARK-3875: -- You can already set {{java.io.tmpdir}}, without making a new property, and that will control where all Java code puts its temp files. There is already {{spark.local.dir}}, which sounds like exactly what you're suggesting. This gets set to a big fast disk because it's where things like shuffle files go. Is the question here perhaps whether a few bits of code that don't use {{spark.local.dir}} should use it? Yes, it looks like {{java.io.tmpdir}} is still used by {{HttpBroadcast.scala}} and by dependency downloads. Add TEMP DIRECTORY configuration Key: SPARK-3875 URL: https://issues.apache.org/jira/browse/SPARK-3875 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.1.0 Reporter: Patrick Liu Currently, Spark uses java.io.tmpdir to find the /tmp/ directory. The /tmp/ directory is then used to 1. set up the HTTP file server, 2. hold the broadcast directory, and 3. fetch dependency files or jars on executors. The /tmp/ directory keeps growing, leaving less and less free space on the system disk. I think we could add a configuration, spark.tmp.dir, in conf/spark-env.sh or conf/spark-defaults.conf to set this particular directory - say, to a data disk. If spark.tmp.dir is not set, use the default java.io.tmpdir. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
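To make the existing knobs mentioned above concrete, a minimal sketch (the paths are placeholders; the same properties can equally go in conf/spark-defaults.conf):
{code:scala}
import org.apache.spark.SparkConf

// Point Spark's scratch space (shuffle files, etc.) at a large data disk instead of /tmp,
// and redirect java.io.tmpdir for the driver and executor JVMs as a whole.
val conf = new SparkConf()
  .set("spark.local.dir", "/data1/spark-tmp")
  .set("spark.driver.extraJavaOptions", "-Djava.io.tmpdir=/data1/tmp")
  .set("spark.executor.extraJavaOptions", "-Djava.io.tmpdir=/data1/tmp")
{code}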
[jira] [Commented] (SPARK-5383) support alias for udfs with multi output columns
[ https://issues.apache.org/jira/browse/SPARK-5383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14290630#comment-14290630 ] Apache Spark commented on SPARK-5383: - User 'scwf' has created a pull request for this issue: https://github.com/apache/spark/pull/4186 support alias for udfs with multi output columns Key: SPARK-5383 URL: https://issues.apache.org/jira/browse/SPARK-5383 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.2.0 Reporter: wangfei When a UDF outputs multiple columns, we currently cannot assign aliases to them in spark-sql; see the following SQL: select stack(1, key, value, key, value) as (a, b, c, d) from src limit 5; -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-4283) Spark source code does not correctly import into eclipse
[ https://issues.apache.org/jira/browse/SPARK-4283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-4283. -- Resolution: Won't Fix I suggest resolving this as WontFix since the Maven build is correct and supported, and we have instructions about how to successfully use Spark with Eclipse: https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark#ContributingtoSpark-Eclipse Spark source code does not correctly import into eclipse Key: SPARK-4283 URL: https://issues.apache.org/jira/browse/SPARK-4283 Project: Spark Issue Type: Bug Components: Build Reporter: Yang Yang Priority: Minor Attachments: spark_eclipse.diff When I import the Spark source into Eclipse, either by running mvn eclipse:eclipse and then importing existing general projects, or by importing existing Maven projects, Eclipse does not recognize the project as a Scala project. I am adding a new plugin so that the import works. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3439) Add Canopy Clustering Algorithm
[ https://issues.apache.org/jira/browse/SPARK-3439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14290634#comment-14290634 ] Muhammad-Ali A'rabi commented on SPARK-3439: Possible implementation: {code:scala}
import org.apache.spark.mllib.linalg._
import java.util.HashMap

val vas = Array(Array(1.0, 0, 0), Array(1.1, 0, 0), Array(0.9, 0, 0), Array(0, 1.0, 0), Array(0.01, 1.01, 0), Array(0, 0, 1.0), Array(0, 0, 1.1))
val vs = vas.map(Vectors.dense(_))
val t1 = 1.0
val t2 = 0.5
// starting canopy
val map = new HashMap[Vector, Vector] // map from data to clusters
val set = new HashMap[Vector, Boolean] // the set
for(v <- vs) set.put(v, true)
for(v <- vs) {
  if(set.get(v)) {
    val dists = vs.map{ x => (x, Vectors.sqdist(x, v)) }
    dists.foreach { case (x, d) =>
      if(d < t1) map.put(x, v)
      if(d < t2) set.put(x, false)
    }
  }
}
{code} The algorithm works with arrays and lists, but all of them could be converted to RDDs. Add Canopy Clustering Algorithm --- Key: SPARK-3439 URL: https://issues.apache.org/jira/browse/SPARK-3439 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Yu Ishikawa Assignee: Muhammad-Ali A'rabi Priority: Minor The canopy clustering algorithm is an unsupervised pre-clustering algorithm. It is often used as a preprocessing step for the K-means algorithm or the Hierarchical clustering algorithm. It is intended to speed up clustering operations on large data sets, where using another algorithm directly may be impractical due to the size of the data set. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
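As a rough illustration of the comment's closing point that the arrays could be converted to RDDs, the same idea could be expressed against an RDD roughly as follows. This is only an assumption about how the loop might translate, not a proposed MLlib API:
{code:scala}
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.rdd.RDD
import scala.collection.mutable.ArrayBuffer

// Sketch: repeatedly pick a remaining point as a canopy center, assign every
// point within t1 to that canopy, and drop points within t2 from further
// consideration as future centers.
def canopies(data: RDD[Vector], t1: Double, t2: Double): Seq[(Vector, Array[Vector])] = {
  val result = ArrayBuffer[(Vector, Array[Vector])]()
  var remaining = data.cache()
  while (remaining.count() > 0) {
    val center = remaining.first()
    val members = data.filter(v => Vectors.sqdist(v, center) < t1).collect()
    result += ((center, members))
    remaining = remaining.filter(v => Vectors.sqdist(v, center) >= t2).cache()
  }
  result.toSeq
}
{code}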
[jira] [Commented] (SPARK-3782) Direct use of log4j in AkkaUtils interferes with certain logging configurations
[ https://issues.apache.org/jira/browse/SPARK-3782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14290594#comment-14290594 ] Sean Owen commented on SPARK-3782: -- Aha, I think there's a good point here. Looks like this method was added to the log4j shim in slf4j 1.7.6: https://github.com/qos-ch/slf4j/commit/004b5d4879a079f3d6f610b7fe339a0fad7d4831 So it should be fine if you use log4j-over-slf4j 1.7.6+ in your app. Spark references slf4j 1.7.5 though. Although I don't think it will matter if you use a different version, we could update Spark's slf4j to 1.7.6 at least, to be really consistent. 1.7.10 is the latest in fact. Direct use of log4j in AkkaUtils interferes with certain logging configurations Key: SPARK-3782 URL: https://issues.apache.org/jira/browse/SPARK-3782 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Martin Gilday AkkaUtils is calling setLevel on Logger from log4j. This causes issues when using another implementation of SLF4J such as logback, as log4j-over-slf4j.jar's implementation of this class does not contain this method on Logger. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
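For application authors hitting this today, a sketch of the dependency setup implied above (the versions and the spark-core exclusions are assumptions; adjust to whatever is current for your build):
{code:scala}
// build.sbt (sketch): use the 1.7.6+ log4j shim, which has Logger.setLevel,
// and route logging through logback instead of log4j.
libraryDependencies ++= Seq(
  ("org.apache.spark" %% "spark-core" % "1.2.0")
    .exclude("log4j", "log4j")
    .exclude("org.slf4j", "slf4j-log4j12"),
  "org.slf4j" % "log4j-over-slf4j" % "1.7.10",
  "ch.qos.logback" % "logback-classic" % "1.1.2"
)
{code}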
[jira] [Commented] (SPARK-3782) Direct use of log4j in AkkaUtils interferes with certain logging configurations
[ https://issues.apache.org/jira/browse/SPARK-3782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14290595#comment-14290595 ] Apache Spark commented on SPARK-3782: - User 'srowen' has created a pull request for this issue: https://github.com/apache/spark/pull/4184 Direct use of log4j in AkkaUtils interferes with certain logging configurations Key: SPARK-3782 URL: https://issues.apache.org/jira/browse/SPARK-3782 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Martin Gilday AkkaUtils is calling setLevel on Logger from log4j. This causes issues when using another implementation of SLF4J such as logback, as log4j-over-slf4j.jar's implementation of this class does not contain this method on Logger. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-3148) Update global variables of HttpBroadcast so that multiple SparkContexts can coexist
[ https://issues.apache.org/jira/browse/SPARK-3148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-3148. -- Resolution: Won't Fix PR says this is WontFix Update global variables of HttpBroadcast so that multiple SparkContexts can coexist --- Key: SPARK-3148 URL: https://issues.apache.org/jira/browse/SPARK-3148 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: YanTang Zhai Priority: Minor Update global variables of HttpBroadcast so that multiple SparkContexts can coexist -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2348) In Windows having an environment variable named 'classpath' gives error
[ https://issues.apache.org/jira/browse/SPARK-2348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14290632#comment-14290632 ] Sean Owen commented on SPARK-2348: -- [~chiragtodarka] [~Xierqi] The resolution proposed here sounds like the one for SPARK-4161. It looks like a similar, parallel change in {{windows-utils.cmd}} might fix this? You can make a pull request on GitHub to propose the change rather than writing the diff here. In Windows having an environment variable named 'classpath' gives error --- Key: SPARK-2348 URL: https://issues.apache.org/jira/browse/SPARK-2348 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.0 Environment: Windows 7 Enterprise Reporter: Chirag Todarka Assignee: Chirag Todarka Priority: Critical Operating System: Windows 7 Enterprise. If an environment variable named 'classpath' is set, then starting 'spark-shell' gives the error below: mydir\spark\bin>spark-shell Failed to initialize compiler: object scala.runtime in compiler mirror not found. ** Note that as of 2.8 scala does not assume use of the java classpath. ** For the old behavior pass -usejavacp to scala, or if using a Settings ** object programatically, settings.usejavacp.value = true. 14/07/02 14:22:06 WARN SparkILoop$SparkILoopInterpreter: Warning: compiler accessed before init set up. Assuming no postInit code. Failed to initialize compiler: object scala.runtime in compiler mirror not found. ** Note that as of 2.8 scala does not assume use of the java classpath. ** For the old behavior pass -usejavacp to scala, or if using a Settings ** object programatically, settings.usejavacp.value = true. Exception in thread "main" java.lang.AssertionError: assertion failed: null at scala.Predef$.assert(Predef.scala:179) at org.apache.spark.repl.SparkIMain.initializeSynchronous(SparkIMain.scala:202) at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply$mcZ$sp(SparkILoop.scala:929) at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:884) at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:884) at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135) at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:884) at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:982) at org.apache.spark.repl.Main$.main(Main.scala:31) at org.apache.spark.repl.Main.main(Main.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source) at java.lang.reflect.Method.invoke(Unknown Source) at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:292) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:55) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5297) JavaStreamingContext.fileStream won't work because type info isn't propagated
[ https://issues.apache.org/jira/browse/SPARK-5297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-5297: - Summary: JavaStreamingContext.fileStream won't work because type info isn't propagated (was: File Streams do not work with custom key/values) JavaStreamingContext.fileStream won't work because type info isn't propagated - Key: SPARK-5297 URL: https://issues.apache.org/jira/browse/SPARK-5297 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.2.0 Reporter: Leonidas Fegaras Assignee: Saisai Shao Labels: backport-needed Fix For: 1.3.0 The following code: {code}
stream_context.<K,V,SequenceFileInputFormat<K,V>>fileStream(directory)
  .foreachRDD(new Function<JavaPairRDD<K,V>,Void>() {
    public Void call ( JavaPairRDD<K,V> rdd ) throws Exception {
      for ( Tuple2<K,V> x: rdd.collect() )
        System.out.println("# "+x._1+" "+x._2);
      return null;
    }
  });
stream_context.start();
stream_context.awaitTermination();
{code} for custom (serializable) classes K and V compiles fine but gives an error when I drop a new Hadoop sequence file in the directory: {quote} 15/01/17 09:13:59 ERROR scheduler.JobScheduler: Error generating jobs for time 1421507639000 ms java.lang.ClassCastException: java.lang.Object cannot be cast to org.apache.hadoop.mapreduce.InputFormat at org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:91) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:203) at org.apache.spark.streaming.dstream.FileInputDStream$$anonfun$3.apply(FileInputDStream.scala:236) at org.apache.spark.streaming.dstream.FileInputDStream$$anonfun$3.apply(FileInputDStream.scala:234) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:34) at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) at scala.collection.AbstractTraversable.map(Traversable.scala:105) at org.apache.spark.streaming.dstream.FileInputDStream.org$apache$spark$streaming$dstream$FileInputDStream$$filesToRDD(FileInputDStream.scala:234) at org.apache.spark.streaming.dstream.FileInputDStream.compute(FileInputDStream.scala:128) at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:296) at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:288) at scala.Option.orElse(Option.scala:257) {quote} The same classes K and V work fine for non-streaming Spark: {code} spark_context.newAPIHadoopFile(path,F.class,K.class,SequenceFileInputFormat.class,conf) {code} Also, streaming works fine for TextFileInputFormat. The issue is that class manifests are erased to Object in the Java file stream constructor, but those are relied on downstream when creating the Hadoop RDD that backs each batch of the file stream. 
https://github.com/apache/spark/blob/v1.2.0/streaming/src/main/scala/org/apache/spark/streaming/api/java/JavaStreamingContext.scala#L263 https://github.com/apache/spark/blob/v1.2.0/core/src/main/scala/org/apache/spark/SparkContext.scala#L753 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
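A sketch of the kind of fix this implies (an assumption about the shape of the change, not the actual patch): build real ClassTags from the Class objects a Java caller can pass, rather than letting them default to Object. The helper class below is hypothetical:
{code:scala}
import scala.reflect.ClassTag
import org.apache.hadoop.mapreduce.{InputFormat => NewInputFormat}
import org.apache.spark.streaming.StreamingContext

// Hypothetical helper: take Class arguments from a Java caller and turn them into
// ClassTags, so the Hadoop RDD behind each batch sees the actual K/V/F types.
class TypedFileStreams(ssc: StreamingContext) {
  def fileStream[K, V, F <: NewInputFormat[K, V]](
      directory: String,
      kClass: Class[K],
      vClass: Class[V],
      fClass: Class[F]) = {
    implicit val kt: ClassTag[K] = ClassTag(kClass)
    implicit val vt: ClassTag[V] = ClassTag(vClass)
    implicit val ft: ClassTag[F] = ClassTag(fClass)
    ssc.fileStream[K, V, F](directory)
  }
}
{code}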
[jira] [Comment Edited] (SPARK-3439) Add Canopy Clustering Algorithm
[ https://issues.apache.org/jira/browse/SPARK-3439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14290634#comment-14290634 ] Muhammad-Ali A'rabi edited comment on SPARK-3439 at 1/24/15 2:41 PM: - Possible implementation: {code:java}
import org.apache.spark.mllib.linalg._
import java.util.HashMap

val vas = Array(Array(1.0, 0, 0), Array(1.1, 0, 0), Array(0.9, 0, 0), Array(0, 1.0, 0), Array(0.01, 1.01, 0), Array(0, 0, 1.0), Array(0, 0, 1.1))
val vs = vas.map(Vectors.dense(_))
val t1 = 1.0
val t2 = 0.5
// starting canopy
val map = new HashMap[Vector, Vector] // map from data to clusters
val set = new HashMap[Vector, Boolean] // the set
for(v <- vs) set.put(v, true)
for(v <- vs) {
  if(set.get(v)) {
    val dists = vs.map{ x => (x, Vectors.sqdist(x, v)) }
    dists.foreach { case (x, d) =>
      if(d < t1) map.put(x, v)
      if(d < t2) set.put(x, false)
    }
  }
}
{code} The algorithm works with arrays and lists, but all of them could be converted to RDDs. was (Author: angellandros): Possible implementation: {code:scala}
import org.apache.spark.mllib.linalg._
import java.util.HashMap

val vas = Array(Array(1.0, 0, 0), Array(1.1, 0, 0), Array(0.9, 0, 0), Array(0, 1.0, 0), Array(0.01, 1.01, 0), Array(0, 0, 1.0), Array(0, 0, 1.1))
val vs = vas.map(Vectors.dense(_))
val t1 = 1.0
val t2 = 0.5
// starting canopy
val map = new HashMap[Vector, Vector] // map from data to clusters
val set = new HashMap[Vector, Boolean] // the set
for(v <- vs) set.put(v, true)
for(v <- vs) {
  if(set.get(v)) {
    val dists = vs.map{ x => (x, Vectors.sqdist(x, v)) }
    dists.foreach { case (x, d) =>
      if(d < t1) map.put(x, v)
      if(d < t2) set.put(x, false)
    }
  }
}
{code} The algorithm works with arrays and lists, but all of them could be converted to RDDs. Add Canopy Clustering Algorithm --- Key: SPARK-3439 URL: https://issues.apache.org/jira/browse/SPARK-3439 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Yu Ishikawa Assignee: Muhammad-Ali A'rabi Priority: Minor The canopy clustering algorithm is an unsupervised pre-clustering algorithm. It is often used as a preprocessing step for the K-means algorithm or the Hierarchical clustering algorithm. It is intended to speed up clustering operations on large data sets, where using another algorithm directly may be impractical due to the size of the data set. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3754) Spark Streaming fileSystem API is not callable from Java
[ https://issues.apache.org/jira/browse/SPARK-3754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14290637#comment-14290637 ] Sean Owen commented on SPARK-3754: -- Is this the same as the issue reported in https://issues.apache.org/jira/browse/SPARK-5297 for fileStream? Spark Streaming fileSystem API is not callable from Java Key: SPARK-3754 URL: https://issues.apache.org/jira/browse/SPARK-3754 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.0.0, 1.1.0 Reporter: holdenk Assignee: Holden Karau Priority: Critical The Spark Streaming Java API for fileSystem is not callable from Java. We should handle it along the lines of how it is handled in the Java Spark Context. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-4289) Creating an instance of Hadoop Job fails in the Spark shell when toString() is called on the instance.
[ https://issues.apache.org/jira/browse/SPARK-4289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-4289. -- Resolution: Not a Problem I suggest this is NotAProblem, at least not something I can see that Spark can address. I think that {{toString()}} failing is a minor Hadoop bug really. There's the {{:silent}} workaround. Creating an instance of Hadoop Job fails in the Spark shell when toString() is called on the instance. -- Key: SPARK-4289 URL: https://issues.apache.org/jira/browse/SPARK-4289 Project: Spark Issue Type: Bug Reporter: Corey J. Nolet This one is easy to reproduce: {code}val job = new Job(sc.hadoopConfiguration){code} I'm not sure offhand what the solution would be, as it's happening when the shell calls toString() on the instance of Job. The problem is that, because of the failure, the instance is never actually assigned to the job val. java.lang.IllegalStateException: Job in state DEFINE instead of RUNNING at org.apache.hadoop.mapreduce.Job.ensureState(Job.java:283) at org.apache.hadoop.mapreduce.Job.toString(Job.java:452) at scala.runtime.ScalaRunTime$.scala$runtime$ScalaRunTime$$inner$1(ScalaRunTime.scala:324) at scala.runtime.ScalaRunTime$.stringOf(ScalaRunTime.scala:329) at scala.runtime.ScalaRunTime$.replStringOf(ScalaRunTime.scala:337) at .<init>(<console>:10) at .<clinit>(<console>) at $print(<console>) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:789) at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1062) at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:615) at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:646) at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:610) at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:814) at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:859) at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:771) at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:616) at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:624) at org.apache.spark.repl.SparkILoop.loop(SparkILoop.scala:629) at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply$mcZ$sp(SparkILoop.scala:954) at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:902) at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:902) at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135) at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:902) at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:997) at org.apache.spark.repl.Main$.main(Main.scala:31) at org.apache.spark.repl.Main.main(Main.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:328) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) -- This message was sent by Atlassian JIRA 
(v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
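For reference, the {{:silent}} workaround mentioned above looks roughly like this in the shell ({{:silent}} toggles the REPL's automatic printing of results, so toString() is never attempted on the assigned value):
{code}
scala> import org.apache.hadoop.mapreduce.Job
scala> :silent
scala> val job = new Job(sc.hadoopConfiguration)
scala> // ... use job ...
scala> :silent
{code}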
[jira] [Commented] (SPARK-4368) Ceph integration?
[ https://issues.apache.org/jira/browse/SPARK-4368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14290644#comment-14290644 ] Sean Owen commented on SPARK-4368: -- I don't think Spark does anything in particular to support GlusterFS; the message you cite just says it works without any special support. I haven't heard Ceph come up. Are you suggesting there is some change that needs to be made to support it? If so, I think you should outline how big the change is. I think the suggestion recently has been that third-party integration projects belong outside the core project, though. Ceph integration? - Key: SPARK-4368 URL: https://issues.apache.org/jira/browse/SPARK-4368 Project: Spark Issue Type: Bug Components: Input/Output Reporter: Serge Smertin There is a use case of storing a large number of relatively small BLOB objects (2-20 MB), which requires some ugly workarounds in HDFS environments. There is a need to process those BLOBs close to the data themselves, which is why the MapReduce paradigm is a good fit, as it guarantees data locality. Ceph seems to be one of the systems that maintains both of these properties (small files and data locality) - http://lists.ceph.com/pipermail/ceph-users-ceph.com/2013-July/032119.html. I know already that Spark supports GlusterFS - http://mail-archives.apache.org/mod_mbox/spark-user/201404.mbox/%3ccf657f2b.5b3a1%25ven...@yarcdata.com%3E So I wonder: could there be an integration with this storage solution, and what would be the effort of doing that? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5309) Reduce Binary/String conversion overhead when reading/writing Parquet files
[ https://issues.apache.org/jira/browse/SPARK-5309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14290655#comment-14290655 ] Apache Spark commented on SPARK-5309: - User 'MickDavies' has created a pull request for this issue: https://github.com/apache/spark/pull/4187 Reduce Binary/String conversion overhead when reading/writing Parquet files --- Key: SPARK-5309 URL: https://issues.apache.org/jira/browse/SPARK-5309 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.2.0 Reporter: MIchael Davies Priority: Minor Converting between Parquet Binary and Java Strings can form a significant proportion of query times. For columns which have repeated String values (which is common), the same Binary is converted repeatedly. A simple change to cache the last converted String per column was shown to reduce query times by 25% when grouping on a data set of 66M rows on a column with many repeated Strings. A possible optimisation would be to hand responsibility for Binary encoding/decoding over to Parquet so that it could ensure that this was done only once per Binary value. The next step is to look at the Parquet code and discuss with that project, which I will do. More details are available in this discussion: http://apache-spark-developers-list.1001551.n3.nabble.com/Optimize-encoding-decoding-strings-when-using-Parquet-td10141.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
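To make the caching idea concrete, a minimal sketch (the class and method names are assumptions, not Spark's actual Parquet converter): remember the last byte array seen for a column and reuse the decoded String when the same value arrives again.
{code:scala}
import java.util.Arrays

// Hypothetical per-column converter: avoids re-decoding UTF-8 bytes when the
// same Binary value repeats, which is common for low-cardinality String columns.
class CachedStringConverter {
  private var lastBytes: Array[Byte] = null
  private var lastString: String = null

  def convert(bytes: Array[Byte]): String = {
    if (lastBytes != null && Arrays.equals(bytes, lastBytes)) {
      lastString
    } else {
      lastBytes = bytes.clone()            // copy, in case the caller reuses its buffer
      lastString = new String(bytes, "UTF-8")
      lastString
    }
  }
}
{code}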