[jira] [Commented] (SPARK-1551) Spark master does not build in sbt

2014-04-21 Thread holdenk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13975442#comment-13975442
 ] 

holdenk commented on SPARK-1551:


Sorry about that, I had something dirty locally that added the requirement for 
ganglia.

 Spark master does not build in sbt
 --

 Key: SPARK-1551
 URL: https://issues.apache.org/jira/browse/SPARK-1551
 Project: Spark
  Issue Type: Bug
Reporter: holdenk

 metrics-ganglia is missing



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-1552) GraphX performs type comparison incorrectly

2014-04-21 Thread Ankur Dave (JIRA)
Ankur Dave created SPARK-1552:
-

 Summary: GraphX performs type comparison incorrectly
 Key: SPARK-1552
 URL: https://issues.apache.org/jira/browse/SPARK-1552
 Project: Spark
  Issue Type: Bug
  Components: GraphX
Reporter: Ankur Dave


In GraphImpl, mapVertices and outerJoinVertices use a more efficient 
implementation when the map function preserves vertex attribute types. This is 
implemented by comparing the ClassTags of the old and new vertex attribute 
types. However, ClassTags store _erased_ types, so the comparison will return a 
false positive for types with different type parameters, such as Option[Int] 
and Option[Double].

Demo in the Scala shell:

scala> import scala.reflect.{classTag, ClassTag}
scala> def typesEqual[A: ClassTag, B: ClassTag](a: A, b: B): Boolean = 
classTag[A] equals classTag[B]
scala> typesEqual(Some(1), Some(2.0)) // should return false
res2: Boolean = true

We can require richer TypeTags for these methods, or just take a flag from the 
caller specifying whether the types are equal.
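
For comparison, here is a minimal sketch of the TypeTag-based check mentioned above ({{typesEqualFull}} is a hypothetical name, not part of any patch); unlike ClassTags, full TypeTags retain type parameters, so Option[Int] and Option[Double] compare as different:

{code}
// Hypothetical sketch: compare full (unerased) types via TypeTags.
import scala.reflect.runtime.universe.{typeTag, TypeTag}

def typesEqualFull[A: TypeTag, B: TypeTag](a: A, b: B): Boolean =
  typeTag[A].tpe =:= typeTag[B].tpe

typesEqualFull(Some(1), Some(2.0))  // false: Some[Int] vs Some[Double]
typesEqualFull(Some(1), Some(2))    // true
{code}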



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1552) GraphX performs type comparison incorrectly

2014-04-21 Thread Ankur Dave (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ankur Dave updated SPARK-1552:
--

Description: 
In GraphImpl, mapVertices and outerJoinVertices use a more efficient 
implementation when the map function preserves vertex attribute types. This is 
implemented by comparing the ClassTags of the old and new vertex attribute 
types. However, ClassTags store _erased_ types, so the comparison will return a 
false positive for types with different type parameters, such as Option[Int] 
and Option[Double].

Thanks to Pierre-Alexandre Fonta for reporting this bug on the [mailing 
list|http://apache-spark-user-list.1001560.n3.nabble.com/GraphX-Cast-error-when-comparing-a-vertex-attribute-after-its-type-has-changed-td4119.html].

Demo in the Scala shell:

scala> import scala.reflect.{classTag, ClassTag}
scala> def typesEqual[A: ClassTag, B: ClassTag](a: A, b: B): Boolean = 
classTag[A] equals classTag[B]
scala> typesEqual(Some(1), Some(2.0)) // should return false
res2: Boolean = true

We can require richer TypeTags for these methods, or just take a flag from the 
caller specifying whether the types are equal.

  was:
In GraphImpl, mapVertices and outerJoinVertices use a more efficient 
implementation when the map function preserves vertex attribute types. This is 
implemented by comparing the ClassTags of the old and new vertex attribute 
types. However, ClassTags store _erased_ types, so the comparison will return a 
false positive for types with different type parameters, such as Option[Int] 
and Option[Double].

Demo in the Scala shell:

scala> import scala.reflect.{classTag, ClassTag}
scala> def typesEqual[A: ClassTag, B: ClassTag](a: A, b: B): Boolean = 
classTag[A] equals classTag[B]
scala> typesEqual(Some(1), Some(2.0)) // should return false
res2: Boolean = true

We can require richer TypeTags for these methods, or just take a flag from the 
caller specifying whether the types are equal.


 GraphX performs type comparison incorrectly
 ---

 Key: SPARK-1552
 URL: https://issues.apache.org/jira/browse/SPARK-1552
 Project: Spark
  Issue Type: Bug
  Components: GraphX
Reporter: Ankur Dave

 In GraphImpl, mapVertices and outerJoinVertices use a more efficient 
 implementation when the map function preserves vertex attribute types. This 
 is implemented by comparing the ClassTags of the old and new vertex attribute 
 types. However, ClassTags store _erased_ types, so the comparison will return 
 a false positive for types with different type parameters, such as 
 Option[Int] and Option[Double].
 Thanks to Pierre-Alexandre Fonta for reporting this bug on the [mailing 
 list|http://apache-spark-user-list.1001560.n3.nabble.com/GraphX-Cast-error-when-comparing-a-vertex-attribute-after-its-type-has-changed-td4119.html].
 Demo in the Scala shell:
 scala> import scala.reflect.{classTag, ClassTag}
 scala> def typesEqual[A: ClassTag, B: ClassTag](a: A, b: B): Boolean = 
 classTag[A] equals classTag[B]
 scala> typesEqual(Some(1), Some(2.0)) // should return false
 res2: Boolean = true
 We can require richer TypeTags for these methods, or just take a flag from 
 the caller specifying whether the types are equal.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1438) Update RDD.sample() API to make seed parameter optional

2014-04-21 Thread Arun Ramakrishnan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13975486#comment-13975486
 ] 

Arun Ramakrishnan commented on SPARK-1438:
--

pull request at https://github.com/apache/spark/pull/462

 Update RDD.sample() API to make seed parameter optional
 ---

 Key: SPARK-1438
 URL: https://issues.apache.org/jira/browse/SPARK-1438
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Matei Zaharia
Priority: Blocker
  Labels: Starter
 Fix For: 1.0.0


 When a seed is not given, it should pick one based on Math.random().
 This needs to be done in Java and Python as well.
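
A minimal sketch of the idea, with hypothetical names (the real change is in the pull request above and also needs to cover the Java and Python APIs):

{code}
// Sketch only: give the seed a default so callers may omit it. The JIRA suggests
// Math.random(); any random source works for illustration.
def sample[T](data: Seq[T],
              fraction: Double,
              seed: Long = scala.util.Random.nextLong()): Seq[T] = {
  val rng = new scala.util.Random(seed)
  data.filter(_ => rng.nextDouble() < fraction)  // Bernoulli sampling, no replacement
}

sample(Seq(1, 2, 3, 4, 5), 0.4, seed = 42L)  // explicit seed, reproducible
sample(Seq(1, 2, 3, 4, 5), 0.4)              // seed picked automatically
{code}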



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Assigned] (SPARK-1490) Add kerberos support to the HistoryServer

2014-04-21 Thread Thomas Graves (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves reassigned SPARK-1490:


Assignee: Thomas Graves

 Add kerberos support to the HistoryServer
 -

 Key: SPARK-1490
 URL: https://issues.apache.org/jira/browse/SPARK-1490
 Project: Spark
  Issue Type: Improvement
  Components: Web UI
Affects Versions: 1.0.0
Reporter: Thomas Graves
Assignee: Thomas Graves

 Now that we have a history server that works on YARN and Mesos, we should add 
 the ability for it to authenticate via Kerberos so that it can read HDFS 
 files without having to be restarted every 24 hours. 
 One solution to this is to have the history server read a keytab file.  The 
 Hadoop UserGroupInformation class has that functionality built in, and as long 
 as it's using RPC to talk to HDFS it will automatically re-login when it needs 
 to.   If the history server isn't using RPC to talk to HDFS then we would 
 have to add some functionality to re-login approximately every 24 hours 
 (a configurable time).
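
A minimal sketch of the keytab approach described above, assuming hypothetical helper and argument names (the real change would read these from the history server's configuration):

{code}
// Sketch: log the history server in from a keytab; for RPC-based HDFS access,
// UserGroupInformation re-logs in automatically when the ticket expires.
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.security.UserGroupInformation

def loginFromKeytab(principal: String, keytabPath: String): Unit = {
  UserGroupInformation.setConfiguration(new Configuration())
  UserGroupInformation.loginUserFromKeytab(principal, keytabPath)
}

// e.g. loginFromKeytab("historyserver/host@EXAMPLE.COM", "/etc/security/keytabs/history.keytab")
{code}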



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1472) Go through YARN api used in Spark to make sure we aren't using Private Apis

2014-04-21 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13975609#comment-13975609
 ] 

Thomas Graves commented on SPARK-1472:
--

So it looks like it's currently impossible to use all public interfaces with 
Hadoop. There are some that are LimitedPrivate that we will have to use.

I filed several jiras in Hadoop land to add public interfaces for various things 
that we either need or that would be handy for all types of applications: 
https://issues.apache.org/jira/browse/YARN-1953
I should file a couple more jiras as well, since things like UserGroupInformation 
are also marked LimitedPrivate.

In this jira I will clean up as much as possible, mostly in the yarn stable 
code, since that is where the APIs changed scope.

 Go through YARN api used in Spark to make sure we aren't using Private Apis
 ---

 Key: SPARK-1472
 URL: https://issues.apache.org/jira/browse/SPARK-1472
 Project: Spark
  Issue Type: Task
  Components: YARN
Affects Versions: 1.0.0
Reporter: Thomas Graves
Assignee: Thomas Graves

 We need to look through all the YARN APIs we are using to make sure they 
 aren't now Private.  If they are private, change the code to not use those APIs.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Assigned] (SPARK-1439) Aggregate Scaladocs across projects

2014-04-21 Thread Matei Zaharia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matei Zaharia reassigned SPARK-1439:


Assignee: Matei Zaharia

 Aggregate Scaladocs across projects
 ---

 Key: SPARK-1439
 URL: https://issues.apache.org/jira/browse/SPARK-1439
 Project: Spark
  Issue Type: Sub-task
  Components: Documentation
Reporter: Matei Zaharia
Assignee: Matei Zaharia
 Fix For: 1.0.0


 Apparently there's a Unidoc plugin to put together ScalaDocs across 
 modules: https://github.com/akka/akka/blob/master/project/Unidoc.scala



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Assigned] (SPARK-1440) Generate JavaDoc instead of ScalaDoc for Java API

2014-04-21 Thread Matei Zaharia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matei Zaharia reassigned SPARK-1440:


Assignee: Matei Zaharia

 Generate JavaDoc instead of ScalaDoc for Java API
 -

 Key: SPARK-1440
 URL: https://issues.apache.org/jira/browse/SPARK-1440
 Project: Spark
  Issue Type: Sub-task
  Components: Documentation
Reporter: Matei Zaharia
Assignee: Matei Zaharia
 Fix For: 1.0.0


 It may be possible to use this plugin:  
 https://github.com/typesafehub/genjavadoc



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-1554) Update doc overview page to not mention building if you get a pre-built distro

2014-04-21 Thread Matei Zaharia (JIRA)
Matei Zaharia created SPARK-1554:


 Summary: Update doc overview page to not mention building if you 
get a pre-built distro
 Key: SPARK-1554
 URL: https://issues.apache.org/jira/browse/SPARK-1554
 Project: Spark
  Issue Type: Sub-task
Reporter: Matei Zaharia


SBT assembly takes a long time and we should tell people to skip it if they got 
a binary build (which will likely be the most common case).



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (SPARK-1202) Add a cancel button in the UI for stages

2014-04-21 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-1202.


Resolution: Fixed

 Add a cancel button in the UI for stages
 --

 Key: SPARK-1202
 URL: https://issues.apache.org/jira/browse/SPARK-1202
 Project: Spark
  Issue Type: New Feature
  Components: Web UI
Reporter: Patrick Wendell
Assignee: Sundeep Narravula
Priority: Critical
 Fix For: 1.0.0


 Seems like this would be really useful for people. It's not that hard; we 
 just need to look up the jobs associated with the stage and kill them. Might 
 involve exposing some additional APIs in SparkContext.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (SPARK-918) hadoop-client dependency should be explained for Scala in addition to Java in quickstart

2014-04-21 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-918.
---

Resolution: Won't Fix

This was fixed as a result of a separate refactoring of the docs.

 hadoop-client dependency should be explained for Scala in addition to Java in 
 quickstart
 

 Key: SPARK-918
 URL: https://issues.apache.org/jira/browse/SPARK-918
 Project: Spark
  Issue Type: Bug
  Components: Documentation
Reporter: Patrick Wendell
  Labels: starter
 Fix For: 1.0.0






--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1529) Support setting spark.local.dirs to a hadoop FileSystem

2014-04-21 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13975914#comment-13975914
 ] 

Cheng Lian commented on SPARK-1529:
---

After some investigation, I came to the conclusion that, unlike adding Tachyon 
support, allowing {{spark.local.dir}} to be set to a Hadoop FS location requires 
refactoring the related local FS access code to leverage HDFS interfaces, rather 
than adding something like {{HDFSBlockManager}} / {{HDFSStore}}. And it seems 
hard to make this change incremental. Besides writing shuffle map output, at 
least two places reference {{spark.local.dir}}:

# HTTP broadcasting uses {{spark.local.dir}} as its resource root, and accesses 
the local FS with {{java.io.File}}
# {{FileServerHandler}} accesses {{spark.local.dir}} via {{DiskBlockManager}} 
and reads local files with {{FileSegment}} and {{java.io.File}}

Adding a new block manager / store for HDFS can't fix these places. I'm currently 
working on this issue by:

# Refactoring {{FileSegment.file}} from {{java.io.File}} to 
{{org.apache.hadoop.fs.Path}},
# Refactoring {{DiskBlockManager}}, {{DiskStore}}, {{HttpBroadcast}} and 
{{FileServerHandler}} to leverage HDFS interfaces.

Please leave comments if I missed anything or if there are simpler ways to 
work around this.

(PS: We should definitely refactor the block manager related code to reduce 
duplication and encapsulate more details. Maybe the public interface of the 
block manager should only communicate with other components via block IDs and 
storage levels.)
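
A rough sketch of the first refactoring step above (class and method names here are illustrative stand-ins, not the actual Spark code): a segment that holds an org.apache.hadoop.fs.Path and is read through the Hadoop FileSystem API, which covers local files, HDFS, and MapR FS alike.

{code}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Stand-in for FileSegment with a Hadoop Path instead of java.io.File.
case class HadoopFileSegment(path: Path, offset: Long, length: Long)

def open(segment: HadoopFileSegment, conf: Configuration): java.io.InputStream = {
  val fs: FileSystem = segment.path.getFileSystem(conf)  // file://, hdfs://, maprfs://, ...
  val in = fs.open(segment.path)
  in.seek(segment.offset)
  in
}
{code}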

 Support setting spark.local.dirs to a hadoop FileSystem 
 

 Key: SPARK-1529
 URL: https://issues.apache.org/jira/browse/SPARK-1529
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: Patrick Wendell
Assignee: Cheng Lian
 Fix For: 1.1.0


 In some environments, like with MapR, local volumes are accessed through the 
 Hadoop filesystem interface. We should allow setting spark.local.dir to a 
 Hadoop filesystem location. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Issue Comment Deleted] (SPARK-1438) Update RDD.sample() API to make seed parameter optional

2014-04-21 Thread Kevin Tham (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kevin Tham updated SPARK-1438:
--

Comment: was deleted

(was: I can work on this (I'd like to try to submit my first Spark contribution 
:-) ))

 Update RDD.sample() API to make seed parameter optional
 ---

 Key: SPARK-1438
 URL: https://issues.apache.org/jira/browse/SPARK-1438
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Matei Zaharia
Priority: Blocker
  Labels: Starter
 Fix For: 1.0.0


 When a seed is not given, it should pick one based on Math.random().
 This needs to be done in Java and Python as well.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1529) Support setting spark.local.dirs to a hadoop FileSystem

2014-04-21 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13975947#comment-13975947
 ] 

Patrick Wendell commented on SPARK-1529:


[~liancheng] Hey Cheng, the tricky thing here is that we want to avoid _always_ 
going through the HDFS filesystem interface when people are actually using local 
files. We might need to add an intermediate abstraction to deal with this. We 
already do this elsewhere in the code base; for instance, the JobLogger will 
load an output stream either directly from a file or from a Hadoop file.

One thing to note is that the requirement here is really only for the shuffle 
files, not for the other uses. But I realize we currently conflate these inside 
of Spark, so that might not buy us much. I'll look into this a bit more later.
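
A hypothetical sketch of such an intermediate abstraction (all names invented here): dispatch on the URI scheme, using a plain local stream for file:// paths and the Hadoop FileSystem API otherwise, similar in spirit to what JobLogger does today.

{code}
import java.io.{FileInputStream, InputStream}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

def openLocalDirFile(uri: String, conf: Configuration): InputStream = {
  val path = new Path(uri)
  val scheme = Option(path.toUri.getScheme).getOrElse("file")
  if (scheme == "file") new FileInputStream(path.toUri.getPath) // fast local path
  else path.getFileSystem(conf).open(path)                      // HDFS, MapR FS, ...
}
{code}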

 Support setting spark.local.dirs to a hadoop FileSystem 
 

 Key: SPARK-1529
 URL: https://issues.apache.org/jira/browse/SPARK-1529
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: Patrick Wendell
Assignee: Cheng Lian
 Fix For: 1.1.0


 In some environments, like with MapR, local volumes are accessed through the 
 Hadoop filesystem interface. We should allow setting spark.local.dir to a 
 Hadoop filesystem location. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-1556) jets3t dependency is outdated

2014-04-21 Thread Nan Zhu (JIRA)
Nan Zhu created SPARK-1556:
--

 Summary: jets3t dependency is outdated
 Key: SPARK-1556
 URL: https://issues.apache.org/jira/browse/SPARK-1556
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 0.9.0, 0.8.1, 1.0.0
Reporter: Nan Zhu
Assignee: Nan Zhu
 Fix For: 1.0.0


In Hadoop 2.2.x or newer, jets3t 0.9.0, which defines 
S3ServiceException/ServiceException, is introduced; however, Spark still relies 
on jets3t 0.7.x, which has no definition of these classes.

What I hit is the following:

[code]

14/04/21 19:30:53 INFO deprecation: mapred.job.id is deprecated. Instead, use 
mapreduce.job.id
14/04/21 19:30:53 INFO deprecation: mapred.tip.id is deprecated. Instead, use 
mapreduce.task.id
14/04/21 19:30:53 INFO deprecation: mapred.task.id is deprecated. Instead, use 
mapreduce.task.attempt.id
14/04/21 19:30:53 INFO deprecation: mapred.task.is.map is deprecated. Instead, 
use mapreduce.task.ismap
14/04/21 19:30:53 INFO deprecation: mapred.task.partition is deprecated. 
Instead, use mapreduce.task.partition
java.lang.NoClassDefFoundError: org/jets3t/service/S3ServiceException
at 
org.apache.hadoop.fs.s3native.NativeS3FileSystem.createDefaultStore(NativeS3FileSystem.java:280)
at 
org.apache.hadoop.fs.s3native.NativeS3FileSystem.initialize(NativeS3FileSystem.java:270)
at 
org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2316)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:90)
at 
org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2350)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2332)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:369)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
at 
org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:221)
at 
org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:270)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:140)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:207)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:205)
at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:207)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:205)
at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:207)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:205)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:891)
at 
org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopDataset(PairRDDFunctions.scala:741)
at 
org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:692)
at 
org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:574)
at org.apache.spark.rdd.RDD.saveAsTextFile(RDD.scala:900)
at $iwC$$iwC$$iwC$$iwC.init(console:15)
at $iwC$$iwC$$iwC.init(console:20)
at $iwC$$iwC.init(console:22)
at $iwC.init(console:24)
at init(console:26)
at .init(console:30)
at .clinit(console)
at .init(console:7)
at .clinit(console)
at $print(console)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at 
org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:772)
at 
org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1040)
at 
org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:609)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:640)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:604)
at 
org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:793)
at 
org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:838)
at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:750)
at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:598)
at 

[jira] [Commented] (SPARK-1556) jets3t dependency is outdated

2014-04-21 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13975957#comment-13975957
 ] 

Sean Owen commented on SPARK-1556:
--

Actually, why does Spark have a direct dependency on jets3t at all? It is not 
used directly in the code.

If it's only needed at runtime, it can/should be declared that way. But if the 
reason it's there is just for Hadoop, then of course hadoop-client is already 
bringing it in, and it should be allowed to bring in the version it wants.

 jets3t dependency is outdated
 -

 Key: SPARK-1556
 URL: https://issues.apache.org/jira/browse/SPARK-1556
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 0.8.1, 0.9.0, 1.0.0
Reporter: Nan Zhu
Assignee: Nan Zhu
 Fix For: 1.0.0


 In Hadoop 2.2.x or newer, jets3t 0.9.0, which defines 
 S3ServiceException/ServiceException, is introduced; however, Spark still 
 relies on jets3t 0.7.x, which has no definition of these classes.
 What I hit is the following: 
 [code]
 14/04/21 19:30:53 INFO deprecation: mapred.job.id is deprecated. Instead, use 
 mapreduce.job.id
 14/04/21 19:30:53 INFO deprecation: mapred.tip.id is deprecated. Instead, use 
 mapreduce.task.id
 14/04/21 19:30:53 INFO deprecation: mapred.task.id is deprecated. Instead, 
 use mapreduce.task.attempt.id
 14/04/21 19:30:53 INFO deprecation: mapred.task.is.map is deprecated. 
 Instead, use mapreduce.task.ismap
 14/04/21 19:30:53 INFO deprecation: mapred.task.partition is deprecated. 
 Instead, use mapreduce.task.partition
 java.lang.NoClassDefFoundError: org/jets3t/service/S3ServiceException
   at 
 org.apache.hadoop.fs.s3native.NativeS3FileSystem.createDefaultStore(NativeS3FileSystem.java:280)
   at 
 org.apache.hadoop.fs.s3native.NativeS3FileSystem.initialize(NativeS3FileSystem.java:270)
   at 
 org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2316)
   at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:90)
   at 
 org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2350)
   at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2332)
   at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:369)
   at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
   at 
 org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:221)
   at 
 org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:270)
   at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:140)
   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:207)
   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
   at scala.Option.getOrElse(Option.scala:120)
   at org.apache.spark.rdd.RDD.partitions(RDD.scala:205)
   at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:207)
   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
   at scala.Option.getOrElse(Option.scala:120)
   at org.apache.spark.rdd.RDD.partitions(RDD.scala:205)
   at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:207)
   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
   at scala.Option.getOrElse(Option.scala:120)
   at org.apache.spark.rdd.RDD.partitions(RDD.scala:205)
   at org.apache.spark.SparkContext.runJob(SparkContext.scala:891)
   at 
 org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopDataset(PairRDDFunctions.scala:741)
   at 
 org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:692)
   at 
 org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:574)
   at org.apache.spark.rdd.RDD.saveAsTextFile(RDD.scala:900)
   at $iwC$$iwC$$iwC$$iwC.init(console:15)
   at $iwC$$iwC$$iwC.init(console:20)
   at $iwC$$iwC.init(console:22)
   at $iwC.init(console:24)
   at init(console:26)
   at .init(console:30)
   at .clinit(console)
   at .init(console:7)
   at .clinit(console)
   at $print(console)
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
   at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   at java.lang.reflect.Method.invoke(Method.java:606)
   at 
 org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:772)
   at 
 org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1040)
   at 
 org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:609)
   at 

[jira] [Commented] (SPARK-1556) jets3t dependency is outdated

2014-04-21 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13975967#comment-13975967
 ] 

Sean Owen commented on SPARK-1556:
--

OK, I partly eat my words. jets3t isn't included by the Hadoop client library, 
it appears. It's only included by the Hadoop server-side components. So yeah, 
Spark has to include jets3t to make s3:// URLs work in the REPL. FWIW I agree 
with updating the version -- ideally just in the Hadoop 2.2+ profiles. And it 
should be <scope>runtime</scope>.
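
For illustration only, the equivalent in the sbt build would look roughly like this (coordinates and version are the ones discussed above; the actual build change, and whether it lives in a Hadoop 2.2+ profile, may differ):

{code}
// Runtime-only dependency: needed on the classpath for s3:// URLs, never compiled against.
libraryDependencies += "net.java.dev.jets3t" % "jets3t" % "0.9.0" % "runtime"
{code}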

 jets3t dependency is outdated
 -

 Key: SPARK-1556
 URL: https://issues.apache.org/jira/browse/SPARK-1556
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 0.8.1, 0.9.0, 1.0.0
Reporter: Nan Zhu
Assignee: Nan Zhu
 Fix For: 1.0.0


 In Hadoop 2.2.x or newer, jets3t 0.9.0, which defines 
 S3ServiceException/ServiceException, is introduced; however, Spark still 
 relies on jets3t 0.7.x, which has no definition of these classes.
 What I hit is the following: 
 [code]
 14/04/21 19:30:53 INFO deprecation: mapred.job.id is deprecated. Instead, use 
 mapreduce.job.id
 14/04/21 19:30:53 INFO deprecation: mapred.tip.id is deprecated. Instead, use 
 mapreduce.task.id
 14/04/21 19:30:53 INFO deprecation: mapred.task.id is deprecated. Instead, 
 use mapreduce.task.attempt.id
 14/04/21 19:30:53 INFO deprecation: mapred.task.is.map is deprecated. 
 Instead, use mapreduce.task.ismap
 14/04/21 19:30:53 INFO deprecation: mapred.task.partition is deprecated. 
 Instead, use mapreduce.task.partition
 java.lang.NoClassDefFoundError: org/jets3t/service/S3ServiceException
   at 
 org.apache.hadoop.fs.s3native.NativeS3FileSystem.createDefaultStore(NativeS3FileSystem.java:280)
   at 
 org.apache.hadoop.fs.s3native.NativeS3FileSystem.initialize(NativeS3FileSystem.java:270)
   at 
 org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2316)
   at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:90)
   at 
 org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2350)
   at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2332)
   at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:369)
   at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
   at 
 org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:221)
   at 
 org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:270)
   at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:140)
   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:207)
   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
   at scala.Option.getOrElse(Option.scala:120)
   at org.apache.spark.rdd.RDD.partitions(RDD.scala:205)
   at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:207)
   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
   at scala.Option.getOrElse(Option.scala:120)
   at org.apache.spark.rdd.RDD.partitions(RDD.scala:205)
   at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:207)
   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
   at scala.Option.getOrElse(Option.scala:120)
   at org.apache.spark.rdd.RDD.partitions(RDD.scala:205)
   at org.apache.spark.SparkContext.runJob(SparkContext.scala:891)
   at 
 org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopDataset(PairRDDFunctions.scala:741)
   at 
 org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:692)
   at 
 org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:574)
   at org.apache.spark.rdd.RDD.saveAsTextFile(RDD.scala:900)
   at $iwC$$iwC$$iwC$$iwC.init(console:15)
   at $iwC$$iwC$$iwC.init(console:20)
   at $iwC$$iwC.init(console:22)
   at $iwC.init(console:24)
   at init(console:26)
   at .init(console:30)
   at .clinit(console)
   at .init(console:7)
   at .clinit(console)
   at $print(console)
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
   at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   at java.lang.reflect.Method.invoke(Method.java:606)
   at 
 org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:772)
   at 
 org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1040)
   at 
 org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:609)
   at 

[jira] [Created] (SPARK-1557) Set permissions on event log files/directories

2014-04-21 Thread Thomas Graves (JIRA)
Thomas Graves created SPARK-1557:


 Summary: Set permissions on event log files/directories
 Key: SPARK-1557
 URL: https://issues.apache.org/jira/browse/SPARK-1557
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.0.0
Reporter: Thomas Graves
Assignee: Thomas Graves


We should set the permissions on the event log directories and files so that they 
restrict access to only the users who own them, but could also allow a super 
user to read them, so that they can be displayed by the history server in a 
multi-tenant secure environment. 
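
A hedged sketch of what that could look like (the permission bits and helper name are illustrative, not the final values):

{code}
// Restrict an event log directory to its owner (and a configured admin group);
// an HDFS superuser can read it regardless, so the history server still works.
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.fs.permission.FsPermission

def restrictEventLogDir(dir: String): Unit = {
  val path = new Path(dir)
  val fs = path.getFileSystem(new Configuration())
  fs.setPermission(path, new FsPermission("770"))  // rwxrwx---
}
{code}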





--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1529) Support setting spark.local.dirs to a hadoop FileSystem

2014-04-21 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13976038#comment-13976038
 ] 

Patrick Wendell commented on SPARK-1529:


One idea proposed by [~adav] was to always use the Hadoop filesystem API, but 
to potentially implement our own version of the local filesystem if we find the 
Hadoop version has performance drawbacks.

Another issue is that we use FileChannel objects directly in 
{{DiskBlockObjectWriter}}. After looking through this a bit, the functionality 
there to commit and rewind writes is not actually used anywhere, so we could 
probably just remove it.

[~liancheng] I think it would be worth looking at a version where we just 
take all of the File APIs and replace them with Hadoop equivalents, i.e. your 
proposal.



 Support setting spark.local.dirs to a hadoop FileSystem 
 

 Key: SPARK-1529
 URL: https://issues.apache.org/jira/browse/SPARK-1529
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: Patrick Wendell
Assignee: Cheng Lian
 Fix For: 1.1.0


 In some environments, like with MapR, local volumes are accessed through the 
 Hadoop filesystem interface. We should allow setting spark.local.dir to a 
 Hadoop filesystem location. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (SPARK-1399) Reason for Stage Failure should be shown in UI

2014-04-21 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-1399.


Resolution: Fixed

 Reason for Stage Failure should be shown in UI
 --

 Key: SPARK-1399
 URL: https://issues.apache.org/jira/browse/SPARK-1399
 Project: Spark
  Issue Type: Bug
Affects Versions: 0.9.0
Reporter: Kay Ousterhout
Assignee: Nan Zhu

 Right now, we don't show why a stage failed in the UI.  We have this 
 information, and it would be useful for users to see (e.g., to see that a 
 stage was killed because the job was cancelled).



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1399) Reason for Stage Failure should be shown in UI

2014-04-21 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-1399:
---

Fix Version/s: 1.0.0

 Reason for Stage Failure should be shown in UI
 --

 Key: SPARK-1399
 URL: https://issues.apache.org/jira/browse/SPARK-1399
 Project: Spark
  Issue Type: Bug
Affects Versions: 0.9.0
Reporter: Kay Ousterhout
Assignee: Nan Zhu
 Fix For: 1.0.0


 Right now, we don't show why a stage failed in the UI.  We have this 
 information, and it would be useful for users to see (e.g., to see that a 
 stage was killed because the job was cancelled).



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (SPARK-1539) RDDPage.scala contains RddPage

2014-04-21 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-1539.


   Resolution: Fixed
Fix Version/s: 1.0.0

 RDDPage.scala contains RddPage
 --

 Key: SPARK-1539
 URL: https://issues.apache.org/jira/browse/SPARK-1539
 Project: Spark
  Issue Type: Bug
  Components: Web UI
Affects Versions: 1.0.0
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng
 Fix For: 1.0.0


 SPARK-1386 changed RDDPage to RddPage but didn't change the filename. I tried 
 sbt/sbt publish-local. Inside the spark-core jar, the unit name is 
 RDDPage.class and hence I got the following error:
 {code}
 [error] (run-main) java.lang.NoClassDefFoundError: 
 org/apache/spark/ui/storage/RddPage
 java.lang.NoClassDefFoundError: org/apache/spark/ui/storage/RddPage
   at org.apache.spark.ui.SparkUI.initialize(SparkUI.scala:59)
   at org.apache.spark.ui.SparkUI.init(SparkUI.scala:52)
   at org.apache.spark.ui.SparkUI.init(SparkUI.scala:42)
   at org.apache.spark.SparkContext.init(SparkContext.scala:215)
   at MovieLensALS$.main(MovieLensALS.scala:38)
   at MovieLensALS.main(MovieLensALS.scala)
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
   at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   at java.lang.reflect.Method.invoke(Method.java:606)
 Caused by: java.lang.ClassNotFoundException: 
 org.apache.spark.ui.storage.RddPage
   at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
   at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
   at java.security.AccessController.doPrivileged(Native Method)
   at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
   at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
   at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
   at org.apache.spark.ui.SparkUI.initialize(SparkUI.scala:59)
   at org.apache.spark.ui.SparkUI.init(SparkUI.scala:52)
   at org.apache.spark.ui.SparkUI.init(SparkUI.scala:42)
   at org.apache.spark.SparkContext.init(SparkContext.scala:215)
   at MovieLensALS$.main(MovieLensALS.scala:38)
   at MovieLensALS.main(MovieLensALS.scala)
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
   at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   at java.lang.reflect.Method.invoke(Method.java:606)
 {code}
 This can be fixed after renaming RddPage to RDDPage.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-1558) [streaming] Update receiver information to match it with code

2014-04-21 Thread Tathagata Das (JIRA)
Tathagata Das created SPARK-1558:


 Summary: [streaming] Update receiver information to match it with 
code
 Key: SPARK-1558
 URL: https://issues.apache.org/jira/browse/SPARK-1558
 Project: Spark
  Issue Type: Sub-task
  Components: Streaming
Reporter: Tathagata Das
Assignee: Tathagata Das
Priority: Blocker






--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1505) [streaming] Add 0.9 to 1.0 migration guide for streaming receiver

2014-04-21 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das updated SPARK-1505:
-

Component/s: Streaming

 [streaming] Add 0.9 to 1.0 migration guide for streaming receiver
 -

 Key: SPARK-1505
 URL: https://issues.apache.org/jira/browse/SPARK-1505
 Project: Spark
  Issue Type: Sub-task
  Components: Documentation, Streaming
Reporter: Tathagata Das
 Fix For: 1.0.0






--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1558) [streaming] Update receiver information to match it with code

2014-04-21 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das updated SPARK-1558:
-

Component/s: Documentation

 [streaming] Update receiver information to match it with code
 -

 Key: SPARK-1558
 URL: https://issues.apache.org/jira/browse/SPARK-1558
 Project: Spark
  Issue Type: Sub-task
  Components: Documentation, Streaming
Reporter: Tathagata Das
Assignee: Tathagata Das
Priority: Blocker
 Fix For: 1.0.0






--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1504) [streaming] Add deployment subsection to streaming

2014-04-21 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1504?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das updated SPARK-1504:
-

Component/s: Streaming

 [streaming] Add deployment subsection to streaming
 --

 Key: SPARK-1504
 URL: https://issues.apache.org/jira/browse/SPARK-1504
 Project: Spark
  Issue Type: Sub-task
  Components: Documentation, Streaming
Reporter: Tathagata Das
Assignee: Tathagata Das
 Fix For: 1.0.0






--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Assigned] (SPARK-1457) Change APIs for training algorithms to take optimizer as parameter

2014-04-21 Thread DB Tsai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DB Tsai reassigned SPARK-1457:
--

Assignee: DB Tsai

 Change APIs for training algorithms to take optimizer as parameter 
 ---

 Key: SPARK-1457
 URL: https://issues.apache.org/jira/browse/SPARK-1457
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: DB Tsai
Assignee: DB Tsai

 Currently, the training API has a signature like LogisticRegressionWithSGD. 
 If we want to use another optimizer, we have two options: either add a new API 
 like LogisticRegressionWithNewOptimizer, which causes 99% code duplication, or 
 refactor the API to take the optimizer as a parameter, like the following. 
 class LogisticRegression private (
     var optimizer: Optimizer)
   extends GeneralizedLinearAlgorithm[LogisticRegressionModel]
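
As an illustration of why the parameterized form avoids the duplication, here is a hedged usage sketch with simplified stand-in types (MLlib's real Optimizer and GeneralizedLinearAlgorithm differ):

{code}
// One training class, any optimizer -- no LogisticRegressionWithX per optimizer.
trait Optimizer {
  def optimize(data: Seq[(Double, Array[Double])]): Array[Double]
}
class GradientDescent extends Optimizer {
  def optimize(data: Seq[(Double, Array[Double])]): Array[Double] = Array(0.0) // stub
}
class LBFGS extends Optimizer {
  def optimize(data: Seq[(Double, Array[Double])]): Array[Double] = Array(0.0) // stub
}

class LogisticRegression(var optimizer: Optimizer) {
  def run(data: Seq[(Double, Array[Double])]): Array[Double] = optimizer.optimize(data)
}

new LogisticRegression(new GradientDescent).run(Nil)
new LogisticRegression(new LBFGS).run(Nil)
{code}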



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Assigned] (SPARK-1516) Yarn Client should not call System.exit, should throw exception instead.

2014-04-21 Thread DB Tsai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DB Tsai reassigned SPARK-1516:
--

Assignee: DB Tsai

 Yarn Client should not call System.exit, should throw exception instead.
 

 Key: SPARK-1516
 URL: https://issues.apache.org/jira/browse/SPARK-1516
 Project: Spark
  Issue Type: Improvement
  Components: Deploy
Reporter: DB Tsai
Assignee: DB Tsai

 People submit Spark jobs inside their applications to a YARN cluster using the 
 Spark YARN client, and it's not desirable to call System.exit in the YARN 
 client, which will terminate the parent application as well.
 We should throw an exception instead, and people can determine which action 
 they want to take given the exception.
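
A minimal sketch of the change being proposed (SparkException is the existing exception type in Spark core; the helper name, message, and call sites here are made up):

{code}
import org.apache.spark.SparkException

def failSubmission(reason: String): Nothing = {
  // was: System.exit(1), which also kills the application embedding the client
  throw new SparkException(s"YARN application submission failed: $reason")
}
{code}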



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-1559) Add conf dir to CLASSPATH in compute-classpath.sh dependent on whether SPARK_CONF_DIR is set

2014-04-21 Thread Albert Chu (JIRA)
Albert Chu created SPARK-1559:
-

 Summary: Add conf dir to CLASSPATH in compute-classpath.sh 
dependent on whether SPARK_CONF_DIR is set
 Key: SPARK-1559
 URL: https://issues.apache.org/jira/browse/SPARK-1559
 Project: Spark
  Issue Type: Bug
  Components: Deploy
Affects Versions: 1.0.0
Reporter: Albert Chu
Priority: Minor
 Attachments: SPARK-1559.patch

bin/load-spark-env.sh loads spark-env.sh from SPARK_CONF_DIR if it is set, or 
from $parent_dir/conf if it is not set.

However, compute-classpath.sh adds $FWDIR/conf to the CLASSPATH regardless of 
whether SPARK_CONF_DIR is set.

The attached patch fixes this.  A pull request on GitHub will also be sent.





--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-693) Let deploy scripts set alternate conf, work directories

2014-04-21 Thread Albert Chu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Albert Chu updated SPARK-693:
-

Attachment: SPARK-693.patch

We required this support in our environment.  Attached is my patch to implement 
this for Spark 1.0.0.  A Git pull request will be sent too.

 Let deploy scripts set alternate conf, work directories
 ---

 Key: SPARK-693
 URL: https://issues.apache.org/jira/browse/SPARK-693
 Project: Spark
  Issue Type: Improvement
Affects Versions: 0.6.2
Reporter: David Chiang
Priority: Minor
 Attachments: SPARK-693.patch


 Currently SPARK_CONF_DIR is overridden in spark-config.sh, and 
 start-slaves.sh doesn't allow the user to pass a -d option in to set the work 
 directory. Allowing this is a small change and makes it possible to have 
 multiple clusters running at once.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1543) Add ADMM for solving Lasso (and elastic net) problem

2014-04-21 Thread Shuo Xiang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shuo Xiang updated SPARK-1543:
--

Description: 
This PR introduces the Alternating Direction Method of Multipliers (ADMM) for 
solving Lasso (elastic net, in fact) in mllib. 

ADMM is capable of solving a class of composite minimization problems in a 
distributed way. Specifically for Lasso (if only L1-regularization) or 
elastic-net (both L1- and L2- regularization), in each iteration, it requires 
solving independent systems of linear equations on each partition and a 
subsequent soft-threholding operation on the driver machine. Unlike SGD, it is 
a deterministic algorithm (except for the random partition). Details can be 
found in the [S. Boyd's 
paper](http://www.stanford.edu/~boyd/papers/admm_distr_stats.html).

The linear algebra operations mainly rely on the Breeze library, particularly, 
it applies `breeze.linalg.cholesky` to perform cholesky decomposition on each 
partition to solve the linear system.

I tried to follow the organization of existing Lasso implementation. However, 
as ADMM is also a good fit for similar optimization problems, e.g., (sparse) 
logistic regression, it may be worth reorganizing and putting ADMM into a 
separate section.


  was:
This PR introduces the Alternating Direction Method of Multipliers (ADMM) for 
solving Lasso (elastic net, in fact) in mllib. 

ADMM is capable of solving a class of composite minimization problems in a 
distributed way. Specifically for Lasso (if only L1-regularization) or 
elastic-net (both L1- and L2- regularization), it requires solving independent 
systems of linear equations on each partition and a soft-threholding operation 
on the driver. Unlike SGD, it is a deterministic algorithm (except for the 
random partition). Details can be found in the [S. Boyd's 
paper](http://www.stanford.edu/~boyd/papers/admm_distr_stats.html).

The linear algebra operations mainly rely on the Breeze library, particularly, 
it applies `breeze.linalg.cholesky` to perform cholesky decomposition on each 
partition to solve the linear system.

I tried to follow the organization of existing Lasso implementation. However, 
as ADMM is also a good fit for similar optimization problems, e.g., (sparse) 
logistic regression, it may worth to re-organize and put ADMM into a separate 
section.

PR: https://github.com/apache/spark/pull/458



 Add ADMM for solving Lasso (and elastic net) problem
 

 Key: SPARK-1543
 URL: https://issues.apache.org/jira/browse/SPARK-1543
 Project: Spark
  Issue Type: New Feature
Reporter: Shuo Xiang
Priority: Minor
  Labels: features
   Original Estimate: 168h
  Remaining Estimate: 168h

 This PR introduces the Alternating Direction Method of Multipliers (ADMM) for 
 solving Lasso (elastic net, in fact) in mllib. 
 ADMM is capable of solving a class of composite minimization problems in a 
 distributed way. Specifically for Lasso (L1 regularization only) or 
 elastic net (both L1 and L2 regularization), each iteration requires 
 solving independent systems of linear equations on each partition and a 
 subsequent soft-thresholding operation on the driver machine. Unlike SGD, it 
 is a deterministic algorithm (except for the random partitioning). Details can 
 be found in [S. Boyd's 
 paper](http://www.stanford.edu/~boyd/papers/admm_distr_stats.html).
 The linear algebra operations mainly rely on the Breeze library; in 
 particular, it applies `breeze.linalg.cholesky` to perform a Cholesky 
 decomposition on each partition to solve the linear system.
 I tried to follow the organization of the existing Lasso implementation. 
 However, as ADMM is also a good fit for similar optimization problems, e.g., 
 (sparse) logistic regression, it may be worth reorganizing and putting ADMM 
 into a separate section.
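
For reference, a sketch of the driver-side soft-thresholding step mentioned in the description (the per-partition Cholesky solves via breeze.linalg.cholesky are omitted; names are illustrative):

{code}
// Elementwise soft-thresholding: S_kappa(x) = sign(x) * max(|x| - kappa, 0).
def softThreshold(v: Array[Double], kappa: Double): Array[Double] =
  v.map(x => math.signum(x) * math.max(math.abs(x) - kappa, 0.0))

// softThreshold(Array(1.5, -0.2, 0.7), 0.5) shrinks each entry toward zero
// by 0.5, giving approximately Array(1.0, 0.0, 0.2).
{code}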



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-1561) sbt/sbt assembly generates too many local files

2014-04-21 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-1561:


 Summary: sbt/sbt assembly generates too many local files
 Key: SPARK-1561
 URL: https://issues.apache.org/jira/browse/SPARK-1561
 Project: Spark
  Issue Type: Improvement
Affects Versions: 1.0.0
Reporter: Xiangrui Meng


Running `find ./ | wc -l` after `sbt/sbt assembly` returned 

564365

This hits the default inode limit of an 8GB ext filesystem (the default volume 
size for an EC2 instance), which means you can do nothing after `sbt/sbt assembly` 
on such a partition.

Most of the small files are under assembly/target/streams and the same folder 
under examples/.




--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-1562) Exclude internal catalyst classes from scaladoc, or make them package private

2014-04-21 Thread Patrick Wendell (JIRA)
Patrick Wendell created SPARK-1562:
--

 Summary: Exclude internal catalyst classes from scaladoc, or make 
them package private
 Key: SPARK-1562
 URL: https://issues.apache.org/jira/browse/SPARK-1562
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Patrick Wendell
Assignee: Michael Armbrust
Priority: Blocker
 Fix For: 1.0.0


Michael - this is up to you, but I noticed there are a ton of internal Catalyst 
types that show up in our scaladoc. I'm not sure if you mean these to be 
user-facing APIs. If not, it might be good to hide them from the docs or make 
them package private.
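
A small illustration of the package-private option (the class name and package are made up; the other route is excluding the Catalyst sources from the scaladoc/unidoc task in the build):

{code}
package org.apache.spark.sql.catalyst.example

// Visible throughout org.apache.spark.sql, but not part of the user-facing API
// and skipped by the generated docs.
private[sql] class InternalPlanNode
{code}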



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (SPARK-1440) Generate JavaDoc instead of ScalaDoc for Java API

2014-04-21 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-1440.


Resolution: Fixed

 Generate JavaDoc instead of ScalaDoc for Java API
 -

 Key: SPARK-1440
 URL: https://issues.apache.org/jira/browse/SPARK-1440
 Project: Spark
  Issue Type: Sub-task
  Components: Documentation
Reporter: Matei Zaharia
Assignee: Matei Zaharia
 Fix For: 1.0.0


 It may be possible to use this plugin:  
 https://github.com/typesafehub/genjavadoc



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (SPARK-1439) Aggregate Scaladocs across projects

2014-04-21 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-1439.


Resolution: Fixed

 Aggregate Scaladocs across projects
 ---

 Key: SPARK-1439
 URL: https://issues.apache.org/jira/browse/SPARK-1439
 Project: Spark
  Issue Type: Sub-task
  Components: Documentation
Reporter: Matei Zaharia
Assignee: Matei Zaharia
 Fix For: 1.0.0


 Apparently there's a Unidoc plugin to put together ScalaDocs across 
 modules: https://github.com/akka/akka/blob/master/project/Unidoc.scala



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (SPARK-1332) Improve Spark Streaming's Network Receiver and InputDStream API for future stability

2014-04-21 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-1332.


   Resolution: Fixed
Fix Version/s: 1.0.0

 Improve Spark Streaming's Network Receiver and InputDStream API for future 
 stability
 

 Key: SPARK-1332
 URL: https://issues.apache.org/jira/browse/SPARK-1332
 Project: Spark
  Issue Type: Improvement
  Components: Streaming
Affects Versions: 0.9.0
Reporter: Tathagata Das
Assignee: Tathagata Das
Priority: Blocker
 Fix For: 1.0.0


 The current Network Receiver API makes it slightly complicated to write a new 
 receiver, as one needs to create an instance of BlockGenerator as shown in 
 SocketReceiver 
 https://github.com/apache/spark/blob/master/streaming/src/main/scala/org/apache/spark/streaming/dstream/SocketInputDStream.scala#L51
 Exposing the BlockGenerator interface has made it harder to improve the 
 receiving process. The API of NetworkReceiver (which was not a very stable 
 API anyway) needs to be changed if we are to ensure future stability. 
 Additionally, functions like streamingContext.socketStream that create 
 input streams return DStream objects. That makes it hard to expose 
 functionality (say, rate limits) unique to input dstreams. They should return 
 InputDStream or NetworkInputDStream.



--
This message was sent by Atlassian JIRA
(v6.2#6252)