[jira] [Created] (SPARK-1387) Update build plugins, avoid plugin version warning, centralize versions
Sean Owen created SPARK-1387: Summary: Update build plugins, avoid plugin version warning, centralize versions Key: SPARK-1387 URL: https://issues.apache.org/jira/browse/SPARK-1387 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 0.9.0 Reporter: Sean Owen Priority: Minor

Another handful of small build changes to organize and standardize a bit, and avoid warnings:
- Update Maven plugin versions for good measure
- Since plugins need maven 3.0.4 already, require it explicitly (3.0.4 had some bugs anyway)
- Use variables to define versions across dependencies where they should move in lock step
- ... and make this consistent between Maven/SBT

-- This message was sent by Atlassian JIRA (v6.2#6252)
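[Editor's sketch] To illustrate the "use variables to define versions" point on the sbt side, a hedged example follows; the artifact names and version strings are placeholders chosen for the example (the Maven side would use a <properties> entry referenced as a ${...} property), not the actual SparkBuild.scala change.
{code}
// Illustrative only: pin versions that must move in lock step in one place,
// then reference them from every dependency that shares that version.
val jettyVersion = "8.1.14.v20131031"      // placeholder version strings
val akkaVersion  = "2.2.3-shaded-protobuf"

libraryDependencies ++= Seq(
  "org.eclipse.jetty"      % "jetty-server" % jettyVersion,
  "org.eclipse.jetty"      % "jetty-util"   % jettyVersion,
  "org.spark-project.akka" %% "akka-remote" % akkaVersion,
  "org.spark-project.akka" %% "akka-slf4j"  % akkaVersion
)
{code}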
[jira] [Commented] (SPARK-1355) Switch website to the Apache CMS
[ https://issues.apache.org/jira/browse/SPARK-1355?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13957113#comment-13957113 ] Sean Owen commented on SPARK-1355: -- April Fools, apparently. Though this was opened on 30 March? Switch website to the Apache CMS Key: SPARK-1355 URL: https://issues.apache.org/jira/browse/SPARK-1355 Project: Spark Issue Type: Improvement Components: Project Infra Reporter: Joe Schaefer Jekyll is ancient history useful for small blogger sites and little else. Why not upgrade to the Apache CMS? It supports the same on-disk format for .md files and interfaces with pygments for code highlighting. Thrift recently switched from nanoc to the CMS and loves it! -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1388) ConcurrentModificationException in hadoop_common exposed by Spark
[ https://issues.apache.org/jira/browse/SPARK-1388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13957330#comment-13957330 ] Sean Owen commented on SPARK-1388: -- Yes this should be resolved as a duplicate instead. ConcurrentModificationException in hadoop_common exposed by Spark - Key: SPARK-1388 URL: https://issues.apache.org/jira/browse/SPARK-1388 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 0.9.0 Reporter: Nishkam Ravi Attachments: nravi_Conf_Spark-1388.patch The following exception occurs non-deterministically: java.util.ConcurrentModificationException at java.util.HashMap$HashIterator.nextEntry(HashMap.java:926) at java.util.HashMap$KeyIterator.next(HashMap.java:960) at java.util.AbstractCollection.addAll(AbstractCollection.java:341) at java.util.HashSet.init(HashSet.java:117) at org.apache.hadoop.conf.Configuration.init(Configuration.java:671) at org.apache.hadoop.mapred.JobConf.init(JobConf.java:439) at org.apache.spark.rdd.HadoopRDD.getJobConf(HadoopRDD.scala:110) at org.apache.spark.rdd.HadoopRDD$$anon$1.init(HadoopRDD.scala:154) at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:149) at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:64) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241) at org.apache.spark.rdd.RDD.iterator(RDD.scala:232) at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241) at org.apache.spark.rdd.RDD.iterator(RDD.scala:232) at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241) at org.apache.spark.rdd.RDD.iterator(RDD.scala:232) at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241) at org.apache.spark.rdd.RDD.iterator(RDD.scala:232) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:34) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241) at org.apache.spark.rdd.RDD.iterator(RDD.scala:232) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:161) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:102) at org.apache.spark.scheduler.Task.run(Task.scala:53) at org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:213) at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:42) at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:41) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548) at org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:41) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:744) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1391) BlockManager cannot transfer blocks larger than 2G in size
[ https://issues.apache.org/jira/browse/SPARK-1391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13957450#comment-13957450 ] Sean Owen commented on SPARK-1391: -- Oops yes I mean offset of course. Good investigation there. I am also not sure why the index would not show. BlockManager cannot transfer blocks larger than 2G in size -- Key: SPARK-1391 URL: https://issues.apache.org/jira/browse/SPARK-1391 Project: Spark Issue Type: Bug Components: Block Manager, Shuffle Affects Versions: 1.0.0 Reporter: Shivaram Venkataraman If a task tries to remotely access a cached RDD block, I get an exception when the block size is 2G. The exception is pasted below. Memory capacities are huge these days ( 60G), and many workflows depend on having large blocks in memory, so it would be good to fix this bug. I don't know if the same thing happens on shuffles if one transfer (from mapper to reducer) is 2G. 14/04/02 02:33:10 ERROR storage.BlockManagerWorker: Exception handling buffer message java.lang.ArrayIndexOutOfBoundsException at it.unimi.dsi.fastutil.io.FastByteArrayOutputStream.write(FastByteArrayOutputStream.java:96) at it.unimi.dsi.fastutil.io.FastBufferedOutputStream.dumpBuffer(FastBufferedOutputStream.java:134) at it.unimi.dsi.fastutil.io.FastBufferedOutputStream.write(FastBufferedOutputStream.java:164) at java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1876) at java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1785) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1188) at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347) at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:38) at org.apache.spark.serializer.SerializationStream$class.writeAll(Serializer.scala:93) at org.apache.spark.serializer.JavaSerializationStream.writeAll(JavaSerializer.scala:26) at org.apache.spark.storage.BlockManager.dataSerializeStream(BlockManager.scala:913) at org.apache.spark.storage.BlockManager.dataSerialize(BlockManager.scala:922) at org.apache.spark.storage.MemoryStore.getBytes(MemoryStore.scala:102) at org.apache.spark.storage.BlockManager.doGetLocal(BlockManager.scala:348) at org.apache.spark.storage.BlockManager.getLocalBytes(BlockManager.scala:323) at org.apache.spark.storage.BlockManagerWorker.getBlock(BlockManagerWorker.scala:90) at org.apache.spark.storage.BlockManagerWorker.processBlockMessage(BlockManagerWorker.scala:69) at org.apache.spark.storage.BlockManagerWorker$$anonfun$2.apply(BlockManagerWorker.scala:44) at org.apache.spark.storage.BlockManagerWorker$$anonfun$2.apply(BlockManagerWorker.scala:44) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) at org.apache.spark.storage.BlockMessageArray.foreach(BlockMessageArray.scala:28) at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) at org.apache.spark.storage.BlockMessageArray.map(BlockMessageArray.scala:28) at org.apache.spark.storage.BlockManagerWorker.onBlockMessageReceive(BlockManagerWorker.scala:44) at org.apache.spark.storage.BlockManagerWorker$$anonfun$1.apply(BlockManagerWorker.scala:34) at 
org.apache.spark.storage.BlockManagerWorker$$anonfun$1.apply(BlockManagerWorker.scala:34) at org.apache.spark.network.ConnectionManager.org$apache$spark$network$ConnectionManager$$handleMessage(ConnectionManager.scala:661) at org.apache.spark.network.ConnectionManager$$anon$9.run(ConnectionManager.scala:503) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:744) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1391) BlockManager cannot transfer blocks larger than 2G in size
[ https://issues.apache.org/jira/browse/SPARK-1391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13957949#comment-13957949 ] Sean Owen commented on SPARK-1391: -- Of course! Hardly my issue. Well, you could try my patch that replaces fastutil with alternatives. I doubt the standard ByteArrayOutputStream does differently though? But we are always going to have a problem in that a Java byte array can only be so big because of the size of an int, regardless of stream position issues. This one could be deeper. BlockManager cannot transfer blocks larger than 2G in size -- Key: SPARK-1391 URL: https://issues.apache.org/jira/browse/SPARK-1391 Project: Spark Issue Type: Bug Components: Block Manager, Shuffle Affects Versions: 1.0.0 Reporter: Shivaram Venkataraman If a task tries to remotely access a cached RDD block, I get an exception when the block size is 2G. The exception is pasted below. Memory capacities are huge these days ( 60G), and many workflows depend on having large blocks in memory, so it would be good to fix this bug. I don't know if the same thing happens on shuffles if one transfer (from mapper to reducer) is 2G. 14/04/02 02:33:10 ERROR storage.BlockManagerWorker: Exception handling buffer message java.lang.ArrayIndexOutOfBoundsException at it.unimi.dsi.fastutil.io.FastByteArrayOutputStream.write(FastByteArrayOutputStream.java:96) at it.unimi.dsi.fastutil.io.FastBufferedOutputStream.dumpBuffer(FastBufferedOutputStream.java:134) at it.unimi.dsi.fastutil.io.FastBufferedOutputStream.write(FastBufferedOutputStream.java:164) at java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1876) at java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1785) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1188) at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347) at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:38) at org.apache.spark.serializer.SerializationStream$class.writeAll(Serializer.scala:93) at org.apache.spark.serializer.JavaSerializationStream.writeAll(JavaSerializer.scala:26) at org.apache.spark.storage.BlockManager.dataSerializeStream(BlockManager.scala:913) at org.apache.spark.storage.BlockManager.dataSerialize(BlockManager.scala:922) at org.apache.spark.storage.MemoryStore.getBytes(MemoryStore.scala:102) at org.apache.spark.storage.BlockManager.doGetLocal(BlockManager.scala:348) at org.apache.spark.storage.BlockManager.getLocalBytes(BlockManager.scala:323) at org.apache.spark.storage.BlockManagerWorker.getBlock(BlockManagerWorker.scala:90) at org.apache.spark.storage.BlockManagerWorker.processBlockMessage(BlockManagerWorker.scala:69) at org.apache.spark.storage.BlockManagerWorker$$anonfun$2.apply(BlockManagerWorker.scala:44) at org.apache.spark.storage.BlockManagerWorker$$anonfun$2.apply(BlockManagerWorker.scala:44) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) at org.apache.spark.storage.BlockMessageArray.foreach(BlockMessageArray.scala:28) at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) at org.apache.spark.storage.BlockMessageArray.map(BlockMessageArray.scala:28) at 
org.apache.spark.storage.BlockManagerWorker.onBlockMessageReceive(BlockManagerWorker.scala:44) at org.apache.spark.storage.BlockManagerWorker$$anonfun$1.apply(BlockManagerWorker.scala:34) at org.apache.spark.storage.BlockManagerWorker$$anonfun$1.apply(BlockManagerWorker.scala:34) at org.apache.spark.network.ConnectionManager.org$apache$spark$network$ConnectionManager$$handleMessage(ConnectionManager.scala:661) at org.apache.spark.network.ConnectionManager$$anon$9.run(ConnectionManager.scala:503) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:744) -- This message was sent by Atlassian JIRA (v6.2#6252)
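[Editor's sketch] As a quick sanity check on the ceiling discussed in the comments above, the arithmetic behind the 2G limit is below; nothing Spark-specific, just the consequence of JVM arrays being indexed by Int.
{code}
// A JVM byte array can hold at most Int.MaxValue elements, so any buffer backed
// by a single Array[Byte] tops out just under 2 GiB, regardless of how the
// stream tracks its write position.
val maxArrayBytes: Long = Int.MaxValue                    // 2147483647
val maxArrayGiB: Double = maxArrayBytes / math.pow(2, 30) // ~2.0
println(f"largest single Array[Byte]: $maxArrayBytes bytes = $maxArrayGiB%.3f GiB")
{code}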
[jira] [Commented] (SPARK-1441) Compile Spark Core error with Hadoop 0.23.x
[ https://issues.apache.org/jira/browse/SPARK-1441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13963016#comment-13963016 ] Sean Owen commented on SPARK-1441: -- I see, I had taken the yarn-alpha profile to be the slightly-misnamed profile you would use when building the whole project with 0.23, and building this distro means building everything. At least, that's a fairly fine solution, no? you should set yarn.version to 0.23.9 too. Compile Spark Core error with Hadoop 0.23.x --- Key: SPARK-1441 URL: https://issues.apache.org/jira/browse/SPARK-1441 Project: Spark Issue Type: Bug Components: Build Affects Versions: 1.0.0 Reporter: witgo Attachments: mvn.log, sbt.log -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1357) [MLLIB] Annotate developer and experimental API's
[ https://issues.apache.org/jira/browse/SPARK-1357?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13963999#comment-13963999 ] Sean Owen commented on SPARK-1357: -- I know I'm late to this party, but I just had a look and wanted to throw out a few last minute ideas. (Do you not want to just declare all of MLlib experimental? is it really 1.0? that's a fairly significant set of shackles to put on for a long time.) OK, that aside, I have two suggestions to mark as experimental: 1. ALS Rating object assumes users and items are Int. I suggest it will be eventually interesting to support String, or at least switch to Long. 2. Per old MLLIB-29, I feel pretty certain that ClassificationModel can't return RDD[Double], and will want to support returning a distribution over labels at some point. Similarly the input to it and RegressionModel seems like it will have to change to encompass something more than Vector to properly allow for categorical values. DecisionTreeModel has the same issue but is experimental (and doesn't integrate with these APIs?) The point is not so much whether one agrees with these, but whether there is a non-trivial chance of wanting to change something this year. Other parts that I'm interested in personally look pretty strong. Humbly submitted. [MLLIB] Annotate developer and experimental API's - Key: SPARK-1357 URL: https://issues.apache.org/jira/browse/SPARK-1357 Project: Spark Issue Type: Sub-task Components: MLlib Affects Versions: 1.0.0 Reporter: Patrick Wendell Assignee: Xiangrui Meng Fix For: 1.0.0 -- This message was sent by Atlassian JIRA (v6.2#6252)
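[Editor's sketch] To make point 2 above concrete, here is a rough sketch of the API shape difference at stake. Both traits are illustrative stand-ins rather than the actual MLlib interfaces; only the Vector and RDD types are the real Spark classes mentioned in the comment.
{code}
import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.linalg.Vector

// Today's shape: one Double (a predicted label) per input example.
trait HardClassificationModel {
  def predict(testData: RDD[Vector]): RDD[Double]
}

// The shape the comment anticipates wanting later: a distribution over labels,
// which cannot be retrofitted onto RDD[Double] without breaking the signature.
trait SoftClassificationModel {
  def predictProbabilities(testData: RDD[Vector]): RDD[Array[Double]]
}
{code}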
[jira] [Commented] (SPARK-1406) PMML model evaluation support via MLib
[ https://issues.apache.org/jira/browse/SPARK-1406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13964638#comment-13964638 ] Sean Owen commented on SPARK-1406: -- Yes I understand transformations can be described in PMML. Do you mean parsing a transformation described in PMML and implementing the transformation? Yes, that goes hand in hand with supporting import of a model in general. I would merely suggest this is a step that comes after several others in order of priority, like:
- implementing feature transformations in the abstract in the code base, separately from the idea of PMML
- implementing some form of model import via JPMML
- implementing more functionality in the Model classes, to give a reason to want to import an external model into MLlib
... and to me this is less useful at this point than export too. I say this because the power of MLlib/Spark right now is perceived to be model building, making it more producer than consumer at this stage. PMML model evaluation support via MLib -- Key: SPARK-1406 URL: https://issues.apache.org/jira/browse/SPARK-1406 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Thomas Darimont It would be useful if Spark provided support for the evaluation of PMML models (http://www.dmg.org/v4-2/GeneralStructure.html). This would allow analytical models created with a statistical modeling tool like R, SAS, SPSS, etc. to be used with Spark (MLlib), which would perform the actual model evaluation for a given input tuple. The PMML model would then just contain the parameterization of an analytical model. Other projects like JPMML-Evaluator do a similar thing. https://github.com/jpmml/jpmml/tree/master/pmml-evaluator -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1357) [MLLIB] Annotate developer and experimental API's
[ https://issues.apache.org/jira/browse/SPARK-1357?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13964649#comment-13964649 ] Sean Owen commented on SPARK-1357: -- Yeah I think it's reasonable to say that the core ALS API is only in terms of numeric IDs and leave a higher-level translation to the caller. Longs give that much more space to hash into. The cost in terms of memory of something like a String is just a reference, so roughly the same as a Double anyway. I think the more important question is whether Double is too hacky API-wise as a representation of fundamentally non-numeric data. That's up for debate, but yeah the question here is more about reserving the right to change. I'll submit a PR that marks the items I mention as experimental, for consideration. See if it seems reasonable. [MLLIB] Annotate developer and experimental API's - Key: SPARK-1357 URL: https://issues.apache.org/jira/browse/SPARK-1357 Project: Spark Issue Type: Sub-task Components: MLlib Affects Versions: 1.0.0 Reporter: Patrick Wendell Assignee: Xiangrui Meng Fix For: 1.0.0 -- This message was sent by Atlassian JIRA (v6.2#6252)
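[Editor's sketch] A tiny example of what "leave a higher-level translation to the caller" could look like if IDs were widened to Long; the hash function here is an arbitrary 64-bit polynomial hash used purely for illustration, and a real caller would also keep a reverse map to recover the original String IDs.
{code}
// Hypothetical helper: fold a String user/item ID into the Long space.
// Collisions are possible but far less likely than when hashing into an Int.
def idToLong(id: String): Long =
  id.foldLeft(1125899906842597L)((h, c) => 31 * h + c)

val userId  = idToLong("user:sean@example.org")
val movieId = idToLong("item:the-empire-strikes-back")
{code}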
[jira] [Commented] (SPARK-1437) Jenkins should build with Java 6
[ https://issues.apache.org/jira/browse/SPARK-1437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13967522#comment-13967522 ] Sean Owen commented on SPARK-1437: -- Pardon my boldness in pushing this onto your plate pwendell, but might be a very quick fix in Jenkins. If Travis CI is going to be activated, it can definitely be configured to build with Java 6 and 7 both. Jenkins should build with Java 6 Key: SPARK-1437 URL: https://issues.apache.org/jira/browse/SPARK-1437 Project: Spark Issue Type: Bug Components: Build Affects Versions: 0.9.0 Reporter: Sean Owen Assignee: Patrick Wendell Priority: Minor Labels: javac, jenkins Attachments: Screen Shot 2014-04-07 at 22.53.56.png Apologies if this was already on someone's to-do list, but I wanted to track this, as it bit two commits in the last few weeks. Spark is intended to work with Java 6, and so compiles with source/target 1.6. Java 7 can correctly enforce Java 6 language rules and emit Java 6 bytecode. However, unless otherwise configured with -bootclasspath, javac will use its own (Java 7) library classes. This means code that uses classes in Java 7 will be allowed to compile, but the result will fail when run on Java 6. This is why you get warnings like ... Using /usr/java/jdk1.7.0_51 as default JAVA_HOME. ... [warn] warning: [options] bootstrap class path not set in conjunction with -source 1.6 The solution is just to tell Jenkins to use Java 6. This may be stating the obvious, but it should just be a setting under Configure for SparkPullRequestBuilder. In our Jenkinses, JDK 6/7/8 are set up; if it's not an option already I'm guessing it's not too hard to get Java 6 configured on the Amplab machines. -- This message was sent by Atlassian JIRA (v6.2#6252)
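[Editor's sketch] A minimal illustration of the failure mode described above, as a hypothetical snippet rather than anything in the Spark source: java.nio.file.Files exists only in the Java 7 class library, so a build running on JDK 7 accepts the reference even when targeting Java 6, and running the result on a Java 6 JRE fails with NoClassDefFoundError.
{code}
import java.nio.file.{Files, Paths}

object Java7OnlyCall {
  def main(args: Array[String]): Unit = {
    // Compiles against JDK 7's library; no such class exists on a Java 6 runtime.
    println(Files.exists(Paths.get("/tmp")))
  }
}
{code}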
[jira] [Commented] (SPARK-1479) building spark on 2.0.0-cdh4.4.0 failed
[ https://issues.apache.org/jira/browse/SPARK-1479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13967784#comment-13967784 ] Sean Owen commented on SPARK-1479: -- yarn is the slightly more appropriate profile, but, read: https://github.com/apache/spark/pull/151 What Spark doesn't quite support is YARN beta and that's what you've got on your hands here. FWIW I am in favor of the change in this PR to make it all work. Soon, support for YARN alpha/beta can just go away anyway. If you are interested in CDH, the best thing is moving to CDH5, which already has Spark set up in standalone mode, and which has YARN stable. It also works with CDH 4.6 in standalone mode as a parcel. building spark on 2.0.0-cdh4.4.0 failed --- Key: SPARK-1479 URL: https://issues.apache.org/jira/browse/SPARK-1479 Project: Spark Issue Type: Question Environment: 2.0.0-cdh4.4.0 Scala code runner version 2.10.4 -- Copyright 2002-2013, LAMP/EPFL spark 0.9.1 java version 1.6.0_32 Reporter: jackielihf Attachments: mvn.log [INFO] [ERROR] Failed to execute goal net.alchim31.maven:scala-maven-plugin:3.1.5:compile (scala-compile-first) on project spark-yarn-alpha_2.10: Execution scala-compile-first of goal net.alchim31.maven:scala-maven-plugin:3.1.5:compile failed. CompileFailed - [Help 1] org.apache.maven.lifecycle.LifecycleExecutionException: Failed to execute goal net.alchim31.maven:scala-maven-plugin:3.1.5:compile (scala-compile-first) on project spark-yarn-alpha_2.10: Execution scala-compile-first of goal net.alchim31.maven:scala-maven-plugin:3.1.5:compile failed. at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:225) at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:153) at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:145) at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:84) at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:59) at org.apache.maven.lifecycle.internal.LifecycleStarter.singleThreadedBuild(LifecycleStarter.java:183) at org.apache.maven.lifecycle.internal.LifecycleStarter.execute(LifecycleStarter.java:161) at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:320) at org.apache.maven.DefaultMaven.execute(DefaultMaven.java:156) at org.apache.maven.cli.MavenCli.execute(MavenCli.java:537) at org.apache.maven.cli.MavenCli.doMain(MavenCli.java:196) at org.apache.maven.cli.MavenCli.main(MavenCli.java:141) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.codehaus.plexus.classworlds.launcher.Launcher.launchEnhanced(Launcher.java:290) at org.codehaus.plexus.classworlds.launcher.Launcher.launch(Launcher.java:230) at org.codehaus.plexus.classworlds.launcher.Launcher.mainWithExitCode(Launcher.java:409) at org.codehaus.plexus.classworlds.launcher.Launcher.main(Launcher.java:352) Caused by: org.apache.maven.plugin.PluginExecutionException: Execution scala-compile-first of goal net.alchim31.maven:scala-maven-plugin:3.1.5:compile failed. at org.apache.maven.plugin.DefaultBuildPluginManager.executeMojo(DefaultBuildPluginManager.java:110) at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:209) ... 
19 more Caused by: Compilation failed at sbt.compiler.AnalyzingCompiler.call(AnalyzingCompiler.scala:76) at sbt.compiler.AnalyzingCompiler.compile(AnalyzingCompiler.scala:35) at sbt.compiler.AnalyzingCompiler.compile(AnalyzingCompiler.scala:29) at sbt.compiler.AggressiveCompile$$anonfun$4$$anonfun$compileScala$1$1.apply$mcV$sp(AggressiveCompile.scala:71) at sbt.compiler.AggressiveCompile$$anonfun$4$$anonfun$compileScala$1$1.apply(AggressiveCompile.scala:71) at sbt.compiler.AggressiveCompile$$anonfun$4$$anonfun$compileScala$1$1.apply(AggressiveCompile.scala:71) at sbt.compiler.AggressiveCompile.sbt$compiler$AggressiveCompile$$timed(AggressiveCompile.scala:101) at sbt.compiler.AggressiveCompile$$anonfun$4.compileScala$1(AggressiveCompile.scala:70) at sbt.compiler.AggressiveCompile$$anonfun$4.apply(AggressiveCompile.scala:88) at sbt.compiler.AggressiveCompile$$anonfun$4.apply(AggressiveCompile.scala:60)
[jira] [Created] (SPARK-1488) Resolve scalac feature warnings during build
Sean Owen created SPARK-1488: Summary: Resolve scalac feature warnings during build Key: SPARK-1488 URL: https://issues.apache.org/jira/browse/SPARK-1488 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 0.9.0 Reporter: Sean Owen Priority: Minor

For your consideration: scalac currently notes a number of feature warnings during compilation:
{code}
[warn] there were 65 feature warning(s); re-run with -feature for details
{code}
Warnings are like:
{code}
[warn] /Users/srowen/Documents/spark/core/src/main/scala/org/apache/spark/SparkContext.scala:1261: implicit conversion method rddToPairRDDFunctions should be enabled
[warn] by making the implicit value scala.language.implicitConversions visible.
[warn] This can be achieved by adding the import clause 'import scala.language.implicitConversions'
[warn] or by setting the compiler option -language:implicitConversions.
[warn] See the Scala docs for value scala.language.implicitConversions for a discussion
[warn] why the feature should be explicitly enabled.
[warn] implicit def rddToPairRDDFunctions[K: ClassTag, V: ClassTag](rdd: RDD[(K, V)]) =
[warn] ^
{code}
scalac is suggesting that it's just best practice to explicitly enable certain language features by importing them where used. The accompanying PR simply adds the imports it suggests (and squashes one other Java warning along the way). This leaves just deprecation warnings in the build.

-- This message was sent by Atlassian JIRA (v6.2#6252)
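[Editor's sketch] For reference, a self-contained example of the kind of change the warning asks for; the names are made up for the example rather than taken from SparkContext.
{code}
import scala.language.implicitConversions   // silences the feature warning

object FeatureWarningExample {
  class Doubler(val n: Int) { def doubled: Int = 2 * n }

  // Without the import above (or -language:implicitConversions), scalac 2.10
  // reports the "implicit conversion method ... should be enabled" warning.
  implicit def intToDoubler(n: Int): Doubler = new Doubler(n)

  def main(args: Array[String]): Unit = println(3.doubled)   // prints 6
}
{code}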
[jira] [Commented] (SPARK-1439) Aggregate Scaladocs across projects
[ https://issues.apache.org/jira/browse/SPARK-1439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13971662#comment-13971662 ] Sean Owen commented on SPARK-1439: -- I had a run at this today. First I tried Maven-based formulas, but they didn't quite do the trick. I made some progress with unidoc, although not all the way. Maybe an SBT expert can help me figure out how to finish it.

*Maven*
http://stackoverflow.com/questions/12301620/how-to-generate-an-aggregated-scaladoc-for-a-maven-site
This works, but, generates *javadoc* for everything, including Scala source. The resulting javadoc is not so helpful. It also complains a lot about not finding references since javadoc doesn't quite understand links in the same way.

*Maven #2*
You can also invoke the scala-maven-plugin 'doc' goal as part of the site generation:
{code:xml}
<reporting>
  <plugins>
    ...
    <plugin>
      <groupId>net.alchim31.maven</groupId>
      <artifactId>scala-maven-plugin</artifactId>
      <reportSets>
        <reportSet>
          <reports>
            <report>doc</report>
          </reports>
        </reportSet>
      </reportSets>
    </plugin>
  </plugins>
</reporting>
{code}
It lacks a goal like aggregate that the javadoc plugin has, which takes care of combining everything into one set of docs. This only generates scaladoc in each module in exploded format.

*Unidoc / SBT*
It is almost as easy as:
- adding the plugin to plugins.sbt: {{addSbtPlugin("com.eed3si9n" % "sbt-unidoc" % "0.3.0")}}
- {{import sbtunidoc.Plugin._}} and {{UnidocKeys._}} in SparkBuild.scala
- adding ++ unidocSettings to rootSettings in SparkBuild.scala
but it was also necessary to:
- set {{SPARK_YARN=true}} and {{SPARK_HADOOP_VERSION=2.2.0}}, for example, to make YARN scaladoc work
- exclude {{yarn-alpha}} since scaladoc doesn't like the collision of class names:
{code}
def rootSettings = sharedSettings ++ unidocSettings ++ Seq(
  unidocProjectFilter in (ScalaUnidoc, unidoc) := inAnyProject -- inProjects(yarnAlpha),
  publish := {}
)
{code}
I still get SBT errors since I think this is not quite correctly finessing the build. But it seems almost there.

Aggregate Scaladocs across projects --- Key: SPARK-1439 URL: https://issues.apache.org/jira/browse/SPARK-1439 Project: Spark Issue Type: Sub-task Components: Documentation Reporter: Matei Zaharia Fix For: 1.0.0 Apparently there's a Unidoc plugin to put together ScalaDocs across modules: https://github.com/akka/akka/blob/master/project/Unidoc.scala -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1464) Update MLLib Examples to Use Breeze
[ https://issues.apache.org/jira/browse/SPARK-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13972342#comment-13972342 ] Sean Owen commented on SPARK-1464: -- This is a duplicate of https://issues.apache.org/jira/browse/SPARK-1462 which is resolved now. Update MLLib Examples to Use Breeze --- Key: SPARK-1464 URL: https://issues.apache.org/jira/browse/SPARK-1464 Project: Spark Issue Type: Task Components: MLlib Reporter: Patrick Wendell Assignee: Xiangrui Meng Priority: Blocker Fix For: 1.0.0 If we want to deprecate the vector class we need to update all of the examples to use Breeze. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1520) Inclusion of breeze corrupts assembly when compiled with JDK7 and run on JDK6
[ https://issues.apache.org/jira/browse/SPARK-1520?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13972397#comment-13972397 ] Sean Owen commented on SPARK-1520: -- Madness. One wild guess is that the breeze .jar files have something in META-INF that, when merged together into the assembly jar, conflicts with other META-INF items. In particular I'm thinking of MANIFEST.MF entries. It's worth diffing those if you can from before and after. However this would still require that Java 7 and 6 behave differently with respect to the entries, to explain your findings. It's possible. Your last comment however suggests it's something strange with the byte code that gets output for a few classes. Java 7 is stricter about byte code. For example: https://weblogs.java.net/blog/fabriziogiudici/archive/2012/05/07/understanding-subtle-new-behaviours-jdk-7 However I would think these would manifest as quite different errors. What about running with -verbose:class to print classloading messages? it might point directly to what's failing to load, if that's it. Of course you can always build with Java 6 since that's supposed to be all that's supported/required now (see my other JIRA about making Jenkins do this), although I agree that it would be nice to get to the bottom of this, as there is no obvious reason this shouldn't work. Inclusion of breeze corrupts assembly when compiled with JDK7 and run on JDK6 - Key: SPARK-1520 URL: https://issues.apache.org/jira/browse/SPARK-1520 Project: Spark Issue Type: Bug Components: MLlib, Spark Core Reporter: Patrick Wendell Priority: Blocker Fix For: 1.0.0 This is a real doozie - when compiling a Spark assembly with JDK7, the produced jar does not work well with JRE6. I confirmed the byte code being produced is JDK 6 compatible (major version 50). What happens is that, silently, the JRE will not load any class files from the assembled jar. {code} $ sbt/sbt assembly/assembly $ /usr/lib/jvm/java-1.7.0-openjdk-amd64/bin/java -cp /home/patrick/Documents/spark/assembly/target/scala-2.10/spark-assembly-1.0.0-SNAPSHOT-hadoop1.0.4.jar org.apache.spark.ui.UIWorkloadGenerator usage: ./bin/spark-class org.apache.spark.ui.UIWorkloadGenerator [master] [FIFO|FAIR] $ /usr/lib/jvm/java-1.6.0-openjdk-amd64/bin/java -cp /home/patrick/Documents/spark/assembly/target/scala-2.10/spark-assembly-1.0.0-SNAPSHOT-hadoop1.0.4.jar org.apache.spark.ui.UIWorkloadGenerator Exception in thread main java.lang.NoClassDefFoundError: org/apache/spark/ui/UIWorkloadGenerator Caused by: java.lang.ClassNotFoundException: org.apache.spark.ui.UIWorkloadGenerator at java.net.URLClassLoader$1.run(URLClassLoader.java:217) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:205) at java.lang.ClassLoader.loadClass(ClassLoader.java:323) at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:294) at java.lang.ClassLoader.loadClass(ClassLoader.java:268) Could not find the main class: org.apache.spark.ui.UIWorkloadGenerator. Program will exit. {code} I also noticed that if the jar is unzipped, and the classpath set to the currently directory, it just works. Finally, if the assembly jar is compiled with JDK6, it also works. The error is seen with any class, not just the UIWorkloadGenerator. Also, this error doesn't exist in branch 0.9, only in master. 
*Isolation* -I ran a git bisection and this appeared after the MLLib sparse vector patch was merged:- https://github.com/apache/spark/commit/80c29689ae3b589254a571da3ddb5f9c866ae534 SPARK-1212 -I narrowed this down specifically to the inclusion of the breeze library. Just adding breeze to an older (unaffected) build triggered the issue.- I've found that if I just unpack and re-pack the jar, it sometimes works: {code} $ cd assembly/target/scala-2.10/ $ /usr/lib/jvm/java-1.6.0-openjdk-amd64/bin/java -cp ./spark-assembly-1.0.0-SNAPSHOT-hadoop1.0.4.jar org.apache.spark.ui.UIWorkloadGenerator # fails $ jar xvf spark-assembly-1.0.0-SNAPSHOT-hadoop1.0.4.jar $ jar cvf spark-assembly-1.0.0-SNAPSHOT-hadoop1.0.4.jar * $ /usr/lib/jvm/java-1.6.0-openjdk-amd64/bin/java -cp ./spark-assembly-1.0.0-SNAPSHOT-hadoop1.0.4.jar org.apache.spark.ui.UIWorkloadGenerator # succeeds {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1520) Inclusion of breeze corrupts assembly when compiled with JDK7 and run on JDK6
[ https://issues.apache.org/jira/browse/SPARK-1520?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13972523#comment-13972523 ] Sean Owen commented on SPARK-1520: -- Regarding large numbers of files: are there INDEX.LST files used anywhere in the jars? If this gets munged or truncated while building the assembly jar, that might cause all kinds of havoc. It could be omitted. http://docs.oracle.com/javase/7/docs/technotes/guides/jar/jar.html#Index_File_Specification Inclusion of breeze corrupts assembly when compiled with JDK7 and run on JDK6 - Key: SPARK-1520 URL: https://issues.apache.org/jira/browse/SPARK-1520 Project: Spark Issue Type: Bug Components: MLlib, Spark Core Reporter: Patrick Wendell Priority: Blocker Fix For: 1.0.0 This is a real doozie - when compiling a Spark assembly with JDK7, the produced jar does not work well with JRE6. I confirmed the byte code being produced is JDK 6 compatible (major version 50). What happens is that, silently, the JRE will not load any class files from the assembled jar. {code} $ sbt/sbt assembly/assembly $ /usr/lib/jvm/java-1.7.0-openjdk-amd64/bin/java -cp /home/patrick/Documents/spark/assembly/target/scala-2.10/spark-assembly-1.0.0-SNAPSHOT-hadoop1.0.4.jar org.apache.spark.ui.UIWorkloadGenerator usage: ./bin/spark-class org.apache.spark.ui.UIWorkloadGenerator [master] [FIFO|FAIR] $ /usr/lib/jvm/java-1.6.0-openjdk-amd64/bin/java -cp /home/patrick/Documents/spark/assembly/target/scala-2.10/spark-assembly-1.0.0-SNAPSHOT-hadoop1.0.4.jar org.apache.spark.ui.UIWorkloadGenerator Exception in thread main java.lang.NoClassDefFoundError: org/apache/spark/ui/UIWorkloadGenerator Caused by: java.lang.ClassNotFoundException: org.apache.spark.ui.UIWorkloadGenerator at java.net.URLClassLoader$1.run(URLClassLoader.java:217) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:205) at java.lang.ClassLoader.loadClass(ClassLoader.java:323) at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:294) at java.lang.ClassLoader.loadClass(ClassLoader.java:268) Could not find the main class: org.apache.spark.ui.UIWorkloadGenerator. Program will exit. {code} I also noticed that if the jar is unzipped, and the classpath set to the currently directory, it just works. Finally, if the assembly jar is compiled with JDK6, it also works. The error is seen with any class, not just the UIWorkloadGenerator. Also, this error doesn't exist in branch 0.9, only in master. *Isolation* -I ran a git bisection and this appeared after the MLLib sparse vector patch was merged:- https://github.com/apache/spark/commit/80c29689ae3b589254a571da3ddb5f9c866ae534 SPARK-1212 -I narrowed this down specifically to the inclusion of the breeze library. Just adding breeze to an older (unaffected) build triggered the issue.- I've found that if I just unpack and re-pack the jar (using `jar` from java 6 or 7) it always works: {code} $ cd assembly/target/scala-2.10/ $ /usr/lib/jvm/java-1.6.0-openjdk-amd64/bin/java -cp ./spark-assembly-1.0.0-SNAPSHOT-hadoop1.0.4.jar org.apache.spark.ui.UIWorkloadGenerator # fails $ jar xvf spark-assembly-1.0.0-SNAPSHOT-hadoop1.0.4.jar $ jar cvf spark-assembly-1.0.0-SNAPSHOT-hadoop1.0.4.jar * $ /usr/lib/jvm/java-1.6.0-openjdk-amd64/bin/java -cp ./spark-assembly-1.0.0-SNAPSHOT-hadoop1.0.4.jar org.apache.spark.ui.UIWorkloadGenerator # succeeds {code} I also noticed something of note. The Breeze package contains single directories that have huge numbers of files in them (e.g. 
2000+ class files in one directory). It's possible we are hitting some weird bugs/corner cases with compatibility of the internal storage format of the jar itself. -- This message was sent by Atlassian JIRA (v6.2#6252)
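[Editor's sketch] One way to check both hypotheses above (signing entries in META-INF, and a stray INDEX.LST) from the JVM itself; this is a small illustrative utility, not anything in the Spark build, and it simply lists the META-INF entries of whatever jar path it is given.
{code}
import java.util.jar.JarFile
import scala.collection.JavaConverters._

object ListMetaInf {
  def main(args: Array[String]): Unit = {
    val jar = new JarFile(args(0))   // e.g. the assembly jar
    try {
      jar.entries().asScala
        .filter(_.getName.startsWith("META-INF/"))
        .foreach(e => println(s"${e.getName}\t${e.getSize} bytes"))
      // Signature files (*.SF, *.RSA/*.DSA), MANIFEST.MF and any INDEX.LST all
      // live under META-INF/, so diffing this output before and after assembly
      // shows what the merge actually did.
    } finally jar.close()
  }
}
{code}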
[jira] [Commented] (SPARK-1527) rootDirs in DiskBlockManagerSuite doesn't get full path from rootDir0, rootDir1
[ https://issues.apache.org/jira/browse/SPARK-1527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13973067#comment-13973067 ] Sean Owen commented on SPARK-1527: -- {{toString()}} returns {{getPath()}} which may still be relative. {{getAbsolutePath()}} is better, but even {{getCanonicalPath()}} may be better still. rootDirs in DiskBlockManagerSuite doesn't get full path from rootDir0, rootDir1 --- Key: SPARK-1527 URL: https://issues.apache.org/jira/browse/SPARK-1527 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 0.9.0 Reporter: Ye Xianjin Priority: Minor Labels: starter Original Estimate: 24h Remaining Estimate: 24h In core/src/test/scala/org/apache/storage/DiskBlockManagerSuite.scala val rootDir0 = Files.createTempDir() rootDir0.deleteOnExit() val rootDir1 = Files.createTempDir() rootDir1.deleteOnExit() val rootDirs = rootDir0.getName + , + rootDir1.getName rootDir0 and rootDir1 are in system's temporary directory. rootDir0.getName will not get the full path of the directory but the last component of the directory. When passing to DiskBlockManage constructor, the DiskBlockerManger creates directories in pwd not the temporary directory. rootDir0.toString will fix this issue. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1527) rootDirs in DiskBlockManagerSuite doesn't get full path from rootDir0, rootDir1
[ https://issues.apache.org/jira/browse/SPARK-1527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13973091#comment-13973091 ] Sean Owen commented on SPARK-1527: -- If the paths are only used locally, then an absolute path never hurts (except to be a bit longer). I assume that since these are references to a temp directory that is by definition only valid locally, that absolute path is the right thing to use. In other cases, similar logic may apply. I could imagine in some cases the right thing to do is transmit a relative path. rootDirs in DiskBlockManagerSuite doesn't get full path from rootDir0, rootDir1 --- Key: SPARK-1527 URL: https://issues.apache.org/jira/browse/SPARK-1527 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 0.9.0 Reporter: Ye Xianjin Priority: Minor Labels: starter Original Estimate: 24h Remaining Estimate: 24h In core/src/test/scala/org/apache/storage/DiskBlockManagerSuite.scala val rootDir0 = Files.createTempDir() rootDir0.deleteOnExit() val rootDir1 = Files.createTempDir() rootDir1.deleteOnExit() val rootDirs = rootDir0.getName + , + rootDir1.getName rootDir0 and rootDir1 are in system's temporary directory. rootDir0.getName will not get the full path of the directory but the last component of the directory. When passing to DiskBlockManage constructor, the DiskBlockerManger creates directories in pwd not the temporary directory. rootDir0.toString will fix this issue. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1527) rootDirs in DiskBlockManagerSuite doesn't get full path from rootDir0, rootDir1
[ https://issues.apache.org/jira/browse/SPARK-1527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13973148#comment-13973148 ] Sean Owen commented on SPARK-1527: -- There are a number of other uses of File.getName(), but a quick glance suggests all the others are appropriate. There are a number of other uses of File.toString(), almost all in tests. I suspect the Files in question already have absolute paths, and that even relative paths happen to work fine in a test since the working dir doesn't change. So those could change, but are probably not a concern. The only one that gave me pause was the use in HttpBroadcast.scala, though I suspect it turns out to work fine for similar reasons. If reviewers are interested in changing the toString()s I'll test and submit a PR for that. rootDirs in DiskBlockManagerSuite doesn't get full path from rootDir0, rootDir1 --- Key: SPARK-1527 URL: https://issues.apache.org/jira/browse/SPARK-1527 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 0.9.0 Reporter: Ye Xianjin Priority: Minor Labels: starter Original Estimate: 24h Remaining Estimate: 24h In core/src/test/scala/org/apache/storage/DiskBlockManagerSuite.scala val rootDir0 = Files.createTempDir() rootDir0.deleteOnExit() val rootDir1 = Files.createTempDir() rootDir1.deleteOnExit() val rootDirs = rootDir0.getName + , + rootDir1.getName rootDir0 and rootDir1 are in system's temporary directory. rootDir0.getName will not get the full path of the directory but the last component of the directory. When passing to DiskBlockManage constructor, the DiskBlockerManger creates directories in pwd not the temporary directory. rootDir0.toString will fix this issue. -- This message was sent by Atlassian JIRA (v6.2#6252)
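[Editor's sketch] A quick demonstration of the distinction drawn in these comments, using the same Guava Files.createTempDir() call as the test under discussion; the paths shown in comments are examples only.
{code}
import com.google.common.io.Files

object TempDirPaths {
  def main(args: Array[String]): Unit = {
    val dir = Files.createTempDir()
    dir.deleteOnExit()
    println(dir.getName)           // last component only, e.g. "1398712345678-0"
    println(dir.toString)          // same as getPath(); not guaranteed to be absolute
    println(dir.getAbsolutePath)   // e.g. "/tmp/1398712345678-0"
    println(dir.getCanonicalPath)  // absolute, with symlinks and ".." resolved
  }
}
{code}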
[jira] [Commented] (SPARK-1556) jets3t dependency is outdated
[ https://issues.apache.org/jira/browse/SPARK-1556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13975957#comment-13975957 ] Sean Owen commented on SPARK-1556: -- Actually, why does Spark have a direct dependency on jets3t at all? it is not used directly in the code. If it's only needed at runtime, it can/should be declared that way. But if the reason it's there is just for Hadoop, then of course hadoop-client is already bringing it in, and should be allowed to bring in the version it wants. jets3t dependency is outdated - Key: SPARK-1556 URL: https://issues.apache.org/jira/browse/SPARK-1556 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 0.8.1, 0.9.0, 1.0.0 Reporter: Nan Zhu Assignee: Nan Zhu Fix For: 1.0.0 In Hadoop 2.2.x or newer, Jet3st 0.9.0 which defines S3ServiceException/ServiceException is introduced, however, Spark still relies on Jet3st 0.7.x which has no definition of these classes What I met is that [code] 14/04/21 19:30:53 INFO deprecation: mapred.job.id is deprecated. Instead, use mapreduce.job.id 14/04/21 19:30:53 INFO deprecation: mapred.tip.id is deprecated. Instead, use mapreduce.task.id 14/04/21 19:30:53 INFO deprecation: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id 14/04/21 19:30:53 INFO deprecation: mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap 14/04/21 19:30:53 INFO deprecation: mapred.task.partition is deprecated. Instead, use mapreduce.task.partition java.lang.NoClassDefFoundError: org/jets3t/service/S3ServiceException at org.apache.hadoop.fs.s3native.NativeS3FileSystem.createDefaultStore(NativeS3FileSystem.java:280) at org.apache.hadoop.fs.s3native.NativeS3FileSystem.initialize(NativeS3FileSystem.java:270) at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2316) at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:90) at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2350) at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2332) at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:369) at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296) at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:221) at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:270) at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:140) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:207) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:205) at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:207) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:205) at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:207) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:205) at org.apache.spark.SparkContext.runJob(SparkContext.scala:891) at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopDataset(PairRDDFunctions.scala:741) at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:692) at 
org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:574) at org.apache.spark.rdd.RDD.saveAsTextFile(RDD.scala:900) at $iwC$$iwC$$iwC$$iwC.init(console:15) at $iwC$$iwC$$iwC.init(console:20) at $iwC$$iwC.init(console:22) at $iwC.init(console:24) at init(console:26) at .init(console:30) at .clinit(console) at .init(console:7) at .clinit(console) at $print(console) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:772) at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1040) at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:609) at
[jira] [Commented] (SPARK-1556) jets3t dependency is outdated
[ https://issues.apache.org/jira/browse/SPARK-1556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13975967#comment-13975967 ] Sean Owen commented on SPARK-1556: -- OK, I partly eat my words. jets3t isn't included by the Hadoop client library it appears. It's only included by the Hadoop server-side components. So yeah Spark has to include jets3t to make s3:// URLs work in the REPL. FWIW I agree with updating the version -- ideally just in the Hadoop 2.2+ profiles. And it should be scoperuntime/scope jets3t dependency is outdated - Key: SPARK-1556 URL: https://issues.apache.org/jira/browse/SPARK-1556 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 0.8.1, 0.9.0, 1.0.0 Reporter: Nan Zhu Assignee: Nan Zhu Fix For: 1.0.0 In Hadoop 2.2.x or newer, Jet3st 0.9.0 which defines S3ServiceException/ServiceException is introduced, however, Spark still relies on Jet3st 0.7.x which has no definition of these classes What I met is that [code] 14/04/21 19:30:53 INFO deprecation: mapred.job.id is deprecated. Instead, use mapreduce.job.id 14/04/21 19:30:53 INFO deprecation: mapred.tip.id is deprecated. Instead, use mapreduce.task.id 14/04/21 19:30:53 INFO deprecation: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id 14/04/21 19:30:53 INFO deprecation: mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap 14/04/21 19:30:53 INFO deprecation: mapred.task.partition is deprecated. Instead, use mapreduce.task.partition java.lang.NoClassDefFoundError: org/jets3t/service/S3ServiceException at org.apache.hadoop.fs.s3native.NativeS3FileSystem.createDefaultStore(NativeS3FileSystem.java:280) at org.apache.hadoop.fs.s3native.NativeS3FileSystem.initialize(NativeS3FileSystem.java:270) at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2316) at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:90) at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2350) at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2332) at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:369) at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296) at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:221) at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:270) at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:140) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:207) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:205) at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:207) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:205) at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:207) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:205) at org.apache.spark.SparkContext.runJob(SparkContext.scala:891) at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopDataset(PairRDDFunctions.scala:741) at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:692) at 
org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:574) at org.apache.spark.rdd.RDD.saveAsTextFile(RDD.scala:900) at $iwC$$iwC$$iwC$$iwC.init(console:15) at $iwC$$iwC$$iwC.init(console:20) at $iwC$$iwC.init(console:22) at $iwC.init(console:24) at init(console:26) at .init(console:30) at .clinit(console) at .init(console:7) at .clinit(console) at $print(console) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:772) at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1040) at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:609) at
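[Editor's sketch] The runtime-scope suggestion above, sketched in sbt terms; the Maven change would be the equivalent <scope>runtime</scope> on the jets3t dependency, and the 0.9.0 version shown matches the Hadoop 2.2+ case discussed here rather than being a tested pin.
{code}
// Needed only at runtime (for s3:// URLs handled by Hadoop's NativeS3FileSystem),
// so it is kept off the compile classpath.
libraryDependencies += "net.java.dev.jets3t" % "jets3t" % "0.9.0" % "runtime"
{code}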
[jira] [Commented] (SPARK-1605) Improve mllib.linalg.Vector
[ https://issues.apache.org/jira/browse/SPARK-1605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13979380#comment-13979380 ] Sean Owen commented on SPARK-1605: -- I think this was on purpose, to try to hide breeze as an implementation detail, at least in public APIs? Improve mllib.linalg.Vector --- Key: SPARK-1605 URL: https://issues.apache.org/jira/browse/SPARK-1605 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Sandeep Singh We can make current Vector a wrapper around Breeze.linalg.Vector ? I want to work on this. -- This message was sent by Atlassian JIRA (v6.2#6252)
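[Editor's sketch] A minimal sketch of the "hide breeze as an implementation detail" pattern the comment refers to; the package and class names are invented for the example and are not the actual MLlib internals.
{code}
package example.mllib   // hypothetical package, so private[mllib] has something to refer to

import breeze.linalg.{DenseVector => BDV, Vector => BV}

trait Vector extends Serializable {
  def toArray: Array[Double]
  private[mllib] def toBreeze: BV[Double]   // breeze never appears in the public API
}

class DenseVector(val values: Array[Double]) extends Vector {
  override def toArray: Array[Double] = values
  override private[mllib] def toBreeze: BV[Double] = new BDV[Double](values)
}
{code}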
[jira] [Commented] (SPARK-1629) Spark Core missing commons-lang dependence
[ https://issues.apache.org/jira/browse/SPARK-1629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13980878#comment-13980878 ] Sean Owen commented on SPARK-1629: -- I don't see any usage of Commons Lang in the whole project? Tachyon uses commons-lang3 but it also brings it in as a dependency. Spark Core missing commons-lang dependence --- Key: SPARK-1629 URL: https://issues.apache.org/jira/browse/SPARK-1629 Project: Spark Issue Type: Bug Reporter: witgo -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1638) Executors fail to come up if spark.executor.extraJavaOptions is set
[ https://issues.apache.org/jira/browse/SPARK-1638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13983156#comment-13983156 ] Sean Owen commented on SPARK-1638: -- Almost certainly a duplicate of https://issues.apache.org/jira/browse/SPARK-1609 Executors fail to come up if spark.executor.extraJavaOptions is set -- Key: SPARK-1638 URL: https://issues.apache.org/jira/browse/SPARK-1638 Project: Spark Issue Type: Bug Components: Spark Core Environment: Bring up a cluster in EC2 using spark-ec2 scripts Reporter: Kalpit Shah Fix For: 1.0.0 If you try to launch a PySpark shell with spark.executor.extraJavaOptions set to -XX:+UseCompressedOops -XX:+UseCompressedStrings -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps, the executors never come up on any of the workers. I see the following error in log file : Spark Executor Command: /usr/lib/jvm/java/bin/java -cp /root/c3/lib/*::/root/ephemeral-hdfs/conf:/root/spark/conf:/root/spark/assembly/target/scala-2.10/spark-assembly-1.0.0-SNAPSHOT-hadoop1.0.4.jar: -XX:+UseCompressedOops -XX:+UseCompressedStrings -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xms13312M -Xmx13312M org.apache.spark.executor.CoarseGrainedExecutorBackend akka.tcp://spark@HOSTNAME:45429/user/CoarseGrainedScheduler 7 HOSTNAME 4 akka.tcp://sparkWorker@HOSTNAME:39727/user/Worker app-20140423224526- Unrecognized VM option 'UseCompressedOops -XX:+UseCompressedStrings -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps' Error: Could not create the Java Virtual Machine. Error: A fatal exception has occurred. Program will exit. -- This message was sent by Atlassian JIRA (v6.2#6252)
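[Editor's sketch] One plausible reading of the quoted error, sketched for illustration (the actual fix lives in the duplicate issue): the whole spark.executor.extraJavaOptions string is handed to the JVM as a single argument instead of one flag per argument, which is what "Unrecognized VM option" followed by the rest of the string suggests.
{code}
val extraJavaOpts = "-XX:+UseCompressedOops -verbose:gc -XX:+PrintGCDetails"

// Wrong: one argument containing spaces; the JVM sees a single unknown option.
val brokenArgs  = Seq(extraJavaOpts)

// Right: split on whitespace so each flag becomes its own argument.
val workingArgs = extraJavaOpts.split("\\s+").toSeq
{code}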
[jira] [Created] (SPARK-1663) Spark Streaming docs code has several small errors
Sean Owen created SPARK-1663: Summary: Spark Streaming docs code has several small errors Key: SPARK-1663 URL: https://issues.apache.org/jira/browse/SPARK-1663 Project: Spark Issue Type: Bug Components: Documentation Affects Versions: 0.9.1 Reporter: Sean Owen Priority: Minor The changes are easiest to elaborate in the PR, which I will open shortly. Those changes raised a few little questions about the API too. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1663) Spark Streaming docs code has several small errors
[ https://issues.apache.org/jira/browse/SPARK-1663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13984466#comment-13984466 ] Sean Owen commented on SPARK-1663: -- PR: https://github.com/apache/spark/pull/589 Spark Streaming docs code has several small errors -- Key: SPARK-1663 URL: https://issues.apache.org/jira/browse/SPARK-1663 Project: Spark Issue Type: Bug Components: Documentation Affects Versions: 0.9.1 Reporter: Sean Owen Priority: Minor Labels: streaming The changes are easiest to elaborate in the PR, which I will open shortly. Those changes raised a few little questions about the API too. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1684) Merge script should standardize SPARK-XXX prefix
[ https://issues.apache.org/jira/browse/SPARK-1684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13985770#comment-13985770 ] Sean Owen commented on SPARK-1684: -- (Can I be pedantic and suggest standardizing to SPARK-XXX ? this is how issues are reported in other projects, like HIVE-123 and MAPREDUCE-234) Merge script should standardize SPARK-XXX prefix Key: SPARK-1684 URL: https://issues.apache.org/jira/browse/SPARK-1684 Project: Spark Issue Type: Improvement Components: Project Infra Reporter: Patrick Wendell Assignee: Patrick Wendell Priority: Minor Fix For: 1.1.0 If users write [SPARK-XXX] Issue or SPARK-XXX. Issue or SPARK XXX: Issue we should convert it to SPARK XXX: Issue -- This message was sent by Atlassian JIRA (v6.2#6252)
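The merge script itself is Python, but the normalization rule under discussion is small enough to sketch; here it is in Scala, standardized on the hyphenated SPARK-XXX form suggested in the comment above (illustrative only, not the script's actual code):
{code}
// Normalize "[SPARK-1684] Title", "SPARK-1684. Title" or "SPARK 1684: Title"
// to "SPARK-1684: Title"; anything without a recognizable prefix is left alone.
def normalizeTitle(title: String): String = {
  val Prefix = """(?i)^\[?SPARK[ -](\d+)\]?[:.]?\s*(.*)$""".r
  title.trim match {
    case Prefix(num, rest) => s"SPARK-$num: $rest"
    case other             => other
  }
}
{code}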
[jira] [Commented] (SPARK-1693) Most of the tests throw a java.lang.SecurityException when spark built for hadoop 2.3.0 , 2.4.0
[ https://issues.apache.org/jira/browse/SPARK-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13986471#comment-13986471 ] Sean Owen commented on SPARK-1693: -- I suspect this occurs because two copies of the servlet API jars are included from two sources, and one of those sources includes jar signing information in its manifest. The resulting merged jar has collisions in the signing information and it no longer matches. If that's right, the fastest way to avoid this is usually to drop signing information that is in the manifest since it is not helpful in the assembly jar. Of course it's ideal to avoid merging two copies of the same dependencies, since only one can be included, and that's why we see some [warn] in the build. In just about all cases it is harmless since they are actually copies of the same version of the same classes. I will look into what ends up in the manifest. Most of the tests throw a java.lang.SecurityException when spark built for hadoop 2.3.0 , 2.4.0 Key: SPARK-1693 URL: https://issues.apache.org/jira/browse/SPARK-1693 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Guoqiang Li Assignee: Guoqiang Li Attachments: log.txt {code}mvn test -Pyarn -Dhadoop.version=2.4.0 -Dyarn.version=2.4.0 log.txt{code} The log: {code} UnpersistSuite: - unpersist RDD *** FAILED *** java.lang.SecurityException: class javax.servlet.FilterRegistration's signer information does not match signer information of other classes in the same package at java.lang.ClassLoader.checkCerts(ClassLoader.java:952) at java.lang.ClassLoader.preDefineClass(ClassLoader.java:666) at java.lang.ClassLoader.defineClass(ClassLoader.java:794) at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142) at java.net.URLClassLoader.defineClass(URLClassLoader.java:449) at java.net.URLClassLoader.access$100(URLClassLoader.java:71) at java.net.URLClassLoader$1.run(URLClassLoader.java:361) at java.net.URLClassLoader$1.run(URLClassLoader.java:355) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:354) {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
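As a minimal sketch of the "drop signing information" idea, assuming the sbt-assembly plugin (the Maven build could do the equivalent with shade-plugin filters), the merge can simply discard META-INF signature files:
{code}
// Sketch only: discard jar-signing metadata (*.SF / *.DSA / *.RSA under
// META-INF) while merging jars into the assembly, so stale signature entries
// can no longer mismatch the merged classes. Key names vary by plugin version.
assembly / assemblyMergeStrategy := {
  case PathList("META-INF", xs @ _*)
      if xs.lastOption.exists(n =>
        n.endsWith(".SF") || n.endsWith(".DSA") || n.endsWith(".RSA")) =>
    MergeStrategy.discard
  case other =>
    MergeStrategy.defaultMergeStrategy(other)
}
{code}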
[jira] [Commented] (SPARK-1693) Most of the tests throw a java.lang.SecurityException when spark built for hadoop 2.3.0 , 2.4.0
[ https://issues.apache.org/jira/browse/SPARK-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13986490#comment-13986490 ] Sean Owen commented on SPARK-1693: -- I think this is traceable to a case of jar conflict. I am not sure whether the ultimate cause is signing, but it doesn't matter, since we should simply resolve the conflict. (But I think something like that may be at play, since one of the problem dependencies is from Eclipse's jetty, and there is an Eclipse cert in the manifest at META-INF/ECLIPSEF.RSA...) Anyway. This is another fun jar hell puzzler, albeit one with a probable solution. The basic issue is that Jetty brings in the Servlet 3.0 API: {code} [INFO] | +- org.eclipse.jetty:jetty-server:jar:8.1.14.v20131031:compile [INFO] | | +- org.eclipse.jetty.orbit:javax.servlet:jar:3.0.0.v201112011016:compile {code} ... but in Hadoop 2.3.0+, so does Hadoop client: {code} [INFO] | +- org.apache.hadoop:hadoop-client:jar:2.4.0:compile ... [INFO] | | +- org.apache.hadoop:hadoop-mapreduce-client-core:jar:2.4.0:compile [INFO] | | | \- org.apache.hadoop:hadoop-yarn-common:jar:2.4.0:compile [INFO] | | | +- javax.xml.bind:jaxb-api:jar:2.2.2:compile ... [INFO] | | | +- javax.servlet:servlet-api:jar:2.5:compile {code} Eclipse is naughty for packaging the same API classes in a different artifact, rather than just using javax.servlet:servlet-api 3.0. There may be a reason for that, which is what worries me. In theory Servlet 3.0 is a superset of 2.5, so Hadoop's client should be happy with the 3.0 API on the classpath. So solution #1 to try is to exclude javax.servlet:servlet-api from the hadoop-client dependency. It won't affect earlier versions at all since they don't have this dependency, and therefore need not even be version-specific. Hadoop 2.3+ then in theory should happily find Eclipse's Servlet 3.0 API classes and work as normal. I give that about a 90% chance of being true. Solution #2 is more theoretically sound. We should really exclude Jetty's custom copy of Servlet 3.0, and depend on javax.servlet:servlet-api:jar:3.0 as a runtime dependency. This should transparently override Hadoop 2.3+'s version, and still work for Hadoop. Messing with Jetty increases the chance of another snag. Probability of success: 80%. Li, would you be able to try either of those ideas? 
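A minimal sketch of solution #1 in SBT form (the Maven pom would use an equivalent exclusion on hadoop-client); the coordinates and version are the ones from the dependency tree above:
{code}
// Solution #1, sketched: drop Hadoop's Servlet 2.5 API so that Jetty's
// Servlet 3.0 API is the only copy of those classes on the classpath.
libraryDependencies += "org.apache.hadoop" % "hadoop-client" % "2.4.0" exclude("javax.servlet", "servlet-api")
{code}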
Most of the tests throw a java.lang.SecurityException when spark built for hadoop 2.3.0 , 2.4.0 Key: SPARK-1693 URL: https://issues.apache.org/jira/browse/SPARK-1693 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Guoqiang Li Assignee: Guoqiang Li Attachments: log.txt {code}mvn test -Pyarn -Dhadoop.version=2.4.0 -Dyarn.version=2.4.0 log.txt{code} The log: {code} UnpersistSuite: - unpersist RDD *** FAILED *** java.lang.SecurityException: class javax.servlet.FilterRegistration's signer information does not match signer information of other classes in the same package at java.lang.ClassLoader.checkCerts(ClassLoader.java:952) at java.lang.ClassLoader.preDefineClass(ClassLoader.java:666) at java.lang.ClassLoader.defineClass(ClassLoader.java:794) at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142) at java.net.URLClassLoader.defineClass(URLClassLoader.java:449) at java.net.URLClassLoader.access$100(URLClassLoader.java:71) at java.net.URLClassLoader$1.run(URLClassLoader.java:361) at java.net.URLClassLoader$1.run(URLClassLoader.java:355) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:354) {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1693) Most of the tests throw a java.lang.SecurityException when spark built for hadoop 2.3.0 , 2.4.0
[ https://issues.apache.org/jira/browse/SPARK-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13986507#comment-13986507 ] Sean Owen commented on SPARK-1693: -- Correct. I thought you had tried option #1 since you had suggested it too. Can you not test it in the same way that you discovered the problem? If you try, I'd suggest the #2 option, because if that works, it is a more robust solution. Most of the tests throw a java.lang.SecurityException when spark built for hadoop 2.3.0 , 2.4.0 Key: SPARK-1693 URL: https://issues.apache.org/jira/browse/SPARK-1693 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Guoqiang Li Assignee: Guoqiang Li Attachments: log.txt {code}mvn test -Pyarn -Dhadoop.version=2.4.0 -Dyarn.version=2.4.0 log.txt{code} The log: {code} UnpersistSuite: - unpersist RDD *** FAILED *** java.lang.SecurityException: class javax.servlet.FilterRegistration's signer information does not match signer information of other classes in the same package at java.lang.ClassLoader.checkCerts(ClassLoader.java:952) at java.lang.ClassLoader.preDefineClass(ClassLoader.java:666) at java.lang.ClassLoader.defineClass(ClassLoader.java:794) at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142) at java.net.URLClassLoader.defineClass(URLClassLoader.java:449) at java.net.URLClassLoader.access$100(URLClassLoader.java:71) at java.net.URLClassLoader$1.run(URLClassLoader.java:361) at java.net.URLClassLoader$1.run(URLClassLoader.java:355) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:354) {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1693) Most of the tests throw a java.lang.SecurityException when spark built for hadoop 2.3.0 , 2.4.0
[ https://issues.apache.org/jira/browse/SPARK-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13986523#comment-13986523 ] Sean Owen commented on SPARK-1693: -- No, that's exactly the problem as far as I can tell. My suggestion is: - In core, exclude org.eclipse.jetty.orbit:javax.servlet from org.eclipse.jetty:jetty-server - Declare a runtime-scope dependency in core, on javax.servlet:servlet-api:jar:3.0 - (And that will happen to override Hadoop's javax.servlet:servlet-api:jar:2.5 !) - Update the SBT build as closely as possible to match Most of the tests throw a java.lang.SecurityException when spark built for hadoop 2.3.0 , 2.4.0 Key: SPARK-1693 URL: https://issues.apache.org/jira/browse/SPARK-1693 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Guoqiang Li Assignee: Guoqiang Li Attachments: log.txt {code}mvn test -Pyarn -Dhadoop.version=2.4.0 -Dyarn.version=2.4.0 log.txt{code} The log: {code} UnpersistSuite: - unpersist RDD *** FAILED *** java.lang.SecurityException: class javax.servlet.FilterRegistration's signer information does not match signer information of other classes in the same package at java.lang.ClassLoader.checkCerts(ClassLoader.java:952) at java.lang.ClassLoader.preDefineClass(ClassLoader.java:666) at java.lang.ClassLoader.defineClass(ClassLoader.java:794) at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142) at java.net.URLClassLoader.defineClass(URLClassLoader.java:449) at java.net.URLClassLoader.access$100(URLClassLoader.java:71) at java.net.URLClassLoader$1.run(URLClassLoader.java:361) at java.net.URLClassLoader$1.run(URLClassLoader.java:355) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:354) {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
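Those steps, sketched in SBT form with the coordinates named above (the Maven pom would need the matching exclusion and runtime-scope dependency):
{code}
// Solution #2, sketched: exclude Jetty's Orbit repackaging of the Servlet 3.0
// API and declare javax.servlet:servlet-api 3.0 at runtime scope instead,
// which also happens to override Hadoop's Servlet 2.5 dependency.
libraryDependencies ++= Seq(
  "org.eclipse.jetty" % "jetty-server" % "8.1.14.v20131031"
    exclude("org.eclipse.jetty.orbit", "javax.servlet"),
  "javax.servlet" % "servlet-api" % "3.0" % "runtime"
)
{code}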
[jira] [Commented] (SPARK-1693) Most of the tests throw a java.lang.SecurityException when spark built for hadoop 2.3.0 , 2.4.0
[ https://issues.apache.org/jira/browse/SPARK-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13986631#comment-13986631 ] Sean Owen commented on SPARK-1693: -- Yes looks very close to what I had in mind; I have two suggestions: - To be ultra-safe, use version 3.0.0 of the Servlet API not 3.0.1 - Maybe drop a comment in the parent pom about why this dependency exists -- even just a reference to this JIRA Does it work for you then? fingers crossed. Most of the tests throw a java.lang.SecurityException when spark built for hadoop 2.3.0 , 2.4.0 Key: SPARK-1693 URL: https://issues.apache.org/jira/browse/SPARK-1693 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Guoqiang Li Assignee: Guoqiang Li Attachments: log.txt {code}mvn test -Pyarn -Dhadoop.version=2.4.0 -Dyarn.version=2.4.0 log.txt{code} The log: {code} UnpersistSuite: - unpersist RDD *** FAILED *** java.lang.SecurityException: class javax.servlet.FilterRegistration's signer information does not match signer information of other classes in the same package at java.lang.ClassLoader.checkCerts(ClassLoader.java:952) at java.lang.ClassLoader.preDefineClass(ClassLoader.java:666) at java.lang.ClassLoader.defineClass(ClassLoader.java:794) at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142) at java.net.URLClassLoader.defineClass(URLClassLoader.java:449) at java.net.URLClassLoader.access$100(URLClassLoader.java:71) at java.net.URLClassLoader$1.run(URLClassLoader.java:361) at java.net.URLClassLoader$1.run(URLClassLoader.java:355) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:354) {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1698) Improve spark integration
[ https://issues.apache.org/jira/browse/SPARK-1698?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13987686#comment-13987686 ] Sean Owen commented on SPARK-1698: -- (Copying an earlier comment that went to the mailing list, but didn't make it here:) #1 and #2 are not relevant to the issue of jar size. These can be problems in general, but I don't think there have been issues attributable to file clashes. Shading has mechanisms to deal with this anyway. #3 is a problem in general too, but is not specific to shading. Where versions collide, build processes like Maven and shading must be used to resolve them. But this happens regardless of whether you shade a fat jar. #4 is a real problem specific to Java 6. It does seem like it will be important to identify and remove more unnecessary dependencies to work around it. But shading per se is not the problem, and it is important to make a packaged jar for the app. What are you proposing? Dependencies to be removed? Improve spark integration - Key: SPARK-1698 URL: https://issues.apache.org/jira/browse/SPARK-1698 Project: Spark Issue Type: Improvement Components: Build, Deploy Reporter: Guoqiang Li Assignee: Guoqiang Li Fix For: 1.0.0 Using the shade plugin to create a big JAR with all the dependencies can cause a few problems: 1. The jars' meta information goes missing 2. Some files get overwritten, e.g. plugin.xml 3. Different versions of the same jar may co-exist 4. Too big; Java 6 does not support it -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1698) Improve spark integration
[ https://issues.apache.org/jira/browse/SPARK-1698?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13987698#comment-13987698 ] Sean Owen commented on SPARK-1698: -- What is the suggested change in this particular JIRA? I saw the PR, which seems to replace the shade plugin with the assembly plugin. Given the reference to https://issues.scala-lang.org/browse/SI-6660, are you suggesting that your assembly change packages differently, by putting jars in jars? Yes, the issue you link to is exactly the kind of problem that can occur with this approach. It comes up a bit in Hadoop as well, even though it is in theory a fine way to do things. But is that what you're getting at? Improve spark integration - Key: SPARK-1698 URL: https://issues.apache.org/jira/browse/SPARK-1698 Project: Spark Issue Type: Improvement Components: Build, Deploy Reporter: Guoqiang Li Assignee: Guoqiang Li Fix For: 1.0.0 Using the shade plugin to create a big JAR with all the dependencies can cause a few problems: 1. The jars' meta information goes missing 2. Some files get overwritten, e.g. plugin.xml 3. Different versions of the same jar may co-exist 4. Too big; Java 6 does not support it -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Issue Comment Deleted] (SPARK-1556) jets3t dep doesn't update properly with newer Hadoop versions
[ https://issues.apache.org/jira/browse/SPARK-1556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-1556: - Comment: was deleted (was: Actually, why does Spark have a direct dependency on jets3t at all? it is not used directly in the code. If it's only needed at runtime, it can/should be declared that way. But if the reason it's there is just for Hadoop, then of course hadoop-client is already bringing it in, and should be allowed to bring in the version it wants.) jets3t dep doesn't update properly with newer Hadoop versions - Key: SPARK-1556 URL: https://issues.apache.org/jira/browse/SPARK-1556 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 0.8.1, 0.9.0, 1.0.0 Reporter: Nan Zhu Assignee: Nan Zhu Priority: Blocker Fix For: 1.0.0 In Hadoop 2.2.x or newer, Jet3st 0.9.0 which defines S3ServiceException/ServiceException is introduced, however, Spark still relies on Jet3st 0.7.x which has no definition of these classes What I met is that [code] 14/04/21 19:30:53 INFO deprecation: mapred.job.id is deprecated. Instead, use mapreduce.job.id 14/04/21 19:30:53 INFO deprecation: mapred.tip.id is deprecated. Instead, use mapreduce.task.id 14/04/21 19:30:53 INFO deprecation: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id 14/04/21 19:30:53 INFO deprecation: mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap 14/04/21 19:30:53 INFO deprecation: mapred.task.partition is deprecated. Instead, use mapreduce.task.partition java.lang.NoClassDefFoundError: org/jets3t/service/S3ServiceException at org.apache.hadoop.fs.s3native.NativeS3FileSystem.createDefaultStore(NativeS3FileSystem.java:280) at org.apache.hadoop.fs.s3native.NativeS3FileSystem.initialize(NativeS3FileSystem.java:270) at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2316) at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:90) at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2350) at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2332) at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:369) at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296) at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:221) at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:270) at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:140) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:207) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:205) at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:207) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:205) at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:207) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:205) at org.apache.spark.SparkContext.runJob(SparkContext.scala:891) at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopDataset(PairRDDFunctions.scala:741) at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:692) at 
org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:574) at org.apache.spark.rdd.RDD.saveAsTextFile(RDD.scala:900) at $iwC$$iwC$$iwC$$iwC.init(console:15) at $iwC$$iwC$$iwC.init(console:20) at $iwC$$iwC.init(console:22) at $iwC.init(console:24) at init(console:26) at .init(console:30) at .clinit(console) at .init(console:7) at .clinit(console) at $print(console) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:772) at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1040) at
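A one-line sketch of the (since-deleted) suggestion above, if jets3t really is only needed at runtime; the coordinates and version follow the issue text:
{code}
// Declare jets3t at runtime scope rather than compile scope: Spark's own code
// never references it directly, and Hadoop only needs it at runtime for the
// s3/s3n filesystems.
libraryDependencies += "net.java.dev.jets3t" % "jets3t" % "0.9.0" % "runtime"
{code}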
[jira] [Commented] (SPARK-1520) Assembly Jar with more than 65536 files won't work when compiled on JDK7 and run on JDK6
[ https://issues.apache.org/jira/browse/SPARK-1520?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13989098#comment-13989098 ] Sean Owen commented on SPARK-1520: -- [~pwendell] On this note, I wonder if it's also best to make Jenkins build with Java 6? I'm not quite sure if it catches things like this, but it catches things with similar roots. I had a request open at https://issues.apache.org/jira/browse/SPARK-1437 but it's a Jenkins change rather than a code change. Assembly Jar with more than 65536 files won't work when compiled on JDK7 and run on JDK6 - Key: SPARK-1520 URL: https://issues.apache.org/jira/browse/SPARK-1520 Project: Spark Issue Type: Bug Components: MLlib, Spark Core Reporter: Patrick Wendell Assignee: Xiangrui Meng Priority: Blocker Fix For: 1.0.0 This is a real doozie - when compiling a Spark assembly with JDK7, the produced jar does not work well with JRE6. I confirmed the byte code being produced is JDK 6 compatible (major version 50). What happens is that, silently, the JRE will not load any class files from the assembled jar. {code} $ sbt/sbt assembly/assembly $ /usr/lib/jvm/java-1.7.0-openjdk-amd64/bin/java -cp /home/patrick/Documents/spark/assembly/target/scala-2.10/spark-assembly-1.0.0-SNAPSHOT-hadoop1.0.4.jar org.apache.spark.ui.UIWorkloadGenerator usage: ./bin/spark-class org.apache.spark.ui.UIWorkloadGenerator [master] [FIFO|FAIR] $ /usr/lib/jvm/java-1.6.0-openjdk-amd64/bin/java -cp /home/patrick/Documents/spark/assembly/target/scala-2.10/spark-assembly-1.0.0-SNAPSHOT-hadoop1.0.4.jar org.apache.spark.ui.UIWorkloadGenerator Exception in thread main java.lang.NoClassDefFoundError: org/apache/spark/ui/UIWorkloadGenerator Caused by: java.lang.ClassNotFoundException: org.apache.spark.ui.UIWorkloadGenerator at java.net.URLClassLoader$1.run(URLClassLoader.java:217) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:205) at java.lang.ClassLoader.loadClass(ClassLoader.java:323) at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:294) at java.lang.ClassLoader.loadClass(ClassLoader.java:268) Could not find the main class: org.apache.spark.ui.UIWorkloadGenerator. Program will exit. {code} I also noticed that if the jar is unzipped, and the classpath set to the current directory, it just works. Finally, if the assembly jar is compiled with JDK6, it also works. The error is seen with any class, not just the UIWorkloadGenerator. Also, this error doesn't exist in branch 0.9, only in master. h1. Isolation and Cause The package-time behavior of Java 6 and 7 differs with respect to the format used for jar files: ||Number of entries||JDK 6||JDK 7|| |<= 65536|zip|zip| |> 65536|zip*|zip64| zip* is a workaround for the original zip format, [described in JDK-6828461|https://bugs.openjdk.java.net/browse/JDK-4828461], that allows some versions of Java 6 to support larger assembly jars. The Scala libraries we depend on have added a large number of classes, which bumped us over the limit. This causes the Java 7 packaging to not work with Java 6. We can probably go back under the limit by clearing out some accidental inclusion of FastUtil, but eventually we'll go over again. The real answer is to force people to build with JDK 6 if they want to run Spark on JRE 6. 
-I've found that if I just unpack and re-pack the jar (using `jar`) it always works:- {code} $ cd assembly/target/scala-2.10/ $ /usr/lib/jvm/java-1.6.0-openjdk-amd64/bin/java -cp ./spark-assembly-1.0.0-SNAPSHOT-hadoop1.0.4.jar org.apache.spark.ui.UIWorkloadGenerator # fails $ jar xvf spark-assembly-1.0.0-SNAPSHOT-hadoop1.0.4.jar $ jar cvf spark-assembly-1.0.0-SNAPSHOT-hadoop1.0.4.jar * $ /usr/lib/jvm/java-1.6.0-openjdk-amd64/bin/java -cp ./spark-assembly-1.0.0-SNAPSHOT-hadoop1.0.4.jar org.apache.spark.ui.UIWorkloadGenerator # succeeds {code} -I also noticed something of note. The Breeze package contains single directories that have huge numbers of files in them (e.g. 2000+ class files in one directory). It's possible we are hitting some weird bugs/corner cases with compatibility of the internal storage format of the jar itself.- -I narrowed this down specifically to the inclusion of the breeze library. Just adding breeze to an older (unaffected) build triggered the issue.- -I ran a git bisection and this appeared after the MLLib sparse vector patch was merged:- https://github.com/apache/spark/commit/80c29689ae3b589254a571da3ddb5f9c866ae534 SPARK-1212 -- This message was sent by Atlassian JIRA (v6.2#6252)
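A quick way to see which side of the limit an assembly falls on is to count its entries. A small Scala sketch (run it under Java 7, which can read both the classic zip and zip64 formats):
{code}
import java.util.zip.ZipFile
import scala.collection.JavaConverters._

// Counts the entries in a jar; per the issue above, a jar with more than
// 65536 files gets written in zip64 form by JDK 7 and cannot be read by JRE 6.
object AssemblyEntryCount {
  def main(args: Array[String]): Unit = {
    val jar = new ZipFile(args(0))
    try {
      val n = jar.entries().asScala.size
      val note = if (n > 65536) " (zip64 territory -- not readable on JRE 6)" else ""
      println(s"${args(0)}: $n entries$note")
    } finally {
      jar.close()
    }
  }
}
{code}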
[jira] [Created] (SPARK-1727) Correct small compile errors, typos, and markdown issues in (primarly) MLlib docs
Sean Owen created SPARK-1727: Summary: Correct small compile errors, typos, and markdown issues in (primarly) MLlib docs Key: SPARK-1727 URL: https://issues.apache.org/jira/browse/SPARK-1727 Project: Spark Issue Type: Bug Components: Documentation Affects Versions: 0.9.1 Reporter: Sean Owen Priority: Minor While play-testing the Scala and Java code examples in the MLlib docs, I noticed a number of small compile errors, and some typos. This led to finding and fixing a few similar items in other docs. Then in the course of building the site docs to check the result, I found a few small suggestions for the build instructions. I also found a few more formatting and markdown issues uncovered when I accidentally used maruku instead of kramdown. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1802) Audit dependency graph when Spark is built with -Phive
[ https://issues.apache.org/jira/browse/SPARK-1802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13994990#comment-13994990 ] Sean Owen commented on SPARK-1802: -- [~pwendell] You can see my start on it here: https://github.com/srowen/spark/commits/SPARK-1802 https://github.com/srowen/spark/commit/a856604cfc67cb58146ada01fda6dbbb2515fa00 This resolves the new issues you note in your diff. Next issue is that hive-exec, quite awfully, includes a copy of all of its transitive dependencies in its artifact. See https://issues.apache.org/jira/browse/HIVE-5733 and note the warnings you'll get during assembly: {code} [WARNING] hive-exec-0.12.0.jar, libthrift-0.9.0.jar define 153 overlappping classes: [WARNING] - org.apache.thrift.transport.TSaslTransport$SaslResponse ... {code} hive-exec is in fact used in this module. Aside from actual surgery on the artifact with the shade plugin, you can't control the dependencies as a result. This may be simply the best that can be done right now. If it has worked, it has worked. Am I right that the datanucleus JARs *are* meant to be in the assembly, only for the Hive build? https://github.com/apache/spark/pull/688 https://github.com/apache/spark/pull/610 That's good if so since that's what your diff shows. Finally, while we're here, I note that there are still a few JAR conflicts that turn up when you build the assembly *without* Hive. (I'm going to ignore conflicts in examples; these can be cleaned up but aren't really a big deal given its nature.) We could touch those up too. This is in the normal build (and I know how to zap most of this problem): {code} [WARNING] commons-beanutils-core-1.8.0.jar, commons-beanutils-1.7.0.jar define 82 overlappping classes: {code} These turn up in the Hadoop 2.x + YARN build: {code} [WARNING] servlet-api-2.5.jar, javax.servlet-3.0.0.v201112011016.jar define 42 overlappping classes: ... [WARNING] jcl-over-slf4j-1.7.5.jar, commons-logging-1.1.3.jar define 6 overlappping classes: ... [WARNING] activation-1.1.jar, javax.activation-1.1.0.v201105071233.jar define 17 overlappping classes: ... [WARNING] servlet-api-2.5.jar, javax.servlet-3.0.0.v201112011016.jar define 42 overlappping classes: {code} These should be easy to track down. Shall I? Audit dependency graph when Spark is built with -Phive -- Key: SPARK-1802 URL: https://issues.apache.org/jira/browse/SPARK-1802 Project: Spark Issue Type: Bug Reporter: Patrick Wendell Priority: Blocker Fix For: 1.0.0 I'd like to have binary release for 1.0 include Hive support. Since this isn't enabled by default in the build I don't think it's as well tested, so we should dig around a bit and decide if we need to e.g. add any excludes. 
{code} $ mvn install -Phive -DskipTests mvn dependency:build-classpath -pl assembly | grep -v INFO | tr : \n | awk ' { FS=/; print ( $(NF) ); }' | sort without_hive.txt $ mvn install -Phive -DskipTests mvn dependency:build-classpath -Phive -pl assembly | grep -v INFO | tr : \n | awk ' { FS=/; print ( $(NF) ); }' | sort with_hive.txt $ diff without_hive.txt with_hive.txt antlr-2.7.7.jar antlr-3.4.jar antlr-runtime-3.4.jar 10,14d6 avro-1.7.4.jar avro-ipc-1.7.4.jar avro-ipc-1.7.4-tests.jar avro-mapred-1.7.4.jar bonecp-0.7.1.RELEASE.jar 22d13 commons-cli-1.2.jar 25d15 commons-compress-1.4.1.jar 33,34d22 commons-logging-1.1.1.jar commons-logging-api-1.0.4.jar 38d25 commons-pool-1.5.4.jar 46,49d32 datanucleus-api-jdo-3.2.1.jar datanucleus-core-3.2.2.jar datanucleus-rdbms-3.2.1.jar derby-10.4.2.0.jar 53,57d35 hive-common-0.12.0.jar hive-exec-0.12.0.jar hive-metastore-0.12.0.jar hive-serde-0.12.0.jar hive-shims-0.12.0.jar 60,61d37 httpclient-4.1.3.jar httpcore-4.1.3.jar 68d43 JavaEWAH-0.3.2.jar 73d47 javolution-5.5.1.jar 76d49 jdo-api-3.0.1.jar 78d50 jetty-6.1.26.jar 87d58 jetty-util-6.1.26.jar 93d63 json-20090211.jar 98d67 jta-1.1.jar 103,104d71 libfb303-0.9.0.jar libthrift-0.9.0.jar 112d78 mockito-all-1.8.5.jar 136d101 servlet-api-2.5-20081211.jar 139d103 snappy-0.2.jar 144d107 spark-hive_2.10-1.0.0.jar 151d113 ST4-4.0.4.jar 153d114 stringtemplate-3.2.1.jar 156d116 velocity-1.7.jar 158d117 xz-1.0.jar {code} Some initial investigation suggests we may need to take some precaution surrounding (a) jetty and (b) servlet-api. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1802) Audit dependency graph when Spark is built with -Phive
[ https://issues.apache.org/jira/browse/SPARK-1802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-1802: - Attachment: hive-exec-jar-problems.txt Audit dependency graph when Spark is built with -Phive -- Key: SPARK-1802 URL: https://issues.apache.org/jira/browse/SPARK-1802 Project: Spark Issue Type: Bug Reporter: Patrick Wendell Assignee: Sean Owen Priority: Blocker Fix For: 1.0.0 Attachments: hive-exec-jar-problems.txt I'd like to have binary release for 1.0 include Hive support. Since this isn't enabled by default in the build I don't think it's as well tested, so we should dig around a bit and decide if we need to e.g. add any excludes. {code} $ mvn install -Phive -DskipTests mvn dependency:build-classpath -pl assembly | grep -v INFO | tr : \n | awk ' { FS=/; print ( $(NF) ); }' | sort without_hive.txt $ mvn install -Phive -DskipTests mvn dependency:build-classpath -Phive -pl assembly | grep -v INFO | tr : \n | awk ' { FS=/; print ( $(NF) ); }' | sort with_hive.txt $ diff without_hive.txt with_hive.txt antlr-2.7.7.jar antlr-3.4.jar antlr-runtime-3.4.jar 10,14d6 avro-1.7.4.jar avro-ipc-1.7.4.jar avro-ipc-1.7.4-tests.jar avro-mapred-1.7.4.jar bonecp-0.7.1.RELEASE.jar 22d13 commons-cli-1.2.jar 25d15 commons-compress-1.4.1.jar 33,34d22 commons-logging-1.1.1.jar commons-logging-api-1.0.4.jar 38d25 commons-pool-1.5.4.jar 46,49d32 datanucleus-api-jdo-3.2.1.jar datanucleus-core-3.2.2.jar datanucleus-rdbms-3.2.1.jar derby-10.4.2.0.jar 53,57d35 hive-common-0.12.0.jar hive-exec-0.12.0.jar hive-metastore-0.12.0.jar hive-serde-0.12.0.jar hive-shims-0.12.0.jar 60,61d37 httpclient-4.1.3.jar httpcore-4.1.3.jar 68d43 JavaEWAH-0.3.2.jar 73d47 javolution-5.5.1.jar 76d49 jdo-api-3.0.1.jar 78d50 jetty-6.1.26.jar 87d58 jetty-util-6.1.26.jar 93d63 json-20090211.jar 98d67 jta-1.1.jar 103,104d71 libfb303-0.9.0.jar libthrift-0.9.0.jar 112d78 mockito-all-1.8.5.jar 136d101 servlet-api-2.5-20081211.jar 139d103 snappy-0.2.jar 144d107 spark-hive_2.10-1.0.0.jar 151d113 ST4-4.0.4.jar 153d114 stringtemplate-3.2.1.jar 156d116 velocity-1.7.jar 158d117 xz-1.0.jar {code} Some initial investigation suggests we may need to take some precaution surrounding (a) jetty and (b) servlet-api. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1802) Audit dependency graph when Spark is built with -Phive
[ https://issues.apache.org/jira/browse/SPARK-1802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13995815#comment-13995815 ] Sean Owen commented on SPARK-1802: -- I looked further into just what might go wrong by including hive-exec into the assembly, since it includes its dependencies directly (i.e. Maven can't manage around it.) Attached is a full dump of the conflicts. The ones that are potential issues appear to be the following, and one looks like it could be a deal-breaker -- protobuf -- since it's neither forwards nor backwards compatible. That is, I recommend testing this assembly with an older Hadoop that needs 2.4.1 and see if it croaks. The rest might be worked around but need some additional mojo to make sure the right version wins in the packaging. Certainly having hive-exec in the build is making me queasy! [WARNING] hive-exec-0.12.0.jar, libthrift-0.9.0.jar define 153 overlappping classes: HBase includes libthrift-0.8.0, but it's in examples, and so figure this is ignorable. [WARNING] hive-exec-0.12.0.jar, commons-lang-2.4.jar define 2 overlappping classes: Probably ignorable, but we have to make sure commons-lang-3.3.2 'wins' in the build. [WARNING] hive-exec-0.12.0.jar, jackson-core-asl-1.9.11.jar define 117 overlappping classes: [WARNING] hive-exec-0.12.0.jar, jackson-mapper-asl-1.8.8.jar define 432 overlappping classes: Believe this are ignorable. (Not sure why the jackson versions are mismatched? another todo) [WARNING] hive-exec-0.12.0.jar, guava-14.0.1.jar define 1087 overlappping classes: Should be OK. Hive uses 11.0.2 like Hadoop; the build is already taking that particular risk. We need 14.0.1 to win. [WARNING] hive-exec-0.12.0.jar, protobuf-java-2.4.1.jar define 204 overlappping classes: Oof. Hive has protobuf 2.5.0. This has got to be a problem for older Hadoop builds? Audit dependency graph when Spark is built with -Phive -- Key: SPARK-1802 URL: https://issues.apache.org/jira/browse/SPARK-1802 Project: Spark Issue Type: Bug Reporter: Patrick Wendell Assignee: Sean Owen Priority: Blocker Fix For: 1.0.0 I'd like to have binary release for 1.0 include Hive support. Since this isn't enabled by default in the build I don't think it's as well tested, so we should dig around a bit and decide if we need to e.g. add any excludes. 
{code} $ mvn install -Phive -DskipTests mvn dependency:build-classpath -pl assembly | grep -v INFO | tr : \n | awk ' { FS=/; print ( $(NF) ); }' | sort without_hive.txt $ mvn install -Phive -DskipTests mvn dependency:build-classpath -Phive -pl assembly | grep -v INFO | tr : \n | awk ' { FS=/; print ( $(NF) ); }' | sort with_hive.txt $ diff without_hive.txt with_hive.txt antlr-2.7.7.jar antlr-3.4.jar antlr-runtime-3.4.jar 10,14d6 avro-1.7.4.jar avro-ipc-1.7.4.jar avro-ipc-1.7.4-tests.jar avro-mapred-1.7.4.jar bonecp-0.7.1.RELEASE.jar 22d13 commons-cli-1.2.jar 25d15 commons-compress-1.4.1.jar 33,34d22 commons-logging-1.1.1.jar commons-logging-api-1.0.4.jar 38d25 commons-pool-1.5.4.jar 46,49d32 datanucleus-api-jdo-3.2.1.jar datanucleus-core-3.2.2.jar datanucleus-rdbms-3.2.1.jar derby-10.4.2.0.jar 53,57d35 hive-common-0.12.0.jar hive-exec-0.12.0.jar hive-metastore-0.12.0.jar hive-serde-0.12.0.jar hive-shims-0.12.0.jar 60,61d37 httpclient-4.1.3.jar httpcore-4.1.3.jar 68d43 JavaEWAH-0.3.2.jar 73d47 javolution-5.5.1.jar 76d49 jdo-api-3.0.1.jar 78d50 jetty-6.1.26.jar 87d58 jetty-util-6.1.26.jar 93d63 json-20090211.jar 98d67 jta-1.1.jar 103,104d71 libfb303-0.9.0.jar libthrift-0.9.0.jar 112d78 mockito-all-1.8.5.jar 136d101 servlet-api-2.5-20081211.jar 139d103 snappy-0.2.jar 144d107 spark-hive_2.10-1.0.0.jar 151d113 ST4-4.0.4.jar 153d114 stringtemplate-3.2.1.jar 156d116 velocity-1.7.jar 158d117 xz-1.0.jar {code} Some initial investigation suggests we may need to take some precaution surrounding (a) jetty and (b) servlet-api. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1760) mvn -Dsuites=* test throw an ClassNotFoundException
[ https://issues.apache.org/jira/browse/SPARK-1760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13993493#comment-13993493 ] Sean Owen commented on SPARK-1760: -- If `wildcardSuites` lets you invoke specific suites across the whole project, then that sounds like an ideal solution. If it works then I'd propose that as a small doc change? mvn -Dsuites=* test throw an ClassNotFoundException -- Key: SPARK-1760 URL: https://issues.apache.org/jira/browse/SPARK-1760 Project: Spark Issue Type: Bug Reporter: Guoqiang Li Assignee: Guoqiang Li Fix For: 1.0.0 {{mvn -Dhadoop.version=0.23.9 -Phadoop-0.23 -Dsuites=org.apache.spark.repl.ReplSuite test}} = {code} *** RUN ABORTED *** java.lang.ClassNotFoundException: org.apache.spark.repl.ReplSuite at java.net.URLClassLoader$1.run(URLClassLoader.java:366) at java.net.URLClassLoader$1.run(URLClassLoader.java:355) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:354) at java.lang.ClassLoader.loadClass(ClassLoader.java:425) at java.lang.ClassLoader.loadClass(ClassLoader.java:358) at org.scalatest.tools.Runner$$anonfun$21.apply(Runner.scala:1470) at org.scalatest.tools.Runner$$anonfun$21.apply(Runner.scala:1469) at scala.collection.TraversableLike$$anonfun$filter$1.apply(TraversableLike.scala:264) at scala.collection.immutable.List.foreach(List.scala:318) ... {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Comment Edited] (SPARK-1802) Audit dependency graph when Spark is built with -Phive
[ https://issues.apache.org/jira/browse/SPARK-1802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13995815#comment-13995815 ] Sean Owen edited comment on SPARK-1802 at 5/13/14 11:27 AM: (Edited to fix comment about protobuf versions) I looked further into just what might go wrong by including hive-exec into the assembly, since it includes its dependencies directly (i.e. Maven can't manage around it.) Attached is a full dump of the conflicts. The ones that are potential issues appear to be the following, and one looks like it could be a deal-breaker -- protobuf -- since it's neither forwards nor backwards compatible. That is, I recommend testing this assembly with an *newer* Hadoop that needs 2.5 and see if it croaks. The rest might be worked around but need some additional mojo to make sure the right version wins in the packaging. Certainly having hive-exec in the build is making me queasy! [WARNING] hive-exec-0.12.0.jar, libthrift-0.9.0.jar define 153 overlappping classes: HBase includes libthrift-0.8.0, but it's in examples, and so figure this is ignorable. [WARNING] hive-exec-0.12.0.jar, commons-lang-2.4.jar define 2 overlappping classes: Probably ignorable, but we have to make sure commons-lang-3.3.2 'wins' in the build. [WARNING] hive-exec-0.12.0.jar, jackson-core-asl-1.9.11.jar define 117 overlappping classes: [WARNING] hive-exec-0.12.0.jar, jackson-mapper-asl-1.8.8.jar define 432 overlappping classes: Believe this are ignorable. (Not sure why the jackson versions are mismatched? another todo) [WARNING] hive-exec-0.12.0.jar, guava-14.0.1.jar define 1087 overlappping classes: Should be OK. Hive uses 11.0.2 like Hadoop; the build is already taking that particular risk. We need 14.0.1 to win. [WARNING] hive-exec-0.12.0.jar, protobuf-java-2.4.1.jar define 204 overlappping classes: Oof. Hive has protobuf *2.4.1*. This has got to be a problem for newer Hadoop builds? (Edited to fix comment about protobuf versions) was (Author: srowen): I looked further into just what might go wrong by including hive-exec into the assembly, since it includes its dependencies directly (i.e. Maven can't manage around it.) Attached is a full dump of the conflicts. The ones that are potential issues appear to be the following, and one looks like it could be a deal-breaker -- protobuf -- since it's neither forwards nor backwards compatible. That is, I recommend testing this assembly with an older Hadoop that needs 2.4.1 and see if it croaks. The rest might be worked around but need some additional mojo to make sure the right version wins in the packaging. Certainly having hive-exec in the build is making me queasy! [WARNING] hive-exec-0.12.0.jar, libthrift-0.9.0.jar define 153 overlappping classes: HBase includes libthrift-0.8.0, but it's in examples, and so figure this is ignorable. [WARNING] hive-exec-0.12.0.jar, commons-lang-2.4.jar define 2 overlappping classes: Probably ignorable, but we have to make sure commons-lang-3.3.2 'wins' in the build. [WARNING] hive-exec-0.12.0.jar, jackson-core-asl-1.9.11.jar define 117 overlappping classes: [WARNING] hive-exec-0.12.0.jar, jackson-mapper-asl-1.8.8.jar define 432 overlappping classes: Believe this are ignorable. (Not sure why the jackson versions are mismatched? another todo) [WARNING] hive-exec-0.12.0.jar, guava-14.0.1.jar define 1087 overlappping classes: Should be OK. Hive uses 11.0.2 like Hadoop; the build is already taking that particular risk. We need 14.0.1 to win. 
[WARNING] hive-exec-0.12.0.jar, protobuf-java-2.4.1.jar define 204 overlappping classes: Oof. Hive has protobuf 2.5.0. This has got to be a problem for older Hadoop builds? Audit dependency graph when Spark is built with -Phive -- Key: SPARK-1802 URL: https://issues.apache.org/jira/browse/SPARK-1802 Project: Spark Issue Type: Bug Reporter: Patrick Wendell Assignee: Sean Owen Priority: Blocker Fix For: 1.0.0 Attachments: hive-exec-jar-problems.txt I'd like to have binary release for 1.0 include Hive support. Since this isn't enabled by default in the build I don't think it's as well tested, so we should dig around a bit and decide if we need to e.g. add any excludes. {code} $ mvn install -Phive -DskipTests mvn dependency:build-classpath -pl assembly | grep -v INFO | tr : \n | awk ' { FS=/; print ( $(NF) ); }' | sort without_hive.txt $ mvn install -Phive -DskipTests mvn dependency:build-classpath -Phive -pl assembly | grep -v INFO | tr : \n | awk ' { FS=/; print ( $(NF) ); }' | sort with_hive.txt $ diff without_hive.txt with_hive.txt antlr-2.7.7.jar antlr-3.4.jar antlr-runtime-3.4.jar 10,14d6 avro-1.7.4.jar
[jira] [Commented] (SPARK-1789) Multiple versions of Netty dependencies cause FlumeStreamSuite failure
[ https://issues.apache.org/jira/browse/SPARK-1789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13997317#comment-13997317 ] Sean Owen commented on SPARK-1789: -- I don't have any info either way on that. Later is always better no? probably OK to consider post-1.0? The issue here was to do with Netty, and the comment about Akka that I quoted was really meant to suggest that it was Netty (as it happens being imported by Akka) that was relevant. Multiple versions of Netty dependencies cause FlumeStreamSuite failure -- Key: SPARK-1789 URL: https://issues.apache.org/jira/browse/SPARK-1789 Project: Spark Issue Type: Bug Components: Build Affects Versions: 0.9.1 Reporter: Sean Owen Assignee: Sean Owen Labels: flume, netty, test Fix For: 1.0.0 TL;DR is there is a bit of JAR hell trouble with Netty, that can be mostly resolved and will resolve a test failure. I hit the error described at http://apache-spark-user-list.1001560.n3.nabble.com/SparkContext-startup-time-out-td1753.html while running FlumeStreamingSuite, and have for a short while (is it just me?) velvia notes: I have found a workaround. If you add akka 2.2.4 to your dependencies, then everything works, probably because akka 2.2.4 brings in newer version of Jetty. There are at least 3 versions of Netty in play in the build: - the new Flume 1.4.0 dependency brings in io.netty:netty:3.4.0.Final, and that is the immediate problem - the custom version of akka 2.2.3 depends on io.netty:netty:3.6.6. - but, Spark Core directly uses io.netty:netty-all:4.0.17.Final The POMs try to exclude other versions of netty, but are excluding org.jboss.netty:netty, when in fact older versions of io.netty:netty (not netty-all) are also an issue. The org.jboss.netty:netty excludes are largely unnecessary. I replaced many of them with io.netty:netty exclusions until everything agreed on io.netty:netty-all:4.0.17.Final. But this didn't work, since Akka 2.2.3 doesn't work with Netty 4.x. Down-grading to 3.6.6.Final across the board made some Spark code not compile. If the build *keeps* io.netty:netty:3.6.6.Final as well, everything seems to work. Part of the reason seems to be that Netty 3.x used the old `org.jboss.netty` packages. This is less than ideal, but is no worse than the current situation. So this PR resolves the issue and improves the JAR hell, even if it leaves the existing theoretical Netty 3-vs-4 conflict: - Remove org.jboss.netty excludes where possible, for clarity; they're not needed except with Hadoop artifacts - Add io.netty:netty excludes where needed -- except, let akka keep its io.netty:netty - Change a bit of test code that actually depended on Netty 3.x, to use 4.x equivalent - Update SBT build accordingly A better change would be to update Akka far enough such that it agrees on Netty 4.x, but I don't know if that's feasible. -- This message was sent by Atlassian JIRA (v6.2#6252)
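A rough SBT sketch of the exclusion pattern described above; the Flume artifact name here is illustrative, and the real change touches the Maven poms and SBT build together:
{code}
// Sketch only: Flume 1.4.0 transitively brings in io.netty:netty:3.4.0.Final,
// which is the immediate problem, so exclude it there; Spark Core stays on
// io.netty:netty-all 4.0.17.Final, and the custom Akka keeps its own
// io.netty:netty 3.6.6.Final untouched.
libraryDependencies ++= Seq(
  "io.netty" % "netty-all" % "4.0.17.Final",
  "org.apache.flume" % "flume-ng-sdk" % "1.4.0" exclude("io.netty", "netty")
)
{code}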
[jira] [Commented] (SPARK-1827) LICENSE and NOTICE files need a refresh to contain transitive dependency info
[ https://issues.apache.org/jira/browse/SPARK-1827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13997458#comment-13997458 ] Sean Owen commented on SPARK-1827: -- LICENSE and NOTICE policy is explained here: http://www.apache.org/dev/licensing-howto.html http://www.apache.org/legal/3party.html This leads to the following changes. First, this change enables two extensions to maven-shade-plugin in assembly/ that will try to include and merge all NOTICE and LICENSE files. This can't hurt. This generates a consolidated NOTICE file that I manually added to NOTICE. Next, a list of all dependencies and their licenses was generated: mvn ... license:aggregate-add-third-party to create: target/generated-sources/license/THIRD-PARTY.txt Each dependency is listed with one or more licenses. Determine the most-compatible license for each if there is more than one. For unknown license dependencies, I manually evaluated their license. Many are actually Apache projects or components of projects covered already. The only non-trivial one was Colt, which has its own (compatible) license. I ignored Apache-licensed and public domain dependencies as these require no further action (beyond NOTICE above). BSD and MIT licenses (permissive Category A licenses) are evidently supposed to be mentioned in LICENSE, so I added a section with the output from the THIRD-PARTY.txt file appropriately. Everything else, the Category B licenses, is evidently mentioned in NOTICE (?). Same there. LICENSE contained some license statements for source code that is redistributed. I left this as I think that is the right place to put it. LICENSE and NOTICE files need a refresh to contain transitive dependency info - Key: SPARK-1827 URL: https://issues.apache.org/jira/browse/SPARK-1827 Project: Spark Issue Type: Bug Components: Build Affects Versions: 0.9.1 Reporter: Sean Owen Priority: Blocker Fix For: 1.0.0 (Pardon marking it a blocker, but think it needs doing before 1.0 per chat with [~pwendell]) The LICENSE and NOTICE files need to cover all transitive dependencies, since these are all distributed in the assembly jar. (c.f. http://www.apache.org/dev/licensing-howto.html ) I don't believe the current files cover everything. It's possible to mostly-automatically generate these. I will generate this and propose a patch to both today. -- This message was sent by Atlassian JIRA (v6.2#6252)
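Not part of the change itself, but for sorting the generated report into the Category A and Category B buckets described above, something like this Scala sketch works, assuming the license plugin's usual one-line-per-dependency format with the license name in leading parentheses (the format is an assumption, not verified here):
{code}
import scala.io.Source

// Rough sketch: group THIRD-PARTY.txt entries by license name, so permissive
// (Category A) and Category B licenses can be split between LICENSE and NOTICE.
// Assumes lines look like "  (License Name) Artifact (group:artifact:version - url)".
object GroupThirdPartyByLicense {
  private val Line = """^\s*\((.+?)\)\s+(.*)$""".r

  def main(args: Array[String]): Unit = {
    val entries = Source.fromFile(args(0)).getLines()
      .collect { case Line(license, artifact) => license -> artifact }
      .toList
    entries.groupBy(_._1).toSeq.sortBy(_._1).foreach { case (license, deps) =>
      println(s"$license (${deps.size}):")
      deps.foreach { case (_, artifact) => println(s"  $artifact") }
    }
  }
}
{code}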
[jira] [Created] (SPARK-1827) LICENSE and NOTICE files need a refresh to contain transitive dependency info
Sean Owen created SPARK-1827: Summary: LICENSE and NOTICE files need a refresh to contain transitive dependency info Key: SPARK-1827 URL: https://issues.apache.org/jira/browse/SPARK-1827 Project: Spark Issue Type: Bug Components: Build Affects Versions: 0.9.1 Reporter: Sean Owen Priority: Blocker (Pardon marking it a blocker, but think it needs doing before 1.0 per chat with [~pwendell]) The LICENSE and NOTICE files need to cover all transitive dependencies, since these are all distributed in the assembly jar. (c.f. http://www.apache.org/dev/licensing-howto.html ) I don't believe the current files cover everything. It's possible to mostly-automatically generate these. I will generate this and propose a patch to both today. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1760) mvn -Dsuites=* test throw an ClassNotFoundException
[ https://issues.apache.org/jira/browse/SPARK-1760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13992837#comment-13992837 ] Sean Owen commented on SPARK-1760: -- Yeah, I think you would have to run this from the `repl/` module for it to work. At least it does for me, and that makes sense. I think the docs just need to note that, at: https://spark.apache.org/docs/0.9.1/building-with-maven.html (and that suite name can be updated to include org.apache) mvn -Dsuites=* test throw an ClassNotFoundException -- Key: SPARK-1760 URL: https://issues.apache.org/jira/browse/SPARK-1760 Project: Spark Issue Type: Bug Reporter: Guoqiang Li {{mvn -Dhadoop.version=0.23.9 -Phadoop-0.23 -Dsuites=org.apache.spark.repl.ReplSuite test}} = {code} *** RUN ABORTED *** java.lang.ClassNotFoundException: org.apache.spark.repl.ReplSuite at java.net.URLClassLoader$1.run(URLClassLoader.java:366) at java.net.URLClassLoader$1.run(URLClassLoader.java:355) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:354) at java.lang.ClassLoader.loadClass(ClassLoader.java:425) at java.lang.ClassLoader.loadClass(ClassLoader.java:358) at org.scalatest.tools.Runner$$anonfun$21.apply(Runner.scala:1470) at org.scalatest.tools.Runner$$anonfun$21.apply(Runner.scala:1469) at scala.collection.TraversableLike$$anonfun$filter$1.apply(TraversableLike.scala:264) at scala.collection.immutable.List.foreach(List.scala:318) ... {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1575) failing tests with master branch
[ https://issues.apache.org/jira/browse/SPARK-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13996955#comment-13996955 ] Sean Owen commented on SPARK-1575: -- For what it's worth, I no longer see this failure; I believe it has been resolved by other changes along the way. failing tests with master branch - Key: SPARK-1575 URL: https://issues.apache.org/jira/browse/SPARK-1575 Project: Spark Issue Type: Test Reporter: Nishkam Ravi Priority: Blocker Built the master branch against Hadoop version 2.3.0-cdh5.0.0 with SPARK_YARN=true. sbt tests don't go through successfully (tried multiple runs). -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1787) Build failure on JDK8 :: SBT fails to load build configuration file
[ https://issues.apache.org/jira/browse/SPARK-1787?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13998492#comment-13998492 ] Sean Owen commented on SPARK-1787: -- Duplicate of https://issues.apache.org/jira/browse/SPARK-1444 it appears Build failure on JDK8 :: SBT fails to load build configuration file --- Key: SPARK-1787 URL: https://issues.apache.org/jira/browse/SPARK-1787 Project: Spark Issue Type: New Feature Components: Build Affects Versions: 0.9.0 Environment: JDK8 Scala 2.10.X SBT 0.12.X Reporter: Richard Gomes Priority: Minor SBT fails to build under JDK8. Please find steps to reproduce the error below: (j8s10)rgomes@terra:~/workspace/spark-0.9.1$ uname -a Linux terra 3.13-1-amd64 #1 SMP Debian 3.13.10-1 (2014-04-15) x86_64 GNU/Linux (j8s10)rgomes@terra:~/workspace/spark-0.9.1$ java -version java version 1.8.0_05 Java(TM) SE Runtime Environment (build 1.8.0_05-b13) Java HotSpot(TM) 64-Bit Server VM (build 25.5-b02, mixed mode) (j8s10)rgomes@terra:~/workspace/spark-0.9.1$ scala -version Scala code runner version 2.10.3 -- Copyright 2002-2013, LAMP/EPFL (j8s10)rgomes@terra:~/workspace/spark-0.9.1$ sbt/sbt clean Launching sbt from sbt/sbt-launch-0.12.4.jar Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=350m; support was removed in 8.0 [info] Loading project definition from /home/rgomes/workspace/spark-0.9.1/project/project [info] Compiling 1 Scala source to /home/rgomes/workspace/spark-0.9.1/project/project/target/scala-2.9.2/sbt-0.12/classes... [error] error while loading CharSequence, class file '/opt/developer/jdk1.8.0_05/jre/lib/rt.jar(java/lang/CharSequence.class)' is broken [error] (bad constant pool tag 15 at byte 1501) [error] error while loading Comparator, class file '/opt/developer/jdk1.8.0_05/jre/lib/rt.jar(java/util/Comparator.class)' is broken [error] (bad constant pool tag 15 at byte 5003) [error] two errors found [error] (compile:compile) Compilation failed Project loading failed: (r)etry, (q)uit, (l)ast, or (i)gnore? q -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1473) Feature selection for high dimensional datasets
[ https://issues.apache.org/jira/browse/SPARK-1473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13998494#comment-13998494 ] Sean Owen commented on SPARK-1473: -- I believe these types of things were more the goals of the MLI and MLbase projects than of MLlib? I don't know the status of those. For what it's worth, I think these are very useful things, but in a separate 'layer' above something like MLlib. Feature selection for high dimensional datasets --- Key: SPARK-1473 URL: https://issues.apache.org/jira/browse/SPARK-1473 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Ignacio Zendejas Priority: Minor Labels: features Fix For: 1.1.0 For classification tasks involving large feature spaces in the order of tens of thousands or higher (e.g., text classification with n-grams, where n > 1), it is often useful to rank and filter features that are irrelevant, thereby reducing the feature space by at least one or two orders of magnitude without impacting performance on key evaluation metrics (accuracy/precision/recall). A flexible feature evaluation interface needs to be designed, and at least two methods should be implemented, with Information Gain being a priority as it has been shown to be amongst the most reliable. Special consideration should be taken in the design to account for wrapper methods (see research papers below) which are more practical for lower dimensional data. Relevant research: * Brown, G., Pocock, A., Zhao, M. J., Luján, M. (2012). Conditional likelihood maximisation: a unifying framework for information theoretic feature selection. *The Journal of Machine Learning Research*, *13*, 27-66. * Forman, George. An extensive empirical study of feature selection metrics for text classification. The Journal of Machine Learning Research 3 (2003): 1289-1305. -- This message was sent by Atlassian JIRA (v6.2#6252)
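To make the Information Gain criterion mentioned in the description concrete, here is a small self-contained Scala sketch (purely illustrative, not an MLlib API) that scores a binary present/absent feature from per-class counts:
{code}
// Information gain of a binary feature: H(class) minus the expected entropy
// of the class once the feature's presence/absence is known. Assumes the
// count vectors are non-empty and aligned by class.
object InfoGain {
  private def entropy(counts: Seq[Long]): Double = {
    val total = counts.sum.toDouble
    counts.filter(_ > 0).map { c =>
      val p = c / total
      -p * math.log(p) / math.log(2)
    }.sum
  }

  def informationGain(withFeature: Seq[Long], withoutFeature: Seq[Long]): Double = {
    val nWith = withFeature.sum.toDouble
    val nWithout = withoutFeature.sum.toDouble
    val n = nWith + nWithout
    val classCounts = withFeature.zip(withoutFeature).map { case (a, b) => a + b }
    entropy(classCounts) -
      (nWith / n) * entropy(withFeature) -
      (nWithout / n) * entropy(withoutFeature)
  }
}
{code}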
[jira] [Commented] (SPARK-1875) NoClassDefFoundError: StringUtils when building against Hadoop 1
[ https://issues.apache.org/jira/browse/SPARK-1875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14001424#comment-14001424 ] Sean Owen commented on SPARK-1875: -- (That's correct that commons-lang and commons-lang3 use separate packages.) NoClassDefFoundError: StringUtils when building against Hadoop 1 Key: SPARK-1875 URL: https://issues.apache.org/jira/browse/SPARK-1875 Project: Spark Issue Type: Bug Reporter: Matei Zaharia Assignee: Guoqiang Li Priority: Blocker Fix For: 1.0.0 Maybe I missed something, but after building an assembly with Hadoop 1.2.1 and Hive enabled, if I go into it and run spark-shell, I get this: {code} java.lang.NoClassDefFoundError: org/apache/commons/lang/StringUtils at org.apache.hadoop.metrics2.lib.MetricMutableStat.init(MetricMutableStat.java:59) at org.apache.hadoop.metrics2.impl.MetricsSystemImpl.init(MetricsSystemImpl.java:75) at org.apache.hadoop.metrics2.impl.MetricsSystemImpl.init(MetricsSystemImpl.java:120) at org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.init(DefaultMetricsSystem.java:37) at org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.clinit(DefaultMetricsSystem.java:34) at org.apache.hadoop.security.UgiInstrumentation.create(UgiInstrumentation.java:51) at org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:216) at org.apache.hadoop.security.UserGroupInformation.ensureInitialized(UserGroupInformation.java:184) at org.apache.hadoop.security.UserGroupInformation.isSecurityEnabled(UserGroupInformation.java:236) at org.apache.hadoop.security.KerberosName.clinit(KerberosName.java:79) at org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:209) at org.apache.hadoop.security.UserGroupInformation.setConfiguration(UserGroupInformation.java:226) at org.apache.spark.deploy.SparkHadoopUtil.init(SparkHadoopUtil.scala:36) at org.apache.spark.deploy.SparkHadoopUtil$.init(SparkHadoopUtil.scala:109) at org.apache.spark.deploy.SparkHadoopUtil$.clinit(SparkHadoopUtil.scala) at org.apache.spark.SparkContext.init(SparkContext.scala:228) {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
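To illustrate the parenthetical above: the two artifacts use different Java packages, so both can sit on a classpath without clashing. A tiny Scala sketch, assuming both jars are present:
{code}
import org.apache.commons.lang.StringUtils                      // commons-lang 2.x, what Hadoop 1 expects
import org.apache.commons.lang3.{StringUtils => StringUtils3}   // commons-lang3, a separate package

object LangVersions {
  def main(args: Array[String]): Unit = {
    println(StringUtils.isBlank("  "))   // 2.x API
    println(StringUtils3.isBlank("  "))  // 3.x API
  }
}
{code}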
[jira] [Commented] (SPARK-1867) Spark Documentation Error causes java.lang.IllegalStateException: unread block data
[ https://issues.apache.org/jira/browse/SPARK-1867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14007339#comment-14007339 ] Sean Owen commented on SPARK-1867: -- Yes, the 'mr1' artifacts are for when you are *not* using YARN. These are unusual to use for CDH5, and you would not need those versions. The stock Spark artifacts are for Hadoop 1, not Hadoop 2. They can be built for Hadoop 2 and installed locally if you like. You can use the matched CDH5 version, which is of course made for Hadoop 2, by targeting '0.9.0-cdh5.0.1' for example. (I don't have a release schedule, but assume some later version will be released with 5.1 of course.) There shouldn't be any trial and error to it if you express as dependencies all the things you directly use. For example, you say your app uses HBase, but I see no dependence on the client libraries. This has nothing to do with Spark per se. If it does turn into trial and error, you're probably trying to do something the wrong way around. Which classes are you actually looking for? Spark Documentation Error causes java.lang.IllegalStateException: unread block data --- Key: SPARK-1867 URL: https://issues.apache.org/jira/browse/SPARK-1867 Project: Spark Issue Type: Bug Reporter: sam I've employed two System Administrators on a contract basis (for quite a bit of money), and both contractors have independently hit the following exception. What we are doing is: 1. Installing Spark 0.9.1 according to the documentation on the website, along with CDH4 (and another cluster with CDH5) distros of hadoop/hdfs. 2. Building a fat jar with a Spark app with sbt then trying to run it on the cluster I've also included code snippets, and sbt deps at the bottom. When I've Googled this, there seems to be two somewhat vague responses: a) Mismatching spark versions on nodes/user code b) Need to add more jars to the SparkConf Now I know that (b) is not the problem having successfully run the same code on other clusters while only including one jar (it's a fat jar). But I have no idea how to check for (a) - it appears Spark doesn't have any version checks or anything - it would be nice if it checked versions and threw a mismatching version exception: you have user code using version X and node Y has version Z. I would be very grateful for advice on this. 
The exception: Exception in thread main org.apache.spark.SparkException: Job aborted: Task 0.0:1 failed 32 times (most recent failure: Exception failure: java.lang.IllegalStateException: unread block data) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1020) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1018) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$abortStage(DAGScheduler.scala:1018) at org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:604) at org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:604) at scala.Option.foreach(Option.scala:236) at org.apache.spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:604) at org.apache.spark.scheduler.DAGScheduler$$anonfun$start$1$$anon$2$$anonfun$receive$1.applyOrElse(DAGScheduler.scala:190) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) at akka.actor.ActorCell.invoke(ActorCell.scala:456) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) at akka.dispatch.Mailbox.run(Mailbox.scala:219) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) 14/05/16 18:05:31 INFO scheduler.TaskSetManager: Loss was due to java.lang.IllegalStateException: unread block data [duplicate 59] My code snippet: val conf = new SparkConf() .setMaster(clusterMaster) .setAppName(appName) .setSparkHome(sparkHome) .setJars(SparkContext.jarOfClass(this.getClass)) println(count = + new
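Following up on the comment above about targeting the CDH5-matched Spark artifact and declaring the HBase client libraries the app uses directly, here is a rough sbt sketch of that shape of build. The repository URL and the exact HBase artifact version are assumptions for illustration, not verified coordinates.
{code}
// build.sbt sketch: a Spark core built against Hadoop 2 / CDH 5, plus the HBase
// client API the application actually calls; other Hadoop bits arrive transitively.
resolvers += "cloudera-repo" at "https://repository.cloudera.com/artifactory/cloudera-repos/"

libraryDependencies ++= Seq(
  "org.apache.spark" % "spark-core_2.10" % "0.9.0-cdh5.0.1",
  "org.apache.hbase" % "hbase-client"    % "0.96.1.1-cdh5.0.1"  // assumed version string
)
{code}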
[jira] [Commented] (SPARK-1867) Spark Documentation Error causes java.lang.IllegalStateException: unread block data
[ https://issues.apache.org/jira/browse/SPARK-1867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14008077#comment-14008077 ] Sean Owen commented on SPARK-1867: -- I could be way wrong here, partly as a function of not knowing the context entirely, but if things are being fixed by including bits and pieces of CDH-flavored M/R dependency then you may just be patching over the fact that you really need to use the Hadoop2/CDH-flavored Spark artifacts to begin with. If anyone thinks it's useful to discuss how this is done normally, offline, I'd be happy to say what I know, given more info. Spark Documentation Error causes java.lang.IllegalStateException: unread block data --- Key: SPARK-1867 URL: https://issues.apache.org/jira/browse/SPARK-1867 Project: Spark Issue Type: Bug Reporter: sam I've employed two System Administrators on a contract basis (for quite a bit of money), and both contractors have independently hit the following exception. What we are doing is: 1. Installing Spark 0.9.1 according to the documentation on the website, along with CDH4 (and another cluster with CDH5) distros of hadoop/hdfs. 2. Building a fat jar with a Spark app with sbt then trying to run it on the cluster I've also included code snippets, and sbt deps at the bottom. When I've Googled this, there seems to be two somewhat vague responses: a) Mismatching spark versions on nodes/user code b) Need to add more jars to the SparkConf Now I know that (b) is not the problem having successfully run the same code on other clusters while only including one jar (it's a fat jar). But I have no idea how to check for (a) - it appears Spark doesn't have any version checks or anything - it would be nice if it checked versions and threw a mismatching version exception: you have user code using version X and node Y has version Z. I would be very grateful for advice on this. 
The exception: Exception in thread main org.apache.spark.SparkException: Job aborted: Task 0.0:1 failed 32 times (most recent failure: Exception failure: java.lang.IllegalStateException: unread block data) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1020) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1018) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$abortStage(DAGScheduler.scala:1018) at org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:604) at org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:604) at scala.Option.foreach(Option.scala:236) at org.apache.spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:604) at org.apache.spark.scheduler.DAGScheduler$$anonfun$start$1$$anon$2$$anonfun$receive$1.applyOrElse(DAGScheduler.scala:190) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) at akka.actor.ActorCell.invoke(ActorCell.scala:456) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) at akka.dispatch.Mailbox.run(Mailbox.scala:219) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) 14/05/16 18:05:31 INFO scheduler.TaskSetManager: Loss was due to java.lang.IllegalStateException: unread block data [duplicate 59] My code snippet: val conf = new SparkConf() .setMaster(clusterMaster) .setAppName(appName) .setSparkHome(sparkHome) .setJars(SparkContext.jarOfClass(this.getClass)) println(count = + new SparkContext(conf).textFile(someHdfsPath).count()) My SBT dependencies: // relevant org.apache.spark % spark-core_2.10 % 0.9.1, org.apache.hadoop % hadoop-client % 2.3.0-mr1-cdh5.0.0, // standard, probably unrelated com.github.seratch %% awscala % [0.2,), org.scalacheck %% scalacheck % 1.10.1 % test, org.specs2 %% specs2 % 1.14 % test, org.scala-lang % scala-reflect % 2.10.3, org.scalaz %% scalaz-core % 7.0.5, net.minidev % json-smart % 1.2 -- This message was sent
[jira] [Commented] (SPARK-1867) Spark Documentation Error causes java.lang.IllegalStateException: unread block data
[ https://issues.apache.org/jira/browse/SPARK-1867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14008329#comment-14008329 ] Sean Owen commented on SPARK-1867: -- Well, for example, MapReduce classes are from one of the Hadoop hadoop-mapreduce-* artifacts. Generally speaking, apps depend on client artifacts, which then bring in other necessary dependencies for you. You don't manually add dependencies for your dependencies; that's what Maven is doing for you. (Although we all know some artifacts don't express their dependencies 100% correctly all the time...) That said, the upstream projects do change over time. For example, in Hadoop 1.x, there was a hadoop-core artifact. In Hadoop 2.x things were broken out further, which is generally good, but now there is hadoop-common. But you don't depend on that to use Hadoop; you would use hadoop-mapreduce-client for example if using MapReduce. The same is true of HBase, I imagine. CDH and other distributions do not move things around -- upstream projects do. Simple use cases are simple to get working. For example, if you're using Spark's core, you just depend on the one spark-core artifact. (With the caveat that you need to depend on a different Hadoop 2-compatible artifact if you use Hadoop 2 -- still one artifact, but maybe 0.9.0-cdh5.0.1 for example.) If you use HBase, you depend on whatever the HBase client artifact is too. There are gotchas here to be sure, and bits of project-specific knowledge that are required, but it's not nearly random guesswork. Maybe you can separately say what you are trying to depend on and people can confirm the few direct dependencies you should declare. I suspect there are some fundamental problems, like depending on the wrong Spark artifact. It sounds like you are manually trying to replace all of Spark's Hadoop 1 dependencies with Hadoop 2 dependencies, and that way lies madness. Use the one build for Hadoop 2. Spark Documentation Error causes java.lang.IllegalStateException: unread block data --- Key: SPARK-1867 URL: https://issues.apache.org/jira/browse/SPARK-1867 Project: Spark Issue Type: Bug Reporter: sam I've employed two System Administrators on a contract basis (for quite a bit of money), and both contractors have independently hit the following exception. What we are doing is: 1. Installing Spark 0.9.1 according to the documentation on the website, along with CDH4 (and another cluster with CDH5) distros of hadoop/hdfs. 2. Building a fat jar with a Spark app with sbt then trying to run it on the cluster I've also included code snippets, and sbt deps at the bottom. When I've Googled this, there seems to be two somewhat vague responses: a) Mismatching spark versions on nodes/user code b) Need to add more jars to the SparkConf Now I know that (b) is not the problem having successfully run the same code on other clusters while only including one jar (it's a fat jar). But I have no idea how to check for (a) - it appears Spark doesn't have any version checks or anything - it would be nice if it checked versions and threw a mismatching version exception: you have user code using version X and node Y has version Z. I would be very grateful for advice on this. 
The exception: Exception in thread main org.apache.spark.SparkException: Job aborted: Task 0.0:1 failed 32 times (most recent failure: Exception failure: java.lang.IllegalStateException: unread block data) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1020) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1018) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$abortStage(DAGScheduler.scala:1018) at org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:604) at org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:604) at scala.Option.foreach(Option.scala:236) at org.apache.spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:604) at org.apache.spark.scheduler.DAGScheduler$$anonfun$start$1$$anon$2$$anonfun$receive$1.applyOrElse(DAGScheduler.scala:190) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) at akka.actor.ActorCell.invoke(ActorCell.scala:456) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) at
[jira] [Commented] (SPARK-1935) Explicitly add commons-codec 1.4 as a dependency
[ https://issues.apache.org/jira/browse/SPARK-1935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14009472#comment-14009472 ] Sean Owen commented on SPARK-1935: -- Yeah, I think this is a matter of Maven's nearest-first vs SBT's latest-first conflict resolution strategy. It should be safe to manually manage this to 1.5, I believe. Explicitly add commons-codec 1.4 as a dependency Key: SPARK-1935 URL: https://issues.apache.org/jira/browse/SPARK-1935 Project: Spark Issue Type: Bug Components: Build Affects Versions: 0.9.1 Reporter: Yin Huai Priority: Minor Right now, commons-codec is a transitive dependency. When Spark is built by maven for Hadoop 1, jets3t 0.7.1 will pull in commons-codec 1.3 which is an older version (Hadoop 1.0.4 depends on 1.4). This older version can cause problems because 1.4 introduces incompatible changes and new methods. -- This message was sent by Atlassian JIRA (v6.2#6252)
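A minimal sbt sketch of the manual pin suggested above, assuming an sbt 0.13-era build file; the only point is to hold commons-codec at one version so that Maven's nearest-wins and SBT's latest-wins resolutions stop disagreeing.
{code}
// build.sbt: override whatever commons-codec version transitive dependencies
// (jets3t, Hadoop) would otherwise pull in, pinning it everywhere.
dependencyOverrides += "commons-codec" % "commons-codec" % "1.5"
{code}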
[jira] [Commented] (SPARK-1518) Spark master doesn't compile against hadoop-common trunk
[ https://issues.apache.org/jira/browse/SPARK-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14010294#comment-14010294 ] Sean Owen commented on SPARK-1518: -- 0.20.x stopped in early 2010. It is ancient. Spark master doesn't compile against hadoop-common trunk Key: SPARK-1518 URL: https://issues.apache.org/jira/browse/SPARK-1518 Project: Spark Issue Type: Bug Reporter: Marcelo Vanzin Assignee: Colin Patrick McCabe Priority: Critical FSDataOutputStream::sync() has disappeared from trunk in Hadoop; FileLogger.scala is calling it. I've changed it locally to hsync() so I can compile the code, but haven't checked yet whether those are equivalent. hsync() seems to have been there forever, so it hopefully works with all versions Spark cares about. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1518) Spark master doesn't compile against hadoop-common trunk
[ https://issues.apache.org/jira/browse/SPARK-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14010322#comment-14010322 ] Sean Owen commented on SPARK-1518: -- RE: Hadoop versions, in my reckoning of the twisted world of Hadoop versions, the 0.23.x branch is still active and so is kind of later than 1.0.x. It may be easier to retain 0.23 compatibility than 1.0.x for example. Spark master doesn't compile against hadoop-common trunk Key: SPARK-1518 URL: https://issues.apache.org/jira/browse/SPARK-1518 Project: Spark Issue Type: Bug Reporter: Marcelo Vanzin Assignee: Colin Patrick McCabe Priority: Critical FSDataOutputStream::sync() has disappeared from trunk in Hadoop; FileLogger.scala is calling it. I've changed it locally to hsync() so I can compile the code, but haven't checked yet whether those are equivalent. hsync() seems to have been there forever, so it hopefully works with all versions Spark cares about. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1518) Spark master doesn't compile against hadoop-common trunk
[ https://issues.apache.org/jira/browse/SPARK-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14010937#comment-14010937 ] Sean Owen commented on SPARK-1518: -- Re: versioning one more time, really supporting a bunch of versions may get costly. It's already tricky to manage two builds, times YARN-or-not, times Hive-or-not, times four flavors of Hadoop. I doubt the assemblies are yet problem-free in all cases. In practice it looks like one generic Hadoop 1, Hadoop 2, and CDH 4 release is produced, and one set of Maven artifacts. (PS: again, I am not sure Spark should contain a CDH-specific distribution, realizing it's really a proxy for a particular Hadoop combo. Same goes for a MapR profile, which is really for vendors to maintain.) That means right now you can't build a Spark app for anything but Hadoop 1.x with Maven, without installing it yourself, and there's not an official distro for anything but two major Hadoop versions. Support for niche versions isn't really there or promised anyway, and fleshing it out may become pretty burdensome. There is no suggested action here; if anything, I suggest that the right thing is to add Maven artifacts with classifiers, add a few binary artifacts, and subtract a few vendor artifacts, but this is a different action. Spark master doesn't compile against hadoop-common trunk Key: SPARK-1518 URL: https://issues.apache.org/jira/browse/SPARK-1518 Project: Spark Issue Type: Bug Reporter: Marcelo Vanzin Assignee: Colin Patrick McCabe Priority: Critical FSDataOutputStream::sync() has disappeared from trunk in Hadoop; FileLogger.scala is calling it. I've changed it locally to hsync() so I can compile the code, but haven't checked yet whether those are equivalent. hsync() seems to have been there forever, so it hopefully works with all versions Spark cares about. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1950) spark on yarn can't start
[ https://issues.apache.org/jira/browse/SPARK-1950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14011330#comment-14011330 ] Sean Owen commented on SPARK-1950: -- (Looks like you opened this twice? https://issues.apache.org/jira/browse/SPARK-1951 ) spark on yarn can't start -- Key: SPARK-1950 URL: https://issues.apache.org/jira/browse/SPARK-1950 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.0.0 Reporter: Guoqiang Li Priority: Blocker {{HADOOP_CONF_DIR=/etc/hadoop/conf ./bin/spark-submit --archives /input/lbs/recommend/toona/spark/conf toona-assembly.jar 20140521}}throw an exception: {code} Exception in thread main java.io.FileNotFoundException: File file:/input/lbs/recommend/toona/spark/conf does not exist at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:511) at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:724) at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:501) at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:402) at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:337) at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:289) at org.apache.spark.deploy.yarn.ClientBase$class.org$apache$spark$deploy$yarn$ClientBase$$copyRemoteFile(ClientBase.scala:162) at org.apache.spark.deploy.yarn.ClientBase$$anonfun$prepareLocalResources$4$$anonfun$apply$2.apply(ClientBase.scala:237) at org.apache.spark.deploy.yarn.ClientBase$$anonfun$prepareLocalResources$4$$anonfun$apply$2.apply(ClientBase.scala:232) at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108) at org.apache.spark.deploy.yarn.ClientBase$$anonfun$prepareLocalResources$4.apply(ClientBase.scala:232) at org.apache.spark.deploy.yarn.ClientBase$$anonfun$prepareLocalResources$4.apply(ClientBase.scala:230) at scala.collection.immutable.List.foreach(List.scala:318) at org.apache.spark.deploy.yarn.ClientBase$class.prepareLocalResources(ClientBase.scala:230) at org.apache.spark.deploy.yarn.Client.prepareLocalResources(Client.scala:39) at org.apache.spark.deploy.yarn.Client.runApp(Client.scala:74) at org.apache.spark.deploy.yarn.Client.run(Client.scala:96) at org.apache.spark.deploy.yarn.Client$.main(Client.scala:186) at org.apache.spark.deploy.yarn.Client.main(Client.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:292) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:55) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) {code} {{HADOOP_CONF_DIR=/etc/hadoop/conf ./bin/spark-submit --archives hdfs://10dian72:8020/input/lbs/recommend/toona/spark/conf toona-assembly.jar 20140521}} work. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1518) Spark master doesn't compile against hadoop-common trunk
[ https://issues.apache.org/jira/browse/SPARK-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14011897#comment-14011897 ] Sean Owen commented on SPARK-1518: -- {quote}they write their app against the Spark APIs in Maven central (they can do this no matter which cluster they want to run on){quote} Yeah, this is the issue. OK, if I compile against Spark artifacts as a runtime dependency and submit an app to the cluster, it should be OK no matter what build of Spark is running. The binding from Spark to Hadoop is hidden from the app. I am thinking of the case where I want to build an app that is a client of Spark -- embedding it. Then I am including the client of Hadoop, for example. I have to match my cluster then, and there is no Hadoop 2 Spark artifact. Am I missing something big here? That's my premise about why there would ever be a need for different artifacts. It's the same use case as in Sandy's blog: http://blog.cloudera.com/blog/2014/04/how-to-run-a-simple-apache-spark-app-in-cdh-5/ Spark master doesn't compile against hadoop-common trunk Key: SPARK-1518 URL: https://issues.apache.org/jira/browse/SPARK-1518 Project: Spark Issue Type: Bug Reporter: Marcelo Vanzin Assignee: Colin Patrick McCabe Priority: Critical FSDataOutputStream::sync() has disappeared from trunk in Hadoop; FileLogger.scala is calling it. I've changed it locally to hsync() so I can compile the code, but haven't checked yet whether those are equivalent. hsync() seems to have been there forever, so it hopefully works with all versions Spark cares about. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1867) Spark Documentation Error causes java.lang.IllegalStateException: unread block data
[ https://issues.apache.org/jira/browse/SPARK-1867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14012309#comment-14012309 ] Sean Owen commented on SPARK-1867: -- There is no hadoop-io module. Modules are subcomponents of the components distributed in something like CDH, and are not versioned independently, so you would not find them described on that page. That is, if CDH X.Y includes Hadoop Z.W then it includes version Z.W of all Hadoop's modules. LongWritable was in hadoop-core in Hadoop 1.x, and is in hadoop-common in 2.x. Just about everything depends on these modules, so you should not find yourself missing them at runtime if you are depending on something like hadoop-client. There isn't an org.apache.hadoop.io.CombineTextInputFormat, not that I can see -- where do you see that referenced? There is org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat, although it looks like it appeared around Hadoop 2.1 (https://issues.apache.org/jira/browse/MAPREDUCE-5069). You would not be able to use it if you are running on Hadoop 1, and wouldn't find it if you depend on Hadoop 1 modules. It appears to be in the hadoop-mapreduce-client-core module, but again, you wouldn't need to depend on that directly. That gets pulled in from hadoop-client, via hadoop-mapreduce-client. I can see it doing so when you build Spark for Hadoop 2, for example. I'm still not clear why you need to hunt for individual classes. Maybe one of those is just due to a package typo. Spark Documentation Error causes java.lang.IllegalStateException: unread block data --- Key: SPARK-1867 URL: https://issues.apache.org/jira/browse/SPARK-1867 Project: Spark Issue Type: Bug Reporter: sam I've employed two System Administrators on a contract basis (for quite a bit of money), and both contractors have independently hit the following exception. What we are doing is: 1. Installing Spark 0.9.1 according to the documentation on the website, along with CDH4 (and another cluster with CDH5) distros of hadoop/hdfs. 2. Building a fat jar with a Spark app with sbt then trying to run it on the cluster I've also included code snippets, and sbt deps at the bottom. When I've Googled this, there seems to be two somewhat vague responses: a) Mismatching spark versions on nodes/user code b) Need to add more jars to the SparkConf Now I know that (b) is not the problem having successfully run the same code on other clusters while only including one jar (it's a fat jar). But I have no idea how to check for (a) - it appears Spark doesn't have any version checks or anything - it would be nice if it checked versions and threw a mismatching version exception: you have user code using version X and node Y has version Z. I would be very grateful for advice on this. 
The exception: Exception in thread main org.apache.spark.SparkException: Job aborted: Task 0.0:1 failed 32 times (most recent failure: Exception failure: java.lang.IllegalStateException: unread block data) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1020) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1018) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$abortStage(DAGScheduler.scala:1018) at org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:604) at org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:604) at scala.Option.foreach(Option.scala:236) at org.apache.spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:604) at org.apache.spark.scheduler.DAGScheduler$$anonfun$start$1$$anon$2$$anonfun$receive$1.applyOrElse(DAGScheduler.scala:190) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) at akka.actor.ActorCell.invoke(ActorCell.scala:456) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) at akka.dispatch.Mailbox.run(Mailbox.scala:219) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at
[jira] [Commented] (SPARK-1867) Spark Documentation Error causes java.lang.IllegalStateException: unread block data
[ https://issues.apache.org/jira/browse/SPARK-1867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14012434#comment-14012434 ] Sean Owen commented on SPARK-1867: -- Something else is up; those are equivalent in Java and you couldn't have imported two symbols with the same name. I could look at the file offline if you think I might spot something else. Spark Documentation Error causes java.lang.IllegalStateException: unread block data --- Key: SPARK-1867 URL: https://issues.apache.org/jira/browse/SPARK-1867 Project: Spark Issue Type: Bug Reporter: sam I've employed two System Administrators on a contract basis (for quite a bit of money), and both contractors have independently hit the following exception. What we are doing is: 1. Installing Spark 0.9.1 according to the documentation on the website, along with CDH4 (and another cluster with CDH5) distros of hadoop/hdfs. 2. Building a fat jar with a Spark app with sbt then trying to run it on the cluster I've also included code snippets, and sbt deps at the bottom. When I've Googled this, there seems to be two somewhat vague responses: a) Mismatching spark versions on nodes/user code b) Need to add more jars to the SparkConf Now I know that (b) is not the problem having successfully run the same code on other clusters while only including one jar (it's a fat jar). But I have no idea how to check for (a) - it appears Spark doesn't have any version checks or anything - it would be nice if it checked versions and threw a mismatching version exception: you have user code using version X and node Y has version Z. I would be very grateful for advice on this. The exception: Exception in thread main org.apache.spark.SparkException: Job aborted: Task 0.0:1 failed 32 times (most recent failure: Exception failure: java.lang.IllegalStateException: unread block data) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1020) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1018) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$abortStage(DAGScheduler.scala:1018) at org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:604) at org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:604) at scala.Option.foreach(Option.scala:236) at org.apache.spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:604) at org.apache.spark.scheduler.DAGScheduler$$anonfun$start$1$$anon$2$$anonfun$receive$1.applyOrElse(DAGScheduler.scala:190) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) at akka.actor.ActorCell.invoke(ActorCell.scala:456) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) at akka.dispatch.Mailbox.run(Mailbox.scala:219) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) 14/05/16 18:05:31 INFO scheduler.TaskSetManager: Loss was due 
to java.lang.IllegalStateException: unread block data [duplicate 59] My code snippet: val conf = new SparkConf() .setMaster(clusterMaster) .setAppName(appName) .setSparkHome(sparkHome) .setJars(SparkContext.jarOfClass(this.getClass)) println(count = + new SparkContext(conf).textFile(someHdfsPath).count()) My SBT dependencies: // relevant org.apache.spark % spark-core_2.10 % 0.9.1, org.apache.hadoop % hadoop-client % 2.3.0-mr1-cdh5.0.0, // standard, probably unrelated com.github.seratch %% awscala % [0.2,), org.scalacheck %% scalacheck % 1.10.1 % test, org.specs2 %% specs2 % 1.14 % test, org.scala-lang % scala-reflect % 2.10.3, org.scalaz %% scalaz-core % 7.0.5, net.minidev % json-smart % 1.2 -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1518) Spark master doesn't compile against hadoop-common trunk
[ https://issues.apache.org/jira/browse/SPARK-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14012552#comment-14012552 ] Sean Owen commented on SPARK-1518: -- Heh, I think the essence is: at least one more separate Maven artifact, under a different classifier, for Hadoop 2.x builds. If you package that, you get Spark and everything it needs to work against a Hadoop 2 cluster. Yeah I see that you're suggesting various ways to push the app to the cluster, where it can bind to the right version of things, and that may be the right-est way to think about this. I had envisioned running a stand-alone app on a machine that is not part of the cluster, that is a client of it, and this means packaging in the right Hadoop client dependencies, and Spark already declares how it wants to include these various Hadoop client versions -- it's more than just including hadoop-client -- so wanted to leverage that. Let's see if this actually turns out to be a broader request though. Spark master doesn't compile against hadoop-common trunk Key: SPARK-1518 URL: https://issues.apache.org/jira/browse/SPARK-1518 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Marcelo Vanzin Assignee: Colin Patrick McCabe Priority: Critical FSDataOutputStream::sync() has disappeared from trunk in Hadoop; FileLogger.scala is calling it. I've changed it locally to hsync() so I can compile the code, but haven't checked yet whether those are equivalent. hsync() seems to have been there forever, so it hopefully works with all versions Spark cares about. -- This message was sent by Atlassian JIRA (v6.2#6252)
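To make the suggestion above concrete, the sbt line below is purely illustrative of what depending on such an artifact could look like; the hadoop2 classifier is hypothetical, and no such classified artifact was published at the time of this discussion.
{code}
// Hypothetical: a Hadoop 2 flavor of spark-core published under a classifier.
libraryDependencies +=
  "org.apache.spark" % "spark-core_2.10" % "1.0.0" classifier "hadoop2"
{code}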
[jira] [Created] (SPARK-1973) Add randomSplit to JavaRDD (with tests, and tidy Java tests)
Sean Owen created SPARK-1973: Summary: Add randomSplit to JavaRDD (with tests, and tidy Java tests) Key: SPARK-1973 URL: https://issues.apache.org/jira/browse/SPARK-1973 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.0.0 Reporter: Sean Owen Priority: Minor I'd like to use randomSplit through the Java API, and would like to add a convenience wrapper for this method to JavaRDD. This is fairly trivial. (In fact, is the intent that JavaRDD not wrap every RDD method? and that sometimes users should just use JavaRDD.wrapRDD()?) Along the way, I added tests for it, and also touched up the Java API test style and behavior. This is maybe the more useful part of this small change. -- This message was sent by Atlassian JIRA (v6.2#6252)
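As a rough sketch of the convenience wrapper described above, under the assumption that it simply delegates to RDD.randomSplit and rewraps the pieces (illustrative only, not the committed implementation):
{code}
import org.apache.spark.api.java.JavaRDD
import org.apache.spark.rdd.RDD

import scala.reflect.ClassTag

// Enrichment exposing randomSplit with Java-friendly return types.
class JavaRDDRandomSplit[T: ClassTag](rdd: RDD[T]) {
  def randomSplit(weights: Array[Double], seed: Long): Array[JavaRDD[T]] =
    rdd.randomSplit(weights, seed).map(JavaRDD.fromRDD(_))
}
{code}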
[jira] [Commented] (SPARK-1518) Spark master doesn't compile against hadoop-common trunk
[ https://issues.apache.org/jira/browse/SPARK-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14013591#comment-14013591 ] Sean Owen commented on SPARK-1518: -- Sorry for one more message here to reply to Matei -- yes, it's the laptop use case, except I'd describe that as a not-uncommon production deployment! It's the embedded-client scenario. It is more than adding one hadoop-client dependency, because you need to emulate the excludes, etc. that Spark has, too. (But yeah, then it works.) I agree supporting a bunch of Hadoop versions gets painful as a result. This was why I was suggesting way up top that supporting old versions may become more trouble than it's worth. Spark master doesn't compile against hadoop-common trunk Key: SPARK-1518 URL: https://issues.apache.org/jira/browse/SPARK-1518 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Marcelo Vanzin Assignee: Colin Patrick McCabe Priority: Critical FSDataOutputStream::sync() has disappeared from trunk in Hadoop; FileLogger.scala is calling it. I've changed it locally to hsync() so I can compile the code, but haven't checked yet whether those are equivalent. hsync() seems to have been there forever, so it hopefully works with all versions Spark cares about. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-1974) Most examples fail at startup because spark.master is not set
Sean Owen created SPARK-1974: Summary: Most examples fail at startup because spark.master is not set Key: SPARK-1974 URL: https://issues.apache.org/jira/browse/SPARK-1974 Project: Spark Issue Type: Bug Components: Examples Affects Versions: 1.0.0 Reporter: Sean Owen Most example code has a few lines like: {code} val sparkConf = new SparkConf().setAppName(Foo) val sc = new SparkContext(sparkConf) {code} The SparkContext constructor throws a SparkException if spark.master is not set though, so this fails immediately. What would be preferred -- let spark.master default to local\[2\]? or change all examples to call: {code} new SparkContext(local[2], Foo) {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
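As an illustration of the first option above, here is a minimal sketch of what letting an example fall back to a local master could look like; this is just the idea, not the project's decided fix, and the app name and RDD are arbitrary.
{code}
import org.apache.spark.{SparkConf, SparkContext}

object ExampleApp {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setAppName("Foo")
    // Only default the master when none was supplied via spark-submit or system properties.
    if (!sparkConf.contains("spark.master")) {
      sparkConf.setMaster("local[2]")
    }
    val sc = new SparkContext(sparkConf)
    println(sc.parallelize(1 to 10).count())
    sc.stop()
  }
}
{code}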
[jira] [Commented] (SPARK-1998) SparkFlumeEvent with body bigger than 1020 bytes are not read properly
[ https://issues.apache.org/jira/browse/SPARK-1998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14016327#comment-14016327 ] Sean Owen commented on SPARK-1998: -- (Can you make a pull request rather than a patch here? This looks important; I know we have some Flume streaming users.) SparkFlumeEvent with body bigger than 1020 bytes are not read properly -- Key: SPARK-1998 URL: https://issues.apache.org/jira/browse/SPARK-1998 Project: Spark Issue Type: Bug Reporter: sun.sam Attachments: patch.diff -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (SPARK-1974) Most examples fail at startup because spark.master is not set
[ https://issues.apache.org/jira/browse/SPARK-1974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-1974. -- Resolution: Not a Problem Fix Version/s: (was: 1.0.1) Decision was to not modify examples, but to possibly set a spark.master default. See also https://issues.apache.org/jira/browse/SPARK-1906 Most examples fail at startup because spark.master is not set - Key: SPARK-1974 URL: https://issues.apache.org/jira/browse/SPARK-1974 Project: Spark Issue Type: Bug Components: Examples Affects Versions: 1.0.0 Reporter: Sean Owen Most example code has a few lines like: {code} val sparkConf = new SparkConf().setAppName(Foo) val sc = new SparkContext(sparkConf) {code} The SparkContext constructor throws a SparkException if spark.master is not set though, so this fails immediately. What would be preferred -- let spark.master default to local\[2\]? or change all examples to call: {code} new SparkContext(local[2], Foo) {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2018) Big-Endian (IBM Power7) Spark Serialization issue
[ https://issues.apache.org/jira/browse/SPARK-2018?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14017528#comment-14017528 ] Sean Owen commented on SPARK-2018: -- The meaning of the error is that Java thinks two serializable classes are not mutually compatible. This is because two different serialVersioUIDs get computed for two copies of what may be the same class. If I understand you correctly, you are communicating between different JVM versions, or reading one's output from the other? I don't think it's guaranteed that the auto-generated serialVersionUID will be the same. If so, it's nothing to do with big-endian-ness per se. Does it happen entirely within the same machine / JVM? Big-Endian (IBM Power7) Spark Serialization issue -- Key: SPARK-2018 URL: https://issues.apache.org/jira/browse/SPARK-2018 Project: Spark Issue Type: Bug Affects Versions: 1.0.0 Environment: hardware : IBM Power7 OS:Linux version 2.6.32-358.el6.ppc64 (mockbu...@ppc-017.build.eng.bos.redhat.com) (gcc version 4.4.7 20120313 (Red Hat 4.4.7-3) (GCC) ) #1 SMP Tue Jan 29 11:43:27 EST 2013 JDK: Java(TM) SE Runtime Environment (build pxp6470sr5-20130619_01(SR5)) IBM J9 VM (build 2.6, JRE 1.7.0 Linux ppc64-64 Compressed References 20130617_152572 (JIT enabled, AOT enabled) Hadoop:Hadoop-0.2.3-CDH5.0 Spark:Spark-1.0.0 or Spark-0.9.1 spark-env.sh: export JAVA_HOME=/opt/ibm/java-ppc64-70/ export SPARK_MASTER_IP=9.114.34.69 export SPARK_WORKER_MEMORY=1m export SPARK_CLASSPATH=/home/test1/spark-1.0.0-bin-hadoop2/lib export STANDALONE_SPARK_MASTER_HOST=9.114.34.69 #export SPARK_JAVA_OPTS=' -Xdebug -Xrunjdwp:transport=dt_socket,address=9,server=y,suspend=n ' Reporter: Yanjie Gao We have an application run on Spark on Power7 System . But we meet an important issue about serialization. The example HdfsWordCount can meet the problem. ./bin/run-example org.apache.spark.examples.streaming.HdfsWordCount localdir We used Power7 (Big-Endian arch) and Redhat 6.4. Big-Endian is the main cause since the example ran successfully in another Power-based Little Endian setup. here is the exception stack and log: Spark Executor Command: /opt/ibm/java-ppc64-70//bin/java -cp /home/test1/spark-1.0.0-bin-hadoop2/lib::/home/test1/src/spark-1.0.0-bin-hadoop2/conf:/home/test1/src/spark-1.0.0-bin-hadoop2/lib/spark-assembly-1.0.0-hadoop2.2.0.jar:/home/test1/src/spark-1.0.0-bin-hadoop2/lib/datanucleus-rdbms-3.2.1.jar:/home/test1/src/spark-1.0.0-bin-hadoop2/lib/datanucleus-api-jdo-3.2.1.jar:/home/test1/src/spark-1.0.0-bin-hadoop2/lib/datanucleus-core-3.2.2.jar:/home/test1/src/hadoop-2.3.0-cdh5.0.0/etc/hadoop/:/home/test1/src/hadoop-2.3.0-cdh5.0.0/etc/hadoop/ -XX:MaxPermSize=128m -Xdebug -Xrunjdwp:transport=dt_socket,address=9,server=y,suspend=n -Xms512M -Xmx512M org.apache.spark.executor.CoarseGrainedExecutorBackend akka.tcp://spark@9.186.105.141:60253/user/CoarseGrainedScheduler 2 p7hvs7br16 4 akka.tcp://sparkWorker@p7hvs7br16:59240/user/Worker app-20140604023054- 14/06/04 02:31:20 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... 
using builtin-java classes where applicable 14/06/04 02:31:21 INFO spark.SecurityManager: Changing view acls to: test1,yifeng 14/06/04 02:31:21 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(test1, yifeng) 14/06/04 02:31:22 INFO slf4j.Slf4jLogger: Slf4jLogger started 14/06/04 02:31:22 INFO Remoting: Starting remoting 14/06/04 02:31:22 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkExecutor@p7hvs7br16:39658] 14/06/04 02:31:22 INFO Remoting: Remoting now listens on addresses: [akka.tcp://sparkExecutor@p7hvs7br16:39658] 14/06/04 02:31:22 INFO executor.CoarseGrainedExecutorBackend: Connecting to driver: akka.tcp://spark@9.186.105.141:60253/user/CoarseGrainedScheduler 14/06/04 02:31:22 INFO worker.WorkerWatcher: Connecting to worker akka.tcp://sparkWorker@p7hvs7br16:59240/user/Worker 14/06/04 02:31:23 INFO worker.WorkerWatcher: Successfully connected to akka.tcp://sparkWorker@p7hvs7br16:59240/user/Worker 14/06/04 02:31:24 INFO executor.CoarseGrainedExecutorBackend: Successfully registered with driver 14/06/04 02:31:24 INFO spark.SecurityManager: Changing view acls to: test1,yifeng 14/06/04 02:31:24 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(test1, yifeng) 14/06/04 02:31:24 INFO slf4j.Slf4jLogger: Slf4jLogger started 14/06/04 02:31:24 INFO Remoting: Starting remoting 14/06/04 02:31:24 INFO Remoting: Remoting started; listening on addresses
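Related to the serialVersionUID point in the SPARK-2018 comment above, a tiny illustrative sketch (an assumption about one way to sidestep auto-generated IDs, not a statement about the reporter's classes): declaring the ID explicitly lets classes compiled or loaded on different JVMs agree on serialization compatibility.
{code}
// With an explicit serialVersionUID, two JVMs that would compute different automatic
// IDs for otherwise-compatible versions of this class still treat it as compatible.
@SerialVersionUID(1L)
class Record(val key: String, val count: Long) extends Serializable
{code}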
[jira] [Commented] (SPARK-2026) Maven hadoop* Profiles Should Set the expected Hadoop Version.
[ https://issues.apache.org/jira/browse/SPARK-2026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14018512#comment-14018512 ] Sean Owen commented on SPARK-2026: -- A few people have mentioned and asked for this, especially as it helps the build work cleanly in IntelliJ. FWIW I would like this change too. Do you have a PR? Maven hadoop* Profiles Should Set the expected Hadoop Version. Key: SPARK-2026 URL: https://issues.apache.org/jira/browse/SPARK-2026 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 1.0.0 Reporter: Bernardo Gomez Palacio The Maven profiles that refer to _hadoopX_, e.g. hadoop-2.4, should set the expected _hadoop.version_. E.g., instead of:
{code}
<profile>
  <id>hadoop-2.4</id>
  <properties>
    <protobuf.version>2.5.0</protobuf.version>
    <jets3t.version>0.9.0</jets3t.version>
  </properties>
</profile>
{code}
it is suggested:
{code}
<profile>
  <id>hadoop-2.4</id>
  <properties>
    <hadoop.version>2.4.0</hadoop.version>
    <yarn.version>${hadoop.version}</yarn.version>
    <protobuf.version>2.5.0</protobuf.version>
    <jets3t.version>0.9.0</jets3t.version>
  </properties>
</profile>
{code}
Builds can still define the -Dhadoop.version option, but this will correctly default the Hadoop version to the one expected according to the profile that is selected. E.g.
{code}
$ mvn -P hadoop-2.4,yarn clean compile
{code}
-- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2019) Spark workers die/disappear when job fails for nearly any reason
[ https://issues.apache.org/jira/browse/SPARK-2019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14018760#comment-14018760 ] Sean Owen commented on SPARK-2019: -- I believe that's coming with 5.1 but I don't know when that is scheduled. We can talk about issues like this offline -- really your best bet is support anyway. Spark workers die/disappear when job fails for nearly any reason Key: SPARK-2019 URL: https://issues.apache.org/jira/browse/SPARK-2019 Project: Spark Issue Type: Bug Affects Versions: 0.9.0 Reporter: sam We either have to reboot all the nodes, or run 'sudo service spark-worker restart' across our cluster. I don't think this should happen - the job failures are often not even that bad. There is a 5 upvoted SO question here: http://stackoverflow.com/questions/22031006/spark-0-9-0-worker-keeps-dying-in-standalone-mode-when-job-fails We shouldn't be giving restart privileges to our devs, and therefore our sysadm has to frequently restart the workers. When the sysadm is not around, there is nothing our devs can do. Many thanks -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2090) spark-shell input text entry not showing on REPL
[ https://issues.apache.org/jira/browse/SPARK-2090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14026354#comment-14026354 ] Sean Owen commented on SPARK-2090: -- I'm assuming it's specific to your env or config as I don't see this behavior, and haven't in the past, and assume others aren't seeing it. Have you enabled the SecurityManager? if so what settings? are there other errors? It looks like SimpleReader doesn't echo input, so that's that, but the question is what caused Permission denied from the standard SparkJLineReader. Maybe you can temporarily modify that error log message to log the whole stack trace to see what it is? spark-shell input text entry not showing on REPL Key: SPARK-2090 URL: https://issues.apache.org/jira/browse/SPARK-2090 Project: Spark Issue Type: Bug Components: Input/Output, Spark Core Affects Versions: 1.0.0 Environment: Ubuntu 14.04; Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_60) Reporter: Richard Conway Priority: Critical Labels: easyfix, patch Fix For: 1.0.0 Original Estimate: 4h Remaining Estimate: 4h spark-shell doesn't allow text to be displayed on input Failed to created SparkJLineReader: java.io.IOException: Permission denied Falling back to SimpleReader. The driver has 2 workers on 2 virtual machines and error free apart from the above line so I think it may have something to do with the introduction of the new SecurityManager. The upshot is that when you type nothing is displayed on the screen. For example, type test at the scala prompt and you won't see the input but the output will show. scala console:11: error: package test is not a value test ^ -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2103) Java + Kafka + Spark Streaming NoSuchMethodError in java.lang.Object.init
Sean Owen created SPARK-2103: Summary: Java + Kafka + Spark Streaming NoSuchMethodError in java.lang.Object.init Key: SPARK-2103 URL: https://issues.apache.org/jira/browse/SPARK-2103 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.0.0 Reporter: Sean Owen This has come up a few times, from user venki-kratos: http://apache-spark-user-list.1001560.n3.nabble.com/NoSuchMethodError-in-KafkaReciever-td2209.html and I ran into it a few weeks ago: http://mail-archives.apache.org/mod_mbox/spark-dev/201405.mbox/%3ccamassdlzs6ihctxepusphryxxa-wp26zgbxx83sm6niro0q...@mail.gmail.com%3E and yesterday user mpieck: {quote} When I use the createStream method from the example class like this: KafkaUtils.createStream(jssc, zookeeper:port, test, topicMap); everything is working fine, but when I explicitely specify message decoder classes used in this method with another overloaded createStream method: KafkaUtils.createStream(jssc, String.class, String.class, StringDecoder.class, StringDecoder.class, props, topicMap, StorageLevels.MEMORY_AND_DISK_2); the applications stops with an error: 14/06/10 22:28:06 ERROR kafka.KafkaReceiver: Error receiving data java.lang.NoSuchMethodException: java.lang.Object.init(kafka.utils.VerifiableProperties) at java.lang.Class.getConstructor0(Unknown Source) at java.lang.Class.getConstructor(Unknown Source) at org.apache.spark.streaming.kafka.KafkaReceiver.onStart(KafkaInputDStream.scala:108) at org.apache.spark.streaming.dstream.NetworkReceiver.start(NetworkInputDStream.scala:126) {quote} Something is making it try to instantiate java.lang.Object as if it's a Decoder class. I suspect that the problem is to do with https://github.com/apache/spark/blob/master/external/kafka/src/main/scala/org/apache/spark/streaming/kafka/KafkaUtils.scala#L148 {code} implicit val keyCmd: Manifest[U] = implicitly[Manifest[AnyRef]].asInstanceOf[Manifest[U]] implicit val valueCmd: Manifest[T] = implicitly[Manifest[AnyRef]].asInstanceOf[Manifest[T]] {code} ... where U and T are key/value Decoder types. I don't know enough Scala to fully understand this, but is it possible this causes the reflective call later to lose the type and try to instantiate Object? The AnyRef made me wonder. I am sorry to say I don't have a PR to suggest at this point. -- This message was sent by Atlassian JIRA (v6.2#6252)
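As a way to check the suspicion in SPARK-2103 above, the following standalone Scala snippet illustrates the mechanism being guessed at (an assumption about why the reflective lookup sees java.lang.Object, not a confirmed diagnosis of the Spark code): a Manifest[AnyRef] cast to another Manifest type still reports Object as its runtime class.
{code}
object ManifestErasureDemo {
  // Returns the runtime class recorded in the (possibly bogus) Manifest.
  def runtimeClassOf[T](implicit m: Manifest[T]): Class[_] = m.runtimeClass

  def main(args: Array[String]): Unit = {
    // The same trick as in KafkaUtils: pretend a Manifest[AnyRef] is a Manifest[String].
    val bogus = implicitly[Manifest[AnyRef]].asInstanceOf[Manifest[String]]
    println(runtimeClassOf(bogus))   // class java.lang.Object -- the String type is lost
    println(runtimeClassOf[String])  // class java.lang.String
  }
}
{code}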
[jira] [Commented] (SPARK-2100) Allow users to disable Jetty Spark UI in local mode
[ https://issues.apache.org/jira/browse/SPARK-2100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14032803#comment-14032803 ] Sean Owen commented on SPARK-2100: -- Tomcat and Jetty classes don't overlap -- do you mean the Servlet API classes? that's a different known issue. Allow users to disable Jetty Spark UI in local mode --- Key: SPARK-2100 URL: https://issues.apache.org/jira/browse/SPARK-2100 Project: Spark Issue Type: Improvement Reporter: DB Tsai Since we want to use Spark hadoop APIs in local mode for design time to explore the first couple hundred lines of data in HDFS. Also, we want to use Spark in our tomcat application, so starting a jetty UI will make our tomcat unhappy. In those scenarios, Spark UI is not necessary, and wasting resource. As a result, for local mode, it's desirable that users are able to disable the spark UI. Couple places I found where the jetty will be started. In SparkEnv.scala 1) val broadcastManager = new BroadcastManager(isDriver, conf, securityManager) 2) val httpFileServer = new HttpFileServer(securityManager) httpFileServer.initialize() I don't know if broadcastManager is needed in local mode tho. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Comment Edited] (SPARK-2100) Allow users to disable Jetty Spark UI in local mode
[ https://issues.apache.org/jira/browse/SPARK-2100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14033055#comment-14033055 ] Sean Owen edited comment on SPARK-2100 at 6/16/14 9:55 PM: --- Yes, the Maven build has to do a little work to exclude copies of the Servlet 2.x API. Spark ends up including one copy of the Servlet 3.0 APIs, which should make everybody happy. But if your build brings back in something else, and it's bringing its own Servlet API, you may need to exclude it. (This dependency is super annoying because different containers have distributed the same classes in different artifacts.) Advert break: SPARK-1949 fixes this type of issue for Spark's own SBT-based build. Not exactly the issue here but related, and would be cool to get it committed. https://issues.apache.org/jira/browse/SPARK-1949 was (Author: srowen): Yes, the Maven build has to do a little work to exclude copies of the Servlet 2.x API. Spark ends up including one copy of the Servlet 3.0 APIs, which should everybody happing. But if your build brings back in something else, and it's bringing its own Servlet API, you may need to exclude it. (This dependency is super annoying because different containers have distributed the same classes in different artifacts.) Advert break: SPARK-1949 fixes this type of issue for Spark's own SBT-based build. Not exactly the issue here but related, and would be cool to get it committed. https://issues.apache.org/jira/browse/SPARK-1949 Allow users to disable Jetty Spark UI in local mode --- Key: SPARK-2100 URL: https://issues.apache.org/jira/browse/SPARK-2100 Project: Spark Issue Type: Improvement Reporter: DB Tsai Since we want to use Spark hadoop APIs in local mode for design time to explore the first couple hundred lines of data in HDFS. Also, we want to use Spark in our tomcat application, so starting a jetty UI will make our tomcat unhappy. In those scenarios, Spark UI is not necessary, and wasting resource. As a result, for local mode, it's desirable that users are able to disable the spark UI. Couple places I found where the jetty will be started. In SparkEnv.scala 1) val broadcastManager = new BroadcastManager(isDriver, conf, securityManager) 2) val httpFileServer = new HttpFileServer(securityManager) httpFileServer.initialize() I don't know if broadcastManager is needed in local mode tho. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2100) Allow users to disable Jetty Spark UI in local mode
[ https://issues.apache.org/jira/browse/SPARK-2100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14033055#comment-14033055 ] Sean Owen commented on SPARK-2100: -- Yes, the Maven build has to do a little work to exclude copies of the Servlet 2.x API. Spark ends up including one copy of the Servlet 3.0 APIs, which should make everybody happy. But if your build brings back in something else, and it's bringing its own Servlet API, you may need to exclude it. (This dependency is super annoying because different containers have distributed the same classes in different artifacts.) Advert break: SPARK-1949 fixes this type of issue for Spark's own SBT-based build. Not exactly the issue here but related, and would be cool to get it committed. https://issues.apache.org/jira/browse/SPARK-1949 Allow users to disable Jetty Spark UI in local mode --- Key: SPARK-2100 URL: https://issues.apache.org/jira/browse/SPARK-2100 Project: Spark Issue Type: Improvement Reporter: DB Tsai Since we want to use Spark hadoop APIs in local mode for design time to explore the first couple hundred lines of data in HDFS. Also, we want to use Spark in our tomcat application, so starting a jetty UI will make our tomcat unhappy. In those scenarios, Spark UI is not necessary, and wasting resource. As a result, for local mode, it's desirable that users are able to disable the spark UI. Couple places I found where the jetty will be started. In SparkEnv.scala 1) val broadcastManager = new BroadcastManager(isDriver, conf, securityManager) 2) val httpFileServer = new HttpFileServer(securityManager) httpFileServer.initialize() I don't know if broadcastManager is needed in local mode tho. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2160) error of Decision tree algorithm in Spark MLlib
[ https://issues.apache.org/jira/browse/SPARK-2160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14033550#comment-14033550 ] Sean Owen commented on SPARK-2160: -- You already added this as https://issues.apache.org/jira/browse/SPARK-2152 right? error of Decision tree algorithm in Spark MLlib -- Key: SPARK-2160 URL: https://issues.apache.org/jira/browse/SPARK-2160 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.0.0 Reporter: caoli Labels: patch Fix For: 1.1.0 Original Estimate: 4h Remaining Estimate: 4h The computation of rightNodeAgg in the decision tree algorithm in Spark MLlib is wrong: in the function extractLeftRightNodeAggregates(), the binData index used when computing rightNodeAgg is incorrect. In DecisionTree.scala, around line 980: rightNodeAgg(featureIndex)(2 * (numBins - 2 - splitIndex)) = binData(shift + (2 * (numBins - 2 - splitIndex))) + rightNodeAgg(featureIndex)(2 * (numBins - 1 - splitIndex)) The index computed as binData(shift + (2 * (numBins - 2 - splitIndex))) is wrong, so the resulting rightNodeAgg contains repeated data from the bins. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2223) Building and running tests with maven is extremely slow
[ https://issues.apache.org/jira/browse/SPARK-2223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14038944#comment-14038944 ] Sean Owen commented on SPARK-2223: -- Is it not just because there are a lot of tests, and they generally can't be run in parallel? I agree, an hour for tests is really long, but I'm not sure if there is a problem except having lots of tests. You can run subsets of tests with Maven though to test targeted changes. Building and running tests with maven is extremely slow --- Key: SPARK-2223 URL: https://issues.apache.org/jira/browse/SPARK-2223 Project: Spark Issue Type: Bug Components: Build Affects Versions: 1.0.0 Reporter: Thomas Graves For some reason using Maven with Spark is extremely slow. Building and running tests takes way longer than other projects I have used that use Maven. We should investigate to see why. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2224) allow running tests for one sub module
[ https://issues.apache.org/jira/browse/SPARK-2224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14038956#comment-14038956 ] Sean Owen commented on SPARK-2224: -- mvn test -pl [module], no? allow running tests for one sub module -- Key: SPARK-2224 URL: https://issues.apache.org/jira/browse/SPARK-2224 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 1.0.0 Reporter: Thomas Graves We should have a way to run just the unit tests in a submodule (like core or yarn, etc.). One way would be to support changing directories into the submodule and running mvn test from there. -- This message was sent by Atlassian JIRA (v6.2#6252)
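To make the command above concrete, a couple of hedged examples of scoping a Maven run to one module; "core" is just an illustrative module name.
{code}
# Run only the tests in the core module
mvn test -pl core

# Same, but also build the modules core depends on first
mvn test -pl core -am
{code}
Both -pl (projects list) and -am (also make) are standard Maven reactor options.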
[jira] [Commented] (SPARK-1568) Spark 0.9.0 hangs reading s3
[ https://issues.apache.org/jira/browse/SPARK-1568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14039797#comment-14039797 ] Sean Owen commented on SPARK-1568: -- Sam, did the other recent changes to S3 deps resolve this, do you think? Spark 0.9.0 hangs reading s3 Key: SPARK-1568 URL: https://issues.apache.org/jira/browse/SPARK-1568 Project: Spark Issue Type: Bug Reporter: sam I've tried several jobs now and many of the tasks complete, then it gets stuck and just hangs. The exact same jobs function perfectly fine if I distcp to HDFS first and read from HDFS. Many thanks -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2223) Building and running tests with maven is extremely slow
[ https://issues.apache.org/jira/browse/SPARK-2223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14039802#comment-14039802 ] Sean Owen commented on SPARK-2223: -- On a latest-generation MacBook Pro here, a full 'mvn clean install' takes 91:50 without zinc. With zinc, it's 51:02. Building and running tests with maven is extremely slow --- Key: SPARK-2223 URL: https://issues.apache.org/jira/browse/SPARK-2223 Project: Spark Issue Type: Bug Components: Build Affects Versions: 1.0.0 Reporter: Thomas Graves For some reason using Maven with Spark is extremely slow. Building and running tests takes way longer than other projects I have used that use Maven. We should investigate to see why. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1339) Build error: org.eclipse.paho:mqtt-client
[ https://issues.apache.org/jira/browse/SPARK-1339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14039804#comment-14039804 ] Sean Owen commented on SPARK-1339: -- (Just cruising some old issues) I can't reproduce this, and this is a general symptom of a repo not being accessible. It's actually nothing to do with mqtt-client per se. Also, we've fixed some repo issues along the way. Build error: org.eclipse.paho:mqtt-client - Key: SPARK-1339 URL: https://issues.apache.org/jira/browse/SPARK-1339 Project: Spark Issue Type: Bug Components: Build Affects Versions: 0.9.0 Reporter: Ken Williams Using Maven, I'm unable to build the 0.9.0 distribution I just downloaded. The Maven error is: {code} [ERROR] Failed to execute goal on project spark-examples_2.10: Could not resolve dependencies for project org.apache.spark:spark-examples_2.10:jar:0.9.0-incubating: Could not find artifact org.eclipse.paho:mqtt-client:jar:0.4.0 in nexus {code} My Maven version is 3.2.1, running on Java 1.7.0, using Scala 2.10.4. Is there an additional Maven repository I should add or something? If I go into the {{pom.xml}} and comment out the {{external/mqtt}} and {{examples}} modules, the build succeeds. I'm fine without the MQTT stuff, but I would really like to get the examples working because I haven't played with Spark before. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1138) Spark 0.9.0 does not work with Hadoop / HDFS
[ https://issues.apache.org/jira/browse/SPARK-1138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14039805#comment-14039805 ] Sean Owen commented on SPARK-1138: -- This is no longer observed in the unit tests. The comments here say it was a Netty dependency problem, and I know that has since been cleaned up. Suggest this is resolved then? Spark 0.9.0 does not work with Hadoop / HDFS Key: SPARK-1138 URL: https://issues.apache.org/jira/browse/SPARK-1138 Project: Spark Issue Type: Bug Reporter: Sam Abeyratne UPDATE: This problem is certainly related to trying to use Spark 0.9.0 and the latest cloudera Hadoop / HDFS in the same jar. It seems no matter how I fiddle with the deps, the do not play nice together. I'm getting a java.util.concurrent.TimeoutException when trying to create a spark context with 0.9. I cannot, whatever I do, change the timeout. I've tried using System.setProperty, the SparkConf mechanism of creating a SparkContext and the -D flags when executing my jar. I seem to be able to run simple jobs from the spark-shell OK, but my more complicated jobs require external libraries so I need to build jars and execute them. Some code that causes this: println(Creating config) val conf = new SparkConf() .setMaster(clusterMaster) .setAppName(MyApp) .setSparkHome(sparkHome) .set(spark.akka.askTimeout, parsed.getOrElse(timeouts, 100)) .set(spark.akka.timeout, parsed.getOrElse(timeouts, 100)) println(Creating sc) implicit val sc = new SparkContext(conf) The output: Creating config Creating sc log4j:WARN No appenders could be found for logger (akka.event.slf4j.Slf4jLogger). log4j:WARN Please initialize the log4j system properly. log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info. [ERROR] [02/26/2014 11:05:25.491] [main] [Remoting] Remoting error: [Startup timed out] [ akka.remote.RemoteTransportException: Startup timed out at akka.remote.Remoting.akka$remote$Remoting$$notifyError(Remoting.scala:129) at akka.remote.Remoting.start(Remoting.scala:191) at akka.remote.RemoteActorRefProvider.init(RemoteActorRefProvider.scala:184) at akka.actor.ActorSystemImpl._start$lzycompute(ActorSystem.scala:579) at akka.actor.ActorSystemImpl._start(ActorSystem.scala:577) at akka.actor.ActorSystemImpl.start(ActorSystem.scala:588) at akka.actor.ActorSystem$.apply(ActorSystem.scala:111) at akka.actor.ActorSystem$.apply(ActorSystem.scala:104) at org.apache.spark.util.AkkaUtils$.createActorSystem(AkkaUtils.scala:96) at org.apache.spark.SparkEnv$.create(SparkEnv.scala:126) at org.apache.spark.SparkContext.init(SparkContext.scala:139) at com.adbrain.accuracy.EvaluateAdtruthIDs$.main(EvaluateAdtruthIDs.scala:40) at com.adbrain.accuracy.EvaluateAdtruthIDs.main(EvaluateAdtruthIDs.scala) Caused by: java.util.concurrent.TimeoutException: Futures timed out after [1 milliseconds] at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219) at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223) at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107) at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53) at scala.concurrent.Await$.result(package.scala:107) at akka.remote.Remoting.start(Remoting.scala:173) ... 
11 more ] Exception in thread main java.util.concurrent.TimeoutException: Futures timed out after [1 milliseconds] at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219) at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223) at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107) at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53) at scala.concurrent.Await$.result(package.scala:107) at akka.remote.Remoting.start(Remoting.scala:173) at akka.remote.RemoteActorRefProvider.init(RemoteActorRefProvider.scala:184) at akka.actor.ActorSystemImpl._start$lzycompute(ActorSystem.scala:579) at akka.actor.ActorSystemImpl._start(ActorSystem.scala:577) at akka.actor.ActorSystemImpl.start(ActorSystem.scala:588) at akka.actor.ActorSystem$.apply(ActorSystem.scala:111) at akka.actor.ActorSystem$.apply(ActorSystem.scala:104) at org.apache.spark.util.AkkaUtils$.createActorSystem(AkkaUtils.scala:96) at org.apache.spark.SparkEnv$.create(SparkEnv.scala:126) at org.apache.spark.SparkContext.init(SparkContext.scala:139) at com.adbrain.accuracy.EvaluateAdtruthIDs$.main(EvaluateAdtruthIDs.scala:40) at
[jira] [Commented] (SPARK-1675) Make clear whether computePrincipalComponents centers data
[ https://issues.apache.org/jira/browse/SPARK-1675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14039944#comment-14039944 ] Sean Owen commented on SPARK-1675: -- Is this still valid? Looking at the code, PCA is computed as the SVD of the covariance matrix, so the means implicitly don't matter: they are not explicitly subtracted, and they do not affect the result. Or is there still a doc change desired? Make clear whether computePrincipalComponents centers data -- Key: SPARK-1675 URL: https://issues.apache.org/jira/browse/SPARK-1675 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.0.0 Reporter: Sandy Ryza Assignee: Sandy Ryza -- This message was sent by Atlassian JIRA (v6.2#6252)
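For readers wondering why the means can be ignored, the point is that centering is already built into the covariance matrix itself; a short sketch in generic notation, not code or formulas taken from MLlib:
{code}
C = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(x_i - \bar{x})^\top
  = \frac{1}{n-1} \Big( \sum_{i=1}^{n} x_i x_i^\top - n\,\bar{x}\bar{x}^\top \Big)
{code}
So an implementation that forms C from the Gramian and the column means, and then takes its SVD or eigendecomposition, has effectively centered the data without ever subtracting the mean from each row.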
[jira] [Commented] (SPARK-1846) RAT checks should exclude logs/ directory
[ https://issues.apache.org/jira/browse/SPARK-1846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14039945#comment-14039945 ] Sean Owen commented on SPARK-1846: -- Just looking over some old JIRAs. This appears to be resolved already. logs is excluded. RAT checks should exclude logs/ directory - Key: SPARK-1846 URL: https://issues.apache.org/jira/browse/SPARK-1846 Project: Spark Issue Type: Bug Components: Build Affects Versions: 1.0.0 Reporter: Andrew Ash When there are logs in the logs/ directory, the rat check from ./dev/check-license fails. ``` aash@aash-mbp ~/git/spark$ find logs -type f logs/spark-aash-org.apache.spark.deploy.master.Master-1-aash-mbp.local.out logs/spark-aash-org.apache.spark.deploy.master.Master-1-aash-mbp.local.out.1 logs/spark-aash-org.apache.spark.deploy.master.Master-1-aash-mbp.local.out.2 logs/spark-aash-org.apache.spark.deploy.master.Master-1-aash-mbp.local.out.3 logs/spark-aash-org.apache.spark.deploy.master.Master-1-aash-mbp.local.out.4 logs/spark-aash-org.apache.spark.deploy.master.Master-1-aash-mbp.local.out.5 logs/spark-aash-org.apache.spark.deploy.worker.Worker--aash-mbp.local.out logs/spark-aash-org.apache.spark.deploy.worker.Worker--aash-mbp.local.out.1 logs/spark-aash-org.apache.spark.deploy.worker.Worker-1-aash-mbp.local.out logs/spark-aash-org.apache.spark.deploy.worker.Worker-1-aash-mbp.local.out.1 logs/spark-aash-org.apache.spark.deploy.worker.Worker-1-aash-mbp.local.out.2 logs/spark-aash-org.apache.spark.deploy.worker.Worker-1-aash-mbp.local.out.3 logs/spark-aash-org.apache.spark.deploy.worker.Worker-1-aash-mbp.local.out.4 logs/spark-aash-org.apache.spark.deploy.worker.Worker-1-aash-mbp.local.out.5 logs/spark-aash-spark.deploy.master.Master-1-aash-mbp.local.out logs/spark-aash-spark.deploy.master.Master-1-aash-mbp.local.out.1 logs/spark-aash-spark.deploy.master.Master-1-aash-mbp.local.out.2 logs/spark-aash-spark.deploy.worker.Worker-1-aash-mbp.local.out logs/spark-aash-spark.deploy.worker.Worker-1-aash-mbp.local.out.1 logs/spark-aash-spark.deploy.worker.Worker-1-aash-mbp.local.out.2 aash@aash-mbp ~/git/spark$ ./dev/check-license Could not find Apache license headers in the following files: !? /Users/aash/git/spark/logs/spark-aash-org.apache.spark.deploy.master.Master-1-aash-mbp.local.out !? /Users/aash/git/spark/logs/spark-aash-org.apache.spark.deploy.master.Master-1-aash-mbp.local.out.1 !? /Users/aash/git/spark/logs/spark-aash-org.apache.spark.deploy.master.Master-1-aash-mbp.local.out.2 !? /Users/aash/git/spark/logs/spark-aash-org.apache.spark.deploy.master.Master-1-aash-mbp.local.out.3 !? /Users/aash/git/spark/logs/spark-aash-org.apache.spark.deploy.master.Master-1-aash-mbp.local.out.4 !? /Users/aash/git/spark/logs/spark-aash-org.apache.spark.deploy.master.Master-1-aash-mbp.local.out.5 !? /Users/aash/git/spark/logs/spark-aash-org.apache.spark.deploy.worker.Worker--aash-mbp.local.out !? /Users/aash/git/spark/logs/spark-aash-org.apache.spark.deploy.worker.Worker--aash-mbp.local.out.1 !? /Users/aash/git/spark/logs/spark-aash-org.apache.spark.deploy.worker.Worker-1-aash-mbp.local.out !? /Users/aash/git/spark/logs/spark-aash-org.apache.spark.deploy.worker.Worker-1-aash-mbp.local.out.1 !? /Users/aash/git/spark/logs/spark-aash-org.apache.spark.deploy.worker.Worker-1-aash-mbp.local.out.2 !? /Users/aash/git/spark/logs/spark-aash-org.apache.spark.deploy.worker.Worker-1-aash-mbp.local.out.3 !? /Users/aash/git/spark/logs/spark-aash-org.apache.spark.deploy.worker.Worker-1-aash-mbp.local.out.4 !? 
/Users/aash/git/spark/logs/spark-aash-org.apache.spark.deploy.worker.Worker-1-aash-mbp.local.out.5 !? /Users/aash/git/spark/logs/spark-aash-spark.deploy.master.Master-1-aash-mbp.local.out !? /Users/aash/git/spark/logs/spark-aash-spark.deploy.master.Master-1-aash-mbp.local.out.1 !? /Users/aash/git/spark/logs/spark-aash-spark.deploy.master.Master-1-aash-mbp.local.out.2 !? /Users/aash/git/spark/logs/spark-aash-spark.deploy.worker.Worker-1-aash-mbp.local.out !? /Users/aash/git/spark/logs/spark-aash-spark.deploy.worker.Worker-1-aash-mbp.local.out.1 !? /Users/aash/git/spark/logs/spark-aash-spark.deploy.worker.Worker-1-aash-mbp.local.out.2 aash@aash-mbp ~/git/spark$ ``` -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1804) Mark 0.9.1 as released in JIRA
[ https://issues.apache.org/jira/browse/SPARK-1804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14039947#comment-14039947 ] Sean Owen commented on SPARK-1804: -- Looks like this can be closed as resolved. https://issues.apache.org/jira/browse/SPARK/?selectedTab=com.atlassian.jira.jira-projects-plugin:versions-panel Mark 0.9.1 as released in JIRA -- Key: SPARK-1804 URL: https://issues.apache.org/jira/browse/SPARK-1804 Project: Spark Issue Type: Task Components: Documentation, Project Infra Affects Versions: 0.9.1 Reporter: Stevo Slavic Priority: Trivial 0.9.1 has been released but is labeled as unreleased in SPARK JIRA project. Please have it marked as released. Also please document that step in release process. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1803) Rename test resources to be compatible with Windows FS
[ https://issues.apache.org/jira/browse/SPARK-1803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14039948#comment-14039948 ] Sean Owen commented on SPARK-1803: -- PR was committed so this is another that seems to be closeable. Rename test resources to be compatible with Windows FS -- Key: SPARK-1803 URL: https://issues.apache.org/jira/browse/SPARK-1803 Project: Spark Issue Type: Task Components: Windows Affects Versions: 0.9.1 Reporter: Stevo Slavic Priority: Trivial {{git clone}} of master branch and then {{git status}} on Windows reports untracked files: {noformat} # Untracked files: # (use git add file... to include in what will be committed) # # sql/hive/src/test/resources/golden/Column pruning # sql/hive/src/test/resources/golden/Partition pruning # sql/hive/src/test/resources/golden/Partiton pruning {noformat} Actual issue is that several files under {{sql/hive/src/test/resources/golden}} directory have colon in name which is invalid character in file name on Windows. Please have these files renamed to a Windows compatible file name. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1046) Enable to build behind a proxy.
[ https://issues.apache.org/jira/browse/SPARK-1046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14039950#comment-14039950 ] Sean Owen commented on SPARK-1046: -- Is this stale / resolved? I don't see this in the code at this point. Enable to build behind a proxy. --- Key: SPARK-1046 URL: https://issues.apache.org/jira/browse/SPARK-1046 Project: Spark Issue Type: Bug Components: Build Affects Versions: 0.8.1 Reporter: Kousuke Saruta Priority: Minor I tried to build spark-0.8.1 behind a proxy and failed although I set http/https.proxyHost, proxyPort, proxyUser, proxyPassword. I found it's caused by accessing GitHub using the git protocol (git://). The URL is hard-coded in SparkPluginBuild.scala as follows. {code} lazy val junitXmlListener = uri("git://github.com/ijuma/junit_xml_listener.git#fe434773255b451a38e8d889536ebc260f4225ce") {code} After I rewrote the URL as follows, I could build successfully. {code} lazy val junitXmlListener = uri("https://github.com/ijuma/junit_xml_listener.git#fe434773255b451a38e8d889536ebc260f4225ce") {code} I think we should be able to build whether we are behind a proxy or not. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-721) Fix remaining deprecation warnings
[ https://issues.apache.org/jira/browse/SPARK-721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14039951#comment-14039951 ] Sean Owen commented on SPARK-721: - This appears to be resolved as I don't think these warnings have been in the build for a while. Fix remaining deprecation warnings -- Key: SPARK-721 URL: https://issues.apache.org/jira/browse/SPARK-721 Project: Spark Issue Type: Improvement Affects Versions: 0.7.1 Reporter: Josh Rosen Assignee: Gary Struthers Priority: Minor Labels: Starter The recent patch to re-enable deprecation warnings fixed many of them, but there's still a few left; it would be nice to fix them. For example, here's one in RDDSuite: {code} [warn] /Users/joshrosen/Documents/spark/spark/core/src/test/scala/spark/RDDSuite.scala:32: method mapPartitionsWithSplit in class RDD is deprecated: use mapPartitionsWithIndex [warn] val partitionSumsWithSplit = nums.mapPartitionsWithSplit { [warn] ^ [warn] one warning found {code} Also, it looks like Scala 2.9 added a second deprecatedSince parameter to @Deprecated. We didn't fill this in, which causes some additional warnings: {code} [warn] /Users/joshrosen/Documents/spark/spark/core/src/main/scala/spark/RDD.scala:370: @deprecated now takes two arguments; see the scaladoc. [warn] @deprecated(use mapPartitionsWithIndex) [warn]^ [warn] one warning found {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
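As a hedged sketch of the fix for the second kind of warning quoted above, Scala's @deprecated annotation takes a message and a "since" version; the names and version string below are invented for illustration, not taken from Spark's source.
{code}
object DeprecationExample {
  // Supplying both arguments avoids the "@deprecated now takes two arguments" warning.
  @deprecated("use newName instead", "0.7.1")
  def oldName(x: Int): Int = newName(x)

  def newName(x: Int): Int = x + 1
}
{code}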
[jira] [Commented] (SPARK-1996) Remove use of special Maven repo for Akka
[ https://issues.apache.org/jira/browse/SPARK-1996?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14039965#comment-14039965 ] Sean Owen commented on SPARK-1996: -- PR: https://github.com/apache/spark/pull/1170 Remove use of special Maven repo for Akka - Key: SPARK-1996 URL: https://issues.apache.org/jira/browse/SPARK-1996 Project: Spark Issue Type: Improvement Components: Documentation, Spark Core Reporter: Matei Zaharia Fix For: 1.0.1 According to http://doc.akka.io/docs/akka/2.3.3/intro/getting-started.html Akka is now published to Maven Central, so our documentation and POM files don't need to use the old Akka repo. It will be one less step for users to worry about. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1316) Remove use of Commons IO
[ https://issues.apache.org/jira/browse/SPARK-1316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14039991#comment-14039991 ] Sean Owen commented on SPARK-1316: -- PR: https://github.com/apache/spark/pull/1173 Actually, Commons IO is not even a dependency right now. Remove use of Commons IO Key: SPARK-1316 URL: https://issues.apache.org/jira/browse/SPARK-1316 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 0.9.0 Reporter: Sean Owen Priority: Minor (This follows from a side point on SPARK-1133, in discussion of the PR: https://github.com/apache/spark/pull/164 ) Commons IO is barely used in the project, and can easily be replaced with equivalent calls to Guava or the existing Spark Utils.scala class. Removing a dependency feels good, and this one in particular can get a little problematic since Hadoop uses it too. -- This message was sent by Atlassian JIRA (v6.2#6252)
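A hedged sketch of the kind of substitution the issue proposes, replacing a Commons IO call with its Guava equivalent; the file name is made up and the actual call sites in Spark may have looked different.
{code}
import java.io.File
import com.google.common.base.Charsets
import com.google.common.io.Files

// Commons IO version: FileUtils.readFileToString(new File("example.txt"), "UTF-8")
// Guava equivalent, so the Commons IO dependency can be dropped:
val contents: String = Files.toString(new File("example.txt"), Charsets.UTF_8)
{code}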
[jira] [Commented] (SPARK-2249) how to convert rows of schemaRdd into HashMaps
[ https://issues.apache.org/jira/browse/SPARK-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14041766#comment-14041766 ] Sean Owen commented on SPARK-2249: -- This is where issues are reported, rather than where questions are asked. I think this should be closed. Use the user@ mailing list instead. how to convert rows of schemaRdd into HashMaps -- Key: SPARK-2249 URL: https://issues.apache.org/jira/browse/SPARK-2249 Project: Spark Issue Type: Question Reporter: jackielihf spark 1.0.0 how to convert rows of schemaRdd into HashMaps using column names as keys? -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2257) The algorithm of ALS in mlib lacks a parameter
[ https://issues.apache.org/jira/browse/SPARK-2257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14042025#comment-14042025 ] Sean Owen commented on SPARK-2257: -- I don't think this is a bug, in the sense that it is just a different formulation of ALS. It's in the ALS-WR paper, but not the more well-known Hu/Koren/Volinsky paper. This is weighted regularization and it does help in some cases. In fact, it's already implemented in MLlib, although went in just after 1.0.0: https://github.com/apache/spark/commit/a6e0afdcf0174425e8a6ff20b2bc2e3a7a374f19#diff-2b593e0b4bd6eddab37f04968baa826c I think this is therefore already implemented. The algorithm of ALS in mlib lacks a parameter --- Key: SPARK-2257 URL: https://issues.apache.org/jira/browse/SPARK-2257 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.0.0 Environment: spark 1.0 Reporter: zhengbing li Labels: patch Fix For: 1.1.0 Original Estimate: 336h Remaining Estimate: 336h When I test ALS algorithm using netflix data, I find I cannot get the acurate results declared by the paper. The best MSE value is 0.9066300038109709(RMSE 0.952), which is worse than the paper's result. If I increase the number of features or the number of iterations, I will get a worse result. After I studing the paper and source code, I find a bug in the updateBlock function of ALS. orgin code is: while (i rank) { // --- fullXtX.data(i * rank + i) += lambda i += 1 } The code doesn't consider the number of products that one user rates. So this code should be modified: while (i rank) { //ratingsNum(index) equals the number of products that a user rates fullXtX.data(i * rank + i) += lambda * ratingsNum(index) i += 1 } After I modify code, the MSE value has been decreased, this is one test result conditions: val numIterations =20 val features = 30 val model = ALS.train(trainRatings,features, numIterations, 0.06) result of modified version: MSE: Double = 0.8472313396478773 RMSE: 0.92045 results of version of 1.0 MSE: Double = 1.2680743123043832 RMSE: 1.1261 In order to add the vector ratingsNum, I want to change the InLinkBlock structure as follows: private[recommendation] case class InLinkBlock(elementIds: Array[Int], ratingsNum:Array[Int], ratingsForBlock: Array[Array[(Array[Int], Array[Double])]]) So I could calculte the vector ratingsNum in the function of makeInLinkBlock. This is the code I add in the makeInLinkBlock: ... //added val ratingsNum = new Array[Int](numUsers) ratings.map(r = ratingsNum(userIdToPos(r.user)) += 1) //end of added InLinkBlock(userIds, ratingsNum, ratingsForBlock) Is this solution reasonable?? -- This message was sent by Atlassian JIRA (v6.2#6252)
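For context on the weighted regularization being referenced, the ALS-WR paper's objective scales each user's and item's penalty by their rating counts; written here in generic notation, not as it appears in the MLlib source:
{code}
\min_{X,Y} \sum_{(u,i) \in R} \big( r_{ui} - x_u^\top y_i \big)^2
  + \lambda \Big( \sum_u n_u \lVert x_u \rVert^2 + \sum_i m_i \lVert y_i \rVert^2 \Big)
{code}
Here the first sum runs over the observed ratings R, n_u is the number of ratings by user u, and m_i the number of ratings of item i, which corresponds to the lambda * ratingsNum(index) scaling discussed above.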
[jira] [Commented] (SPARK-2251) MLLib Naive Bayes Example SparkException: Can only zip RDDs with same number of elements in each partition
[ https://issues.apache.org/jira/browse/SPARK-2251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14042738#comment-14042738 ] Sean Owen commented on SPARK-2251: -- For what it's worth, I can reproduce this. In the sample, the test RDD has 2 partitions, containing 2 and 1 examples. The prediction RDD has 2 partitions, containing 1 and 2 examples respectively. So they aren't matched up, even though one is a 1-1 map() of the other. That seems like it shouldn't happen? maybe someone more knowledgeable can say whether that itself should occur. test is a PartitionwiseSampledRDD and prediction is a MappedRDD of course. If it is allowed to happen, then the example should be fixed, and I could easily supply a patch. It can be done without having to zip up RDDs to begin with. MLLib Naive Bayes Example SparkException: Can only zip RDDs with same number of elements in each partition -- Key: SPARK-2251 URL: https://issues.apache.org/jira/browse/SPARK-2251 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.0.0 Environment: OS: Fedora Linux Spark Version: 1.0.0. Git clone from the Spark Repository Reporter: Jun Xie Priority: Minor Labels: Naive-Bayes I follow the exact code from Naive Bayes Example (http://spark.apache.org/docs/latest/mllib-naive-bayes.html) of MLLib. When I executed the final command: val accuracy = 1.0 * predictionAndLabel.filter(x = x._1 == x._2).count() / test.count() It complains Can only zip RDDs with same number of elements in each partition. I got the following exception: 14/06/23 19:39:23 INFO SparkContext: Starting job: count at console:31 14/06/23 19:39:23 INFO DAGScheduler: Got job 3 (count at console:31) with 2 output partitions (allowLocal=false) 14/06/23 19:39:23 INFO DAGScheduler: Final stage: Stage 4(count at console:31) 14/06/23 19:39:23 INFO DAGScheduler: Parents of final stage: List() 14/06/23 19:39:23 INFO DAGScheduler: Missing parents: List() 14/06/23 19:39:23 INFO DAGScheduler: Submitting Stage 4 (FilteredRDD[14] at filter at console:31), which has no missing parents 14/06/23 19:39:23 INFO DAGScheduler: Submitting 2 missing tasks from Stage 4 (FilteredRDD[14] at filter at console:31) 14/06/23 19:39:23 INFO TaskSchedulerImpl: Adding task set 4.0 with 2 tasks 14/06/23 19:39:23 INFO TaskSetManager: Starting task 4.0:0 as TID 8 on executor localhost: localhost (PROCESS_LOCAL) 14/06/23 19:39:23 INFO TaskSetManager: Serialized task 4.0:0 as 3410 bytes in 0 ms 14/06/23 19:39:23 INFO TaskSetManager: Starting task 4.0:1 as TID 9 on executor localhost: localhost (PROCESS_LOCAL) 14/06/23 19:39:23 INFO TaskSetManager: Serialized task 4.0:1 as 3410 bytes in 1 ms 14/06/23 19:39:23 INFO Executor: Running task ID 8 14/06/23 19:39:23 INFO Executor: Running task ID 9 14/06/23 19:39:23 INFO BlockManager: Found block broadcast_0 locally 14/06/23 19:39:23 INFO BlockManager: Found block broadcast_0 locally 14/06/23 19:39:23 INFO HadoopRDD: Input split: file:/home/jun/open_source/spark/mllib/data/sample_naive_bayes_data.txt:0+24 14/06/23 19:39:23 INFO HadoopRDD: Input split: file:/home/jun/open_source/spark/mllib/data/sample_naive_bayes_data.txt:24+24 14/06/23 19:39:23 INFO HadoopRDD: Input split: file:/home/jun/open_source/spark/mllib/data/sample_naive_bayes_data.txt:0+24 14/06/23 19:39:23 INFO HadoopRDD: Input split: file:/home/jun/open_source/spark/mllib/data/sample_naive_bayes_data.txt:24+24 14/06/23 19:39:23 ERROR Executor: Exception in task ID 9 org.apache.spark.SparkException: Can only zip RDDs with same number of elements 
in each partition at org.apache.spark.rdd.RDD$$anonfun$zip$1$$anon$1.hasNext(RDD.scala:663) at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:388) at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1067) at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:858) at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:858) at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1079) at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1079) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111) at org.apache.spark.scheduler.Task.run(Task.scala:51) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:724) 14/06/23 19:39:23 ERROR Executor: Exception in task ID 8
[jira] [Commented] (SPARK-2268) Utils.createTempDir() creates race with HDFS at shutdown
[ https://issues.apache.org/jira/browse/SPARK-2268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14043102#comment-14043102 ] Sean Owen commented on SPARK-2268: -- Yeah this hook is deleting local files, not HDFS files. I don't think it can interact with Hadoop APIs or else it fails when used without Hadoop. Utils.createTempDir() creates race with HDFS at shutdown Key: SPARK-2268 URL: https://issues.apache.org/jira/browse/SPARK-2268 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.0 Reporter: Marcelo Vanzin Utils.createTempDir() has this code: {code} // Add a shutdown hook to delete the temp dir when the JVM exits Runtime.getRuntime.addShutdownHook(new Thread(delete Spark temp dir + dir) { override def run() { // Attempt to delete if some patch which is parent of this is not already registered. if (! hasRootAsShutdownDeleteDir(dir)) Utils.deleteRecursively(dir) } }) {code} This creates a race with the shutdown hooks registered by HDFS, since the order of execution is undefined; if the HDFS hooks run first, you'll get exceptions about the file system being closed. Instead, this should use Hadoop's ShutdownHookManager with a proper priority, so that it runs before the HDFS hooks. -- This message was sent by Atlassian JIRA (v6.2#6252)
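A minimal sketch of the ShutdownHookManager approach the report proposes, assuming hadoop-common is on the classpath; the priority value is an illustrative choice, and as the comment above notes, taking this route would make the hook depend on Hadoop being present.
{code}
import org.apache.hadoop.util.ShutdownHookManager

// Hooks with higher priority run earlier; FileSystem's own hook registers at
// priority 10 (FileSystem.SHUTDOWN_HOOK_PRIORITY), so anything larger runs
// before HDFS clients are closed.
val cleanupPriority = 50  // illustrative, any value > 10 would do
ShutdownHookManager.get().addShutdownHook(new Runnable {
  override def run(): Unit = {
    // delete the local Spark temp dir here (Utils.deleteRecursively(dir) in the quoted code)
  }
}, cleanupPriority)
{code}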
[jira] [Commented] (SPARK-2251) MLLib Naive Bayes Example SparkException: Can only zip RDDs with same number of elements in each partition
[ https://issues.apache.org/jira/browse/SPARK-2251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14043169#comment-14043169 ] Sean Owen commented on SPARK-2251: -- Well the change to the examples is pretty straightforward. Instead of separately computing predictions, you just: {code} val predictionAndLabel = test.map(x = (model.predict(x.features), x.label)) {code} ... and similarly for other languages, and other examples. In fact it seems more straightforward. But I am wondering if this is actually a bug in PartitionwiseSampledRDD. [~mengxr] is this a bit of code you wrote or are familiar with? MLLib Naive Bayes Example SparkException: Can only zip RDDs with same number of elements in each partition -- Key: SPARK-2251 URL: https://issues.apache.org/jira/browse/SPARK-2251 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.0.0 Environment: OS: Fedora Linux Spark Version: 1.0.0. Git clone from the Spark Repository Reporter: Jun Xie Priority: Minor Labels: Naive-Bayes I follow the exact code from Naive Bayes Example (http://spark.apache.org/docs/latest/mllib-naive-bayes.html) of MLLib. When I executed the final command: val accuracy = 1.0 * predictionAndLabel.filter(x = x._1 == x._2).count() / test.count() It complains Can only zip RDDs with same number of elements in each partition. I got the following exception: 14/06/23 19:39:23 INFO SparkContext: Starting job: count at console:31 14/06/23 19:39:23 INFO DAGScheduler: Got job 3 (count at console:31) with 2 output partitions (allowLocal=false) 14/06/23 19:39:23 INFO DAGScheduler: Final stage: Stage 4(count at console:31) 14/06/23 19:39:23 INFO DAGScheduler: Parents of final stage: List() 14/06/23 19:39:23 INFO DAGScheduler: Missing parents: List() 14/06/23 19:39:23 INFO DAGScheduler: Submitting Stage 4 (FilteredRDD[14] at filter at console:31), which has no missing parents 14/06/23 19:39:23 INFO DAGScheduler: Submitting 2 missing tasks from Stage 4 (FilteredRDD[14] at filter at console:31) 14/06/23 19:39:23 INFO TaskSchedulerImpl: Adding task set 4.0 with 2 tasks 14/06/23 19:39:23 INFO TaskSetManager: Starting task 4.0:0 as TID 8 on executor localhost: localhost (PROCESS_LOCAL) 14/06/23 19:39:23 INFO TaskSetManager: Serialized task 4.0:0 as 3410 bytes in 0 ms 14/06/23 19:39:23 INFO TaskSetManager: Starting task 4.0:1 as TID 9 on executor localhost: localhost (PROCESS_LOCAL) 14/06/23 19:39:23 INFO TaskSetManager: Serialized task 4.0:1 as 3410 bytes in 1 ms 14/06/23 19:39:23 INFO Executor: Running task ID 8 14/06/23 19:39:23 INFO Executor: Running task ID 9 14/06/23 19:39:23 INFO BlockManager: Found block broadcast_0 locally 14/06/23 19:39:23 INFO BlockManager: Found block broadcast_0 locally 14/06/23 19:39:23 INFO HadoopRDD: Input split: file:/home/jun/open_source/spark/mllib/data/sample_naive_bayes_data.txt:0+24 14/06/23 19:39:23 INFO HadoopRDD: Input split: file:/home/jun/open_source/spark/mllib/data/sample_naive_bayes_data.txt:24+24 14/06/23 19:39:23 INFO HadoopRDD: Input split: file:/home/jun/open_source/spark/mllib/data/sample_naive_bayes_data.txt:0+24 14/06/23 19:39:23 INFO HadoopRDD: Input split: file:/home/jun/open_source/spark/mllib/data/sample_naive_bayes_data.txt:24+24 14/06/23 19:39:23 ERROR Executor: Exception in task ID 9 org.apache.spark.SparkException: Can only zip RDDs with same number of elements in each partition at org.apache.spark.rdd.RDD$$anonfun$zip$1$$anon$1.hasNext(RDD.scala:663) at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:388) at 
org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1067) at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:858) at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:858) at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1079) at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1079) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111) at org.apache.spark.scheduler.Task.run(Task.scala:51) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:724) 14/06/23 19:39:23 ERROR Executor: Exception in task ID 8 org.apache.spark.SparkException: Can only zip RDDs with same number of elements in each partition at org.apache.spark.rdd.RDD$$anonfun$zip$1$$anon$1.hasNext(RDD.scala:663) at
[jira] [Comment Edited] (SPARK-2293) Replace RDD.zip usage by map with predict inside.
[ https://issues.apache.org/jira/browse/SPARK-2293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14045874#comment-14045874 ] Sean Owen edited comment on SPARK-2293 at 6/27/14 8:53 PM: --- I can make a PR for the example changes since I was already looking at this, unless you've already got it done. As for a new method -- kind of a toss-up between the small added convenience and adding another method to the API. For my part I found it clear to just write a map call. https://github.com/apache/spark/pull/1250 was (Author: srowen): I can make a PR for the example changes since I was already looking at this, unless you've already got it done. As for a new method -- kind of a toss-up between the small added convenience and adding another method to the API. For my part I found it clear to just write a map call. Replace RDD.zip usage by map with predict inside. - Key: SPARK-2293 URL: https://issues.apache.org/jira/browse/SPARK-2293 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Xiangrui Meng Priority: Minor In our guide, we use {code} val prediction = model.predict(test.map(_.features)) val predictionAndLabel = prediction.zip(test.map(_.label)) {code} This is not efficient because test will be computed twice. We should change it to {code} val predictionAndLabel = test.map(p = (model.predict(p.features), p.label)) {code} It is also nice to add a `predictWith` method to predictive models. {code} def predictWith[V](RDD[(Vector, V)]): RDD[(Double, V)] {code} But I'm not sure whether this is a good name. `predictWithValue`? -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1945) Add full Java examples in MLlib docs
[ https://issues.apache.org/jira/browse/SPARK-1945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14047101#comment-14047101 ] Sean Owen commented on SPARK-1945: -- As an aside, there is also JavaPairRDD, which is a better specialization of JavaRDD<Tuple2<?, ?>>. But here you're trying to call Scala APIs so you can't make it expose JavaPairRDD. For that there would need to be a specialized Java version of this API. Add full Java examples in MLlib docs Key: SPARK-1945 URL: https://issues.apache.org/jira/browse/SPARK-1945 Project: Spark Issue Type: Sub-task Components: Documentation, MLlib Reporter: Matei Zaharia Labels: Starter Fix For: 1.0.0 Right now some of the Java tabs only say the following: All of MLlib’s methods use Java-friendly types, so you can import and call them there the same way you do in Scala. The only caveat is that the methods take Scala RDD objects, while the Spark Java API uses a separate JavaRDD class. You can convert a Java RDD to a Scala one by calling .rdd() on your JavaRDD object. Would be nice to translate the Scala code into Java instead. Also, a few pages (most notably the Matrix one) don't have Java examples at all. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2330) Spark shell has weird scala semantics
[ https://issues.apache.org/jira/browse/SPARK-2330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14047689#comment-14047689 ] Sean Owen commented on SPARK-2330: -- I can't reproduce this in HEAD right now. Try that? This also sounds like a potential duplicate of https://issues.apache.org/jira/browse/SPARK-1199 Spark shell has weird scala semantics - Key: SPARK-2330 URL: https://issues.apache.org/jira/browse/SPARK-2330 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 0.9.1, 1.0.0 Environment: Ubuntu 14.04 with spark-x.x.x-bin-hadoop2 Reporter: Andrea Ferretti Labels: scala, shell Normal scala expressions are interpreted in a strange way in the spark shell. For instance {noformat} case class Foo(x: Int) def print(f: Foo) = f.x val f = Foo(3) print(f) console:24: error: type mismatch; found : Foo required: Foo {noformat} For another example {noformat} trait Currency case object EUR extends Currency case object USD extends Currency def nextCurrency: Currency = nextInt(2) match { case 0 => EUR case _ => USD } console:22: error: type mismatch; found : EUR.type required: Currency case 0 => EUR console:24: error: type mismatch; found : USD.type required: Currency case _ => USD {noformat} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Comment Edited] (SPARK-2331) SparkContext.emptyRDD has wrong return type
[ https://issues.apache.org/jira/browse/SPARK-2331?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14048041#comment-14048041 ] Sean Owen edited comment on SPARK-2331 at 6/30/14 7:48 PM: --- My 2 cents. {code} val rdds = Seq(a,b,c).foldLeft(sc.emptyRDD){ (rdd,path) = rdd.union(sc.textFile(path)) } {code} yields {code} console:12: error: type mismatch; found : org.apache.spark.rdd.RDD[String] required: org.apache.spark.rdd.RDD[Nothing] Note: String : Nothing, but class RDD is invariant in type T. You may wish to define T as -T instead. (SLS 4.5) val rdds = Seq(a,b,c).foldLeft(sc.emptyRDD){ (rdd,path) = rdd.union(sc.textFile(path)) } {code} but even {code} val rdds = Seq(a,b,c).foldLeft(sc.emptyRDD[String]){ (rdd,path) = rdd.union(sc.textFile(path)) } {code} yields {code} console:12: error: type mismatch; found : org.apache.spark.rdd.RDD[String] required: org.apache.spark.rdd.EmptyRDD[String] val rdds = Seq(a,b,c).foldLeft(sc.emptyRDD[String]){ (rdd,path) = rdd.union(sc.textFile(path)) } {code} So I think this is really about RDDs being invariant, rather than the return type here, and that seems to be how it's going to be: https://issues.apache.org/jira/browse/SPARK-1296 I think there's an argument for hiding EmptyRDD although that would be a little API change at this point. was (Author: srowen): My 2 cents. You mean the type is EmptyRDD[Nothing] right? {code} val rdds = Seq(a,b,c).foldLeft(sc.emptyRDD){ (rdd,path) = rdd.union(sc.textFile(path)) } {code} yields {code} console:12: error: type mismatch; found : org.apache.spark.rdd.RDD[String] required: org.apache.spark.rdd.RDD[Nothing] Note: String : Nothing, but class RDD is invariant in type T. You may wish to define T as -T instead. (SLS 4.5) val rdds = Seq(a,b,c).foldLeft(sc.emptyRDD){ (rdd,path) = rdd.union(sc.textFile(path)) } {code} but even {code} val rdds = Seq(a,b,c).foldLeft(sc.emptyRDD[String]){ (rdd,path) = rdd.union(sc.textFile(path)) } {code} yields {code} console:12: error: type mismatch; found : org.apache.spark.rdd.RDD[String] required: org.apache.spark.rdd.EmptyRDD[String] val rdds = Seq(a,b,c).foldLeft(sc.emptyRDD[String]){ (rdd,path) = rdd.union(sc.textFile(path)) } {code} So I think this is really about RDDs being invariant, rather than the return type here, and that seems to be how it's going to be: https://issues.apache.org/jira/browse/SPARK-1296 I think there's an argument for hiding EmptyRDD although that would be a little API change at this point. SparkContext.emptyRDD has wrong return type --- Key: SPARK-2331 URL: https://issues.apache.org/jira/browse/SPARK-2331 Project: Spark Issue Type: Bug Affects Versions: 1.0.0 Reporter: Ian Hummel The return type for SparkContext.emptyRDD is EmptyRDD[T]. It should be RDD[T]. That means you have to add extra type annotations on code like the below (which creates a union of RDDs over some subset of paths in a folder) val rdds = Seq(a, b, c).foldLeft[RDD[String]](sc.emptyRDD[String]) { (rdd, path) ⇒ rdd.union(sc.textFile(path)) } -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2331) SparkContext.emptyRDD has wrong return type
[ https://issues.apache.org/jira/browse/SPARK-2331?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14048041#comment-14048041 ] Sean Owen commented on SPARK-2331: -- My 2 cents. You mean the type is EmptyRDD[Nothing] right? {code} val rdds = Seq("a", "b", "c").foldLeft(sc.emptyRDD){ (rdd, path) => rdd.union(sc.textFile(path)) } {code} yields {code} console:12: error: type mismatch; found : org.apache.spark.rdd.RDD[String] required: org.apache.spark.rdd.RDD[Nothing] Note: String >: Nothing, but class RDD is invariant in type T. You may wish to define T as -T instead. (SLS 4.5) val rdds = Seq("a", "b", "c").foldLeft(sc.emptyRDD){ (rdd, path) => rdd.union(sc.textFile(path)) } {code} but even {code} val rdds = Seq("a", "b", "c").foldLeft(sc.emptyRDD[String]){ (rdd, path) => rdd.union(sc.textFile(path)) } {code} yields {code} console:12: error: type mismatch; found : org.apache.spark.rdd.RDD[String] required: org.apache.spark.rdd.EmptyRDD[String] val rdds = Seq("a", "b", "c").foldLeft(sc.emptyRDD[String]){ (rdd, path) => rdd.union(sc.textFile(path)) } {code} So I think this is really about RDDs being invariant, rather than the return type here, and that seems to be how it's going to be: https://issues.apache.org/jira/browse/SPARK-1296 I think there's an argument for hiding EmptyRDD although that would be a little API change at this point. SparkContext.emptyRDD has wrong return type --- Key: SPARK-2331 URL: https://issues.apache.org/jira/browse/SPARK-2331 Project: Spark Issue Type: Bug Affects Versions: 1.0.0 Reporter: Ian Hummel The return type for SparkContext.emptyRDD is EmptyRDD[T]. It should be RDD[T]. That means you have to add extra type annotations on code like the below (which creates a union of RDDs over some subset of paths in a folder) val rdds = Seq("a", "b", "c").foldLeft[RDD[String]](sc.emptyRDD[String]) { (rdd, path) ⇒ rdd.union(sc.textFile(path)) } -- This message was sent by Atlassian JIRA (v6.2#6252)