Hm, no, they don't push the different binary releases to Maven. I assume they only push the default one.
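If Spark did publish per-Hadoop-profile artifacts under Maven classifiers, the rebuild strategy proposed below would amount to roughly the following line in an sbt build definition. This is a hypothetical sketch only; the classifier name is invented, and, per the reply above, no such classified Spark artifacts are actually published.

    // Hypothetical: pin the Spark artifact matching the binary release on the cluster
    // via a Maven classifier. The classifier "hadoop2" is made up for illustration;
    // Spark does not publish classified artifacts like this to Maven Central.
    libraryDependencies += "org.apache.spark" %% "spark-core" % "1.1.0" classifier "hadoop2"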
On Tue, Oct 21, 2014 at 3:26 PM, Dmitriy Lyubimov <[email protected]> wrote:

PS: I remember a discussion about packaging binary Spark distributions, so there are in fact a number of different Spark artifact releases. However, I am not sure whether they push them to Maven repositories (if they did, they might use different Maven classifiers for them). If that's the case, then one plausible strategy here is to recommend rebuilding Mahout with a dependency on the classifier corresponding to the actual Spark binary release being used.

On Tue, Oct 21, 2014 at 2:21 PM, Dmitriy Lyubimov <[email protected]> wrote:

If you are using the Mahout shell or the command-line drivers (which I don't), it would seem the correct thing to do is for the mahout script to simply take its Spark dependencies from the installed $SPARK_HOME rather than from Mahout's assembly. In fact, that would be consistent with what other projects do in a similar situation, and it should also make things compatible between minor releases of Spark.

But I think you are right in the sense that the problem is that Spark jars are not uniquely identified by a Maven artifact id and version, unlike most other products. (E.g., if we see mahout-math-0.9.jar we expect there to be one and only one released artifact in existence -- but a local Spark build may create incompatible variations.)

On Tue, Oct 21, 2014 at 1:51 PM, Pat Ferrel <[email protected]> wrote:

The problem is not in building Spark, it is in building Mahout against the correct Spark jars. If you are using CDH and Hadoop 2, the correct jars are in the repos.

For the rest of us, though the process below seems like an error-prone hack to me, it does work on Linux and BSD/Mac. It should really be addressed by Spark, IMO.

BTW, the cache is laid out differently on Linux, but I don't think you need to delete it anyway.

On Oct 21, 2014, at 12:27 PM, Dmitriy Lyubimov <[email protected]> wrote:

FWIW, I never built Spark using Maven; I always use sbt assembly.

On Tue, Oct 21, 2014 at 11:55 AM, Pat Ferrel <[email protected]> wrote:

OK, the mystery is solved. From my limited testing, the safe sequence is:

1) Delete ~/.m2/repository/org/spark and mahout.

2) Build Spark for your version of Hadoop, but do not use "mvn package ..."; use "mvn install ...". This puts a copy of the exact bits you need into the local Maven cache for building Mahout against. In my case, using Hadoop 1.2.1, it was "mvn -Dhadoop.version=1.2.1 -DskipTests clean install". If you run the tests on Spark, some failures can safely be ignored according to the Spark guys, so check before giving up.

3) Build Mahout with "mvn clean install".

This builds Mahout from exactly the same bits you will run on your cluster, and it got rid of a missing anonymous function for me. The problem occurs when you use a different version of Spark on your cluster than the one you used to build Mahout, and this is rather hidden by Maven: Maven downloads any dependency that is not in the local .m2 cache from remote repos, so you have to make sure your version of Spark is there, or Maven may download one that is incompatible. Unless you really know what you are doing, I'd build both Spark and Mahout this way for now.

BTW, I will check in the Spark 1.1.0 version of Mahout once I do some more testing.
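After a rebuild like this, one quick sanity check is to confirm which Spark the client side actually loaded. A minimal sketch, assuming you can reach the underlying SparkContext (called sc here, e.g. from the Mahout spark-shell) and that your Spark version exposes sc.version; compare the output by hand with the assembly installed on the cluster:

    // Print the Spark version on the client classpath and the jar the RDD class came from.
    // `sc` is assumed to be a live SparkContext; nothing here is Mahout-specific.
    val rddClass = classOf[org.apache.spark.rdd.RDD[_]]
    val source   = Option(rddClass.getProtectionDomain.getCodeSource).map(_.getLocation)
    println(s"client Spark version:  ${sc.version}")
    println(s"RDD class loaded from: ${source.getOrElse("unknown")}")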
On Oct 21, 2014, at 10:26 AM, Pat Ferrel <[email protected]> wrote:

Sorry to hear that. I bet you'll find a way.

The Spark Jira trail leads to two suggestions:

1) Use spark-submit to execute code with your own entry point (other than spark-shell). One theory points to the calling code (Mahout, in our case) not loading all of the Spark classes it needs. I can hand-check the jars for the anonymous function I am missing.

2) There may be different class names between the running code (created by building Spark locally) and the version referenced in the Mahout POM. If this turns out to be true, it means we can't rely on building Spark locally. Is there a Maven target that puts the artifacts of the Spark build into the .m2/repository local cache? That would be an easy way to test this theory.

Either of these could cause missing classes.

On Oct 21, 2014, at 9:52 AM, Dmitriy Lyubimov <[email protected]> wrote:

No, I haven't used it with anything but 1.0.1 and 0.9.x.

On a side note, I have just changed employers. It is one of those big companies that make it very difficult to contribute anything, so I am not sure how much I will be able to share or contribute.

On Tue, Oct 21, 2014 at 9:43 AM, Pat Ferrel <[email protected]> wrote:

But unless you have the time to devote to chasing errors, avoid it. I've built everything from scratch using 1.0.2 and 1.1.0 and am getting these and other missing-class errors. The 1.x branch seems to have some kind of peculiar build-order dependency. The errors sometimes don't show up until runtime, after passing all build tests.

Dmitriy, have you successfully used any Spark version other than 1.0.1 on a cluster? If so, do you recall the exact order in which you built, and from what sources?

On Oct 21, 2014, at 9:35 AM, Dmitriy Lyubimov <[email protected]> wrote:

You can't use a Spark client of one version with a backend of another. You can try to change the Spark dependency in the Mahout POMs to match your backend (or, vice versa, change your backend to match what's on the client).
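One concrete way to check whether client and backend actually match is to compare the serialization fingerprint of a core Spark class on both classpaths. A minimal sketch, nothing Mahout-specific: run these lines once against the jars Mahout was built with, and once on a worker node against the cluster's Spark assembly; if the two numbers differ, task deserialization fails with exactly the InvalidClassException reported in the next message.

    import java.io.ObjectStreamClass

    // The serialVersionUID that java.io computes for the RDD class on this classpath.
    val uid = ObjectStreamClass.lookup(classOf[org.apache.spark.rdd.RDD[_]]).getSerialVersionUID
    println(s"local org.apache.spark.rdd.RDD serialVersionUID: $uid")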
On Tue, Oct 21, 2014 at 7:12 AM, Mahesh Balija <[email protected]> wrote:

Hi All,

Here are the errors I get when I run in pseudo-distributed mode, with Spark 1.0.2 and the latest Mahout code (cloned from the repo).

When I run the following command from the page https://mahout.apache.org/users/sparkbindings/play-with-shell.html

    val drmX = drmData(::, 0 until 4)

I get:

java.io.InvalidClassException: org.apache.spark.rdd.RDD; local class incompatible: stream classdesc serialVersionUID = 385418487991259089, local class serialVersionUID = -6766554341038829528
    at java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:592)
    at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1621)
    at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1516)
    at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1621)
    at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1516)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1770)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1349)
    at java.io.ObjectInputStream.readObject(ObjectInputStream.java:369)
    at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:63)
    at org.apache.spark.scheduler.ResultTask$.deserializeInfo(ResultTask.scala:61)
    at org.apache.spark.scheduler.ResultTask.readExternal(ResultTask.scala:141)
    at java.io.ObjectInputStream.readExternalData(ObjectInputStream.java:1836)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1795)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1349)
    at java.io.ObjectInputStream.readObject(ObjectInputStream.java:369)
    at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:63)
    at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:85)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:165)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:701)
14/10/21 19:35:37 WARN TaskSetManager: Lost TID 1 (task 0.0:1)
14/10/21 19:35:37 WARN TaskSetManager: Lost TID 2 (task 0.0:0)
14/10/21 19:35:37 WARN TaskSetManager: Lost TID 3 (task 0.0:1)
14/10/21 19:35:38 WARN TaskSetManager: Lost TID 4 (task 0.0:0)
14/10/21 19:35:38 WARN TaskSetManager: Lost TID 5 (task 0.0:1)
14/10/21 19:35:38 WARN TaskSetManager: Lost TID 6 (task 0.0:0)
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0.0:0 failed 4 times, most recent failure: Exception failure in TID 6 on host mahesh-VirtualBox.local: java.io.InvalidClassException: org.apache.spark.rdd.RDD; local class incompatible: stream classdesc serialVersionUID = 385418487991259089, local class serialVersionUID = -6766554341038829528
        java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:592)
        java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1621)
        java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1516)
        java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1621)
        java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1516)
        java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1770)
        java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1349)
        java.io.ObjectInputStream.readObject(ObjectInputStream.java:369)
        org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:63)
        org.apache.spark.scheduler.ResultTask$.deserializeInfo(ResultTask.scala:61)
        org.apache.spark.scheduler.ResultTask.readExternal(ResultTask.scala:141)
        java.io.ObjectInputStream.readExternalData(ObjectInputStream.java:1836)
        java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1795)
        java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1349)
        java.io.ObjectInputStream.readObject(ObjectInputStream.java:369)
        org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:63)
        org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:85)
        org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:165)
        java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146)
        java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        java.lang.Thread.run(Thread.java:701)
Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1044)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1028)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1026)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1026)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:634)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:634)
    at scala.Option.foreach(Option.scala:236)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:634)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1229)
    at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
    at akka.actor.ActorCell.invoke(ActorCell.scala:456)
    at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
    at akka.dispatch.Mailbox.run(Mailbox.scala:219)
    at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
    at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
    at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
    at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
    at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

Best,
Mahesh Balija.

On Tue, Oct 21, 2014 at 2:38 AM, Dmitriy Lyubimov <[email protected]> wrote:

> On Mon, Oct 20, 2014 at 1:51 PM, Pat Ferrel <[email protected]> wrote:
>
> Is anyone else nervous about ignoring this issue, or about relying on non-build (hand-run), test-driven transitive-dependency checking? I hope someone else will chime in.
>
> As to running the unit tests against a TEST_MASTER, I'll look into it. Can we set up the build machine to do this? I'd feel better about eyeballing deps if we could have a TEST_MASTER run automatically during builds at Apache. Maybe the regular unit tests are OK for building locally ourselves.
>
>> On Oct 20, 2014, at 12:23 PM, Dmitriy Lyubimov <[email protected]> wrote:
>>
>>> On Mon, Oct 20, 2014 at 11:44 AM, Pat Ferrel <[email protected]> wrote:
>>>
>>> Maybe a more fundamental issue is that we don't know for sure whether we have missing classes or not. The job.jar at least used the POM dependencies to guarantee every needed class was present. So the job.jar seems to solve the problem, but may ship some unnecessary duplicate code, right?
>>
>> No, as I wrote, Spark doesn't work with the job-jar format. Neither, as it turns out, does more recent Hadoop MR, btw.
>
> I'm not speaking literally of the format. Spark understands jars, and Maven can build one from the transitive dependencies.
>
>> Yes, and this is A LOT of duplicate code (it will normally take MINUTES to start up tasks, just on copy time). This is absolutely not the way to go.
>
> A lack of any guarantee that classes will load seems like a bigger problem than startup time. Clearly we can't just ignore this.

Nope. Given the highly iterative nature and dynamic task allocation in this environment, you are looking at effects similar to MapReduce. That is not the only reason I never go back to MR anymore, but it is one of the main ones.

How about an experiment: why don't you create an assembly that copies ALL transitive dependencies into one folder, and then try to broadcast it from a single point (the front end) to, well... let's start with 20 machines (of course we ideally want to be in the 10^3..10^4 range, but why bother if we can't do it for 20)?
Or, heck, let's try to simply parallel-copy it 20 times between two machines that are not collocated on the same subnet.

>>> There may be any number of bugs waiting for the time we try running on a node machine that doesn't have some class in its classpath.

>> No. Assuming any given method is tested on all its execution paths, there will be no bugs. Bugs of that sort will only appear if the user is using algebra directly and calls, from a closure, something that is not on the path. In that case our answer is the same as for the solver-methodology developers -- use a customized SparkConf while creating the context to include the stuff you really want.
>>
>> Another right answer to this is that we should probably provide a reasonable toolset here -- for example, all the stats stuff found in R base and the R stat packages -- so the user is not compelled to go non-native.

> Huh? That is not true. The one I ran into was found by calling something in math from something in math-scala. It led outside, and you can encounter such things even in algebra. In fact, you have no idea whether these problems exist, except that you personally have used it a lot.

You ran it with your own code that never existed before.

But there's a difference between released Mahout code (which is what you are working on) and user code. Released code must run through remote tests, as you suggested, and that guarantees there are no such problems in post-release code.

For users, we can only provide a way for them to load the stuff they decide to use. We don't have a priori knowledge of what they will use. That is the same thing Spark does, and the same thing MR does, isn't it?

Of course Mahout should rigorously drop the stuff it doesn't load from the Scala scope -- no argument about that. In fact, that's what I suggested as the #1 solution. But there's not much to do here other than go dependency-cleansing through the math and spark code. Part of the reason there is so much of it is that the newer modules still bring in everything from mrLegacy.

You are right in saying it is hard to guess which other dependencies in the util/legacy code are actually used, but that's not a justification for a brute-force "copy them all" approach that virtually guarantees ruining one of the foremost legacy issues this work was intended to address.
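For a user writing their own driver, a minimal sketch of the customized-SparkConf approach suggested above might look like the following. All names, paths, and the master URL are illustrative; the point is that only the jars your closures actually reference get shipped, rather than a job.jar of every transitive dependency.

    import org.apache.spark.{SparkConf, SparkContext}

    // Build the configuration yourself and enumerate the extra jars your closures need;
    // Spark distributes these to the executors when the job is submitted.
    val conf = new SparkConf()
      .setAppName("my-mahout-driver")              // illustrative name
      .setMaster("spark://master:7077")            // your cluster master
      .setJars(Seq(
        "/opt/myapp/lib/my-solver.jar",            // your own code
        "/opt/myapp/lib/extra-stats-lib.jar"       // a dependency a closure actually touches
      ))

    val sc = new SparkContext(conf)
    // ...then hand sc to the Mahout Spark bindings however your driver normally does.

This keeps task startup cheap for the reasons discussed above, at the cost of the user having to know which extra jars their closures pull in.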
