Either way, for now, compiling Spark (with an install into the local Maven repository) and then Mahout (which will pick up those local Maven artifacts) on the same machine, and then re-distributing the artifacts to the worker nodes, should work regardless of the compilation parameters.
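In shell terms, that sequence is roughly the following. This is only a sketch: the Hadoop version is the one from Pat's steps below, the source-tree paths are placeholders to adjust, and the repository paths are just the standard Maven group-id layout for org.apache.spark and org.apache.mahout.

    # 1) Clear cached Spark and Mahout artifacts so Maven cannot silently fall
    #    back to an incompatible, previously downloaded build.
    rm -rf ~/.m2/repository/org/apache/spark ~/.m2/repository/org/apache/mahout

    # 2) Build Spark for the cluster's Hadoop version and *install* it into the
    #    local Maven cache (mvn install, not mvn package).
    cd /path/to/spark
    mvn -Dhadoop.version=1.2.1 -DskipTests clean install

    # 3) Build Mahout against exactly those Spark bits.
    cd /path/to/mahout
    mvn clean install

The Spark assembly produced by step 2 is then what should be deployed to the workers, so the client and backend bits stay identical.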
On Tue, Oct 21, 2014 at 3:28 PM, Dmitriy Lyubimov <[email protected]> wrote:

> Hm, no, they don't push different binary releases to Maven. I assume they
> only push the default one.

On Tue, Oct 21, 2014 at 3:26 PM, Dmitriy Lyubimov <[email protected]> wrote:

> PS: I remember a discussion about packaging binary Spark distributions, so
> there is in fact a number of different Spark artifact releases. However, I
> am not sure they are pushing them to Maven repositories (if they did, they
> might use different Maven classifiers for those). If that's the case, then
> one plausible strategy here is to recommend rebuilding Mahout with a
> dependency on the classifier corresponding to the actual Spark binary
> release used.

On Tue, Oct 21, 2014 at 2:21 PM, Dmitriy Lyubimov <[email protected]> wrote:

> If you are using the Mahout shell or the command-line drivers (which I
> don't), the correct thing to do would seem to be for the mahout script to
> simply take the Spark dependencies from the installed $SPARK_HOME rather
> than from Mahout's assembly. That would be consistent with what other
> projects do in a similar situation, and it should also make things
> compatible between minor releases of Spark.
>
> But I think you are right in the sense that the problem is that Spark jars
> are not uniquely identified by Maven artifact id and version, unlike most
> other products. (E.g. if we see mahout-math-0.9.jar we expect there to be
> one and only one released artifact in existence -- but one's local build
> may create incompatible variations.)

On Tue, Oct 21, 2014 at 1:51 PM, Pat Ferrel <[email protected]> wrote:

> The problem is not in building Spark, it is in building Mahout against the
> correct Spark jars. If you are using CDH and Hadoop 2, the correct jars
> are in the repos.
>
> For the rest of us, though the process below seems like an error-prone
> hack to me, it does work on Linux and BSD/Mac. It should really be
> addressed by Spark, imo.
>
> BTW, the cache is laid out differently on Linux, but I don't think you
> need to delete it anyway.

On Oct 21, 2014, at 12:27 PM, Dmitriy Lyubimov <[email protected]> wrote:

> FWIW, I never built Spark using Maven. I always use sbt assembly.

On Tue, Oct 21, 2014 at 11:55 AM, Pat Ferrel <[email protected]> wrote:

> OK, the mystery is solved.
>
> The safe sequence, from my limited testing, is:
>
> 1) Delete ~/.m2/repository/org/spark and mahout.
> 2) Build Spark for your version of Hadoop, *but do not use "mvn package
>    ..."* -- use "mvn install ...". This puts a copy of the exact bits you
>    need into the Maven cache for building Mahout against. In my case,
>    using Hadoop 1.2.1, it was "mvn -Dhadoop.version=1.2.1 -DskipTests
>    clean install". If you run tests on Spark, some failures can safely be
>    ignored according to the Spark guys, so check before giving up.
> 3) Build Mahout with "mvn clean install".
>
> This creates Mahout from exactly the same bits you will run on your
> cluster. It got rid of a missing anon function for me. The problem occurs
> when you use a different version of Spark on your cluster than the one you
> used to build Mahout, and this is rather hidden by Maven. Maven downloads
> from the repos any dependency that is not in the local .m2 cache, so you
> have to make sure your version of Spark is there so Maven won't download
> one that is incompatible. Unless you really know what you are doing, I'd
> build both Spark and Mahout for now.
>
> BTW, I will check in the Spark 1.1.0 version of Mahout once I do some
> more testing.

On Oct 21, 2014, at 10:26 AM, Pat Ferrel <[email protected]> wrote:

> Sorry to hear. I bet you'll find a way.
>
> The Spark Jira trail leads to two suggestions:
>
> 1) Use spark-submit to execute code with your own entry point (other than
>    spark-shell). One theory points to not loading all needed Spark classes
>    from the calling code (Mahout in our case). I can hand-check the jars
>    for the anon function I am missing.
> 2) There may be different class names in the running code (created by
>    building Spark locally) and the version referenced in the Mahout POM.
>    If this turns out to be true, it means we can't rely on building Spark
>    locally. Is there a Maven target that puts the artifacts of the Spark
>    build into the .m2/repository local cache? That would be an easy way
>    to test this theory.
>
> Either of these could cause missing classes.

On Oct 21, 2014, at 9:52 AM, Dmitriy Lyubimov <[email protected]> wrote:

> No, I haven't used it with anything but 1.0.1 and 0.9.x.
>
> On a side note, I have just changed my employer. It is one of those big
> guys that make it very difficult to do any contributions, so I am not sure
> how much of anything I will be able to share/contribute.

On Tue, Oct 21, 2014 at 9:43 AM, Pat Ferrel <[email protected]> wrote:

> But unless you have the time to devote to errors, avoid it. I've built
> everything from scratch using 1.0.2 and 1.1.0 and am getting these and
> missing-class errors. The 1.x branch seems to have some kind of peculiar
> build-order dependencies. The errors sometimes don't show up until
> runtime, passing all build tests.
>
> Dmitriy, have you successfully used any Spark version other than 1.0.1 on
> a cluster? If so, do you recall the exact order and from what sources you
> built?

On Oct 21, 2014, at 9:35 AM, Dmitriy Lyubimov <[email protected]> wrote:

> You can't use a Spark client of one version and have the backend of
> another. You can try to change the Spark dependency in the Mahout poms to
> match your backend (or, vice versa, you can change your backend to match
> what's on the client).
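A minimal way to check for this client/backend version skew up front. This is a sketch only: it assumes the Mahout tree pins Spark through a <spark.version> property in the top-level pom (it may instead be declared directly in the spark module's pom), and it assumes the usual assembly-jar layouts of a binary Spark distribution versus a from-source build.

    # Spark version the Mahout client code is compiled against (assumed
    # <spark.version> property; adjust if your pom declares it differently):
    grep -m1 '<spark.version>' /path/to/mahout/pom.xml

    # Spark version of the backend, read off the assembly jar name
    # (binary-distribution layout first, source-build layout as a fallback):
    ls $SPARK_HOME/lib/spark-assembly-*.jar 2>/dev/null \
      || ls $SPARK_HOME/assembly/target/scala-*/spark-assembly-*.jar

    # If the two versions disagree, rebuild one side to match the other.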
On Tue, Oct 21, 2014 at 7:12 AM, Mahesh Balija <[email protected]> wrote:

> Hi All,
>
> Here are the errors I get when I run in pseudo-distributed mode, with
> Spark 1.0.2 and the latest Mahout code (cloned).
>
> When I run the command from the page
> https://mahout.apache.org/users/sparkbindings/play-with-shell.html
>
> val drmX = drmData(::, 0 until 4)
>
> java.io.InvalidClassException: org.apache.spark.rdd.RDD; local class incompatible: stream classdesc serialVersionUID = 385418487991259089, local class serialVersionUID = -6766554341038829528
>     at java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:592)
>     at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1621)
>     at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1516)
>     at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1621)
>     at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1516)
>     at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1770)
>     at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1349)
>     at java.io.ObjectInputStream.readObject(ObjectInputStream.java:369)
>     at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:63)
>     at org.apache.spark.scheduler.ResultTask$.deserializeInfo(ResultTask.scala:61)
>     at org.apache.spark.scheduler.ResultTask.readExternal(ResultTask.scala:141)
>     at java.io.ObjectInputStream.readExternalData(ObjectInputStream.java:1836)
>     at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1795)
>     at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1349)
>     at java.io.ObjectInputStream.readObject(ObjectInputStream.java:369)
>     at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:63)
>     at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:85)
>     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:165)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>     at java.lang.Thread.run(Thread.java:701)
> 14/10/21 19:35:37 WARN TaskSetManager: Lost TID 1 (task 0.0:1)
> 14/10/21 19:35:37 WARN TaskSetManager: Lost TID 2 (task 0.0:0)
> 14/10/21 19:35:37 WARN TaskSetManager: Lost TID 3 (task 0.0:1)
> 14/10/21 19:35:38 WARN TaskSetManager: Lost TID 4 (task 0.0:0)
> 14/10/21 19:35:38 WARN TaskSetManager: Lost TID 5 (task 0.0:1)
> 14/10/21 19:35:38 WARN TaskSetManager: Lost TID 6 (task 0.0:0)
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0.0:0 failed 4 times, most recent failure: Exception failure in TID 6 on host mahesh-VirtualBox.local: java.io.InvalidClassException: org.apache.spark.rdd.RDD; local class incompatible: stream classdesc serialVersionUID = 385418487991259089, local class serialVersionUID = -6766554341038829528
>     java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:592)
>     java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1621)
>     java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1516)
>     java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1621)
>     java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1516)
>     java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1770)
>     java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1349)
>     java.io.ObjectInputStream.readObject(ObjectInputStream.java:369)
>     org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:63)
>     org.apache.spark.scheduler.ResultTask$.deserializeInfo(ResultTask.scala:61)
>     org.apache.spark.scheduler.ResultTask.readExternal(ResultTask.scala:141)
>     java.io.ObjectInputStream.readExternalData(ObjectInputStream.java:1836)
>     java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1795)
>     java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1349)
>     java.io.ObjectInputStream.readObject(ObjectInputStream.java:369)
>     org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:63)
>     org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:85)
>     org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:165)
>     java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146)
>     java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>     java.lang.Thread.run(Thread.java:701)
> Driver stacktrace:
>     at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1044)
>     at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1028)
>     at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1026)
>     at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>     at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>     at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1026)
>     at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:634)
>     at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:634)
>     at scala.Option.foreach(Option.scala:236)
>     at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:634)
>     at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1229)
>     at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
>     at akka.actor.ActorCell.invoke(ActorCell.scala:456)
>     at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
>     at akka.dispatch.Mailbox.run(Mailbox.scala:219)
>     at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
>     at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>     at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>     at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>     at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
>
> Best,
> Mahesh Balija.

On Tue, Oct 21, 2014 at 2:38 AM, Dmitriy Lyubimov <[email protected]> wrote:

> On Mon, Oct 20, 2014 at 1:51 PM, Pat Ferrel <[email protected]> wrote:
>
>> Is anyone else nervous about ignoring this issue, or about relying on
>> non-build (hand-run), test-driven transitive dependency checking? I hope
>> someone else will chime in.
>>
>> As to running unit tests on a TEST_MASTER, I'll look into it. Can we set
>> up the build machine to do this? I'd feel better about eyeballing deps if
>> we could have a TEST_MASTER automatically run during builds at Apache.
>> Maybe the regular unit tests are OK for building locally ourselves.
>>
>> On Oct 20, 2014, at 12:23 PM, Dmitriy Lyubimov <[email protected]> wrote:
>>
>>> On Mon, Oct 20, 2014 at 11:44 AM, Pat Ferrel <[email protected]> wrote:
>>>
>>>> Maybe a more fundamental issue is that we don't know for sure whether
>>>> we have missing classes or not. The job.jar at least used the pom
>>>> dependencies to guarantee every needed class was present. So the
>>>> job.jar seems to solve the problem, but may ship some unnecessary
>>>> duplicate code, right?
>>>
>>> No, as I wrote, Spark doesn't work with the job-jar format. Neither, as
>>> it turns out, does more recent Hadoop MR, btw.
>>
>> Not speaking literally of the format. Spark understands jars, and Maven
>> can build one from the transitive dependencies.
>>
>>> Yes, and this is A LOT of duplicate code (it will normally take MINUTES
>>> to start up tasks with all of it, just on copy time). This is absolutely
>>> not the way to go with this.
>>
>> Lack of a guarantee to load seems like a bigger problem than startup
>> time. Clearly we can't just ignore this.
>
> Nope. Given the highly iterative nature and dynamic task allocation in
> this environment, one is looking at effects similar to MapReduce.
This >>>> is >>>> >> not >>>> >>>> the only reason why I never go to MR anymore, but that's one of >>>> main >>>> >>> ones. >>>> >>>> >>>> >>>> How about experiment: why don't you create assembly that copies ALL >>>> >>>> transitive dependencies in one folder, and then try to broadcast it >>>> > from >>>> >>>> single point (front end) to well... let's start with 20 machines. >>>> (of >>>> >>>> course we ideally want to into 10^3 ..10^4 range -- but why bother >>>> if >>>> > we >>>> >>>> can't do it for 20). >>>> >>>> >>>> >>>> Or, heck, let's try to simply parallel-copy it between too >>>> machines 20 >>>> >>>> times that are not collocated on the same subnet. >>>> >>>> >>>> >>>> >>>> >>>>>> >>>> >>>>>>> There may be any number of bugs waiting for the time we try >>>> running >>>> >>>> on a >>>> >>>>>>> node machine that doesn’t have some class in it’s classpath. >>>> >>>>>> >>>> >>>>>> >>>> >>>>>> No. Assuming any given method is tested on all its execution >>>> paths, >>>> >>>> there >>>> >>>>>> will be no bugs. The bugs of that sort will only appear if the >>>> user >>>> >>> is >>>> >>>>>> using algebra directly and calls something that is not on the >>>> path, >>>> >>>> from >>>> >>>>>> the closure. In which case our answer to this is the same as for >>>> the >>>> >>>>> solver >>>> >>>>>> methodology developers -- use customized SparkConf while creating >>>> >>>> context >>>> >>>>>> to include stuff you really want. >>>> >>>>>> >>>> >>>>>> Also another right answer to this is that we probably should >>>> >>> reasonably >>>> >>>>>> provide the toolset here. For example, all the stats stuff found >>>> in R >>>> >>>>> base >>>> >>>>>> and R stat packages so the user is not compelled to go >>>> non-native. >>>> >>>>>> >>>> >>>>>> >>>> >>>>> >>>> >>>>> Huh? this is not true. The one I ran into was found by calling >>>> >>> something >>>> >>>>> in math from something in math-scala. It led outside and you can >>>> >>>> encounter >>>> >>>>> such things even in algebra. In fact you have no idea if these >>>> >>> problems >>>> >>>>> exists except for the fact you have used it a lot personally. >>>> >>>>> >>>> >>>> >>>> >>>> >>>> >>>> You ran it with your own code that never existed before. >>>> >>>> >>>> >>>> But there's difference between released Mahout code (which is what >>>> you >>>> >>> are >>>> >>>> working on) and the user code. Released code must run thru remote >>>> tests >>>> >>> as >>>> >>>> you suggested and thus guarantee there are no such problems with >>>> post >>>> >>>> release code. >>>> >>>> >>>> >>>> For users, we only can provide a way for them to load stuff that >>>> they >>>> >>>> decide to use. We don't have apriori knowledge what they will use. >>>> It >>>> > is >>>> >>>> the same thing that spark does, and the same thing that MR does, >>>> > doesn't >>>> >>>> it? >>>> >>>> >>>> >>>> Of course mahout should drop rigorously the stuff it doesn't load, >>>> from >>>> >>> the >>>> >>>> scala scope. No argue about that. In fact that's what i suggested >>>> as #1 >>>> >>>> solution. But there's nothing much to do here but to go dependency >>>> >>>> cleansing for math and spark code. Part of the reason there's so >>>> much >>>> > is >>>> >>>> because newer modules still bring in everything from mrLegacy. >>>> >>>> >>>> >>>> You are right in saying it is hard to guess what else dependencies >>>> are >>>> >> in >>>> >>>> the util/legacy code that are actually used. 
but that's not a >>>> >>> justification >>>> >>>> for brute force "copy them all" approach that virtually guarantees >>>> >>> ruining >>>> >>>> one of the foremost legacy issues this work intended to address. >>>> >>>> >>>> >>> >>>> >> >>>> >> >>>> > >>>> > >>>> > >>>> >>>> >>> >> >
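Tying together the two pieces of advice above -- running your own entry point through spark-submit, and explicitly including only the jars your closures actually need rather than copying every transitive dependency to the workers -- a minimal sketch might look like the following. The master URL, driver class, and jar paths are placeholders, and the exact Mahout jar names depend on the version you built.

    # Run a custom driver class through spark-submit, shipping only the Mahout
    # jars the job actually uses to the executors (comma-separated --jars list).
    $SPARK_HOME/bin/spark-submit \
      --master spark://your-master:7077 \
      --class com.example.MyMahoutJob \
      --jars /path/to/mahout/math/target/mahout-math-1.0-SNAPSHOT.jar,/path/to/mahout/math-scala/target/mahout-math-scala-1.0-SNAPSHOT.jar \
      /path/to/my-mahout-job.jar

The same effect can be achieved programmatically by setting the jar list on the SparkConf used to create the context, which is the "customized SparkConf" route mentioned above.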
