I am just going to give you some of the design intent behind the existing code. As far as I can recollect, the Mahout context gives complete flexibility: you can control the behavior by various degrees of overriding the defaults, doing more or less of the context setup work on your own. (I assume we are talking specifically about sparkbindings.)
By default, the mahoutSparkContext() helper of the sparkbindings package tries to locate the jars in whatever MAHOUT_HOME/bin/classpath -spark tells it. (BTW, this part could be rewritten much more elegantly and robustly with Scala's scala.sys.process._ capabilities; it's just that this code is really more than 3 years old now, and I was not deep enough into Scala back then to know its shell DSL in such detail.) The logic of MAHOUT_HOME/bin/classpath -spark is admittedly pretty convoluted, and there are location variations between the binary distribution and a maven-built source tree. I can't say I understand the underlying structure, or the motivation for that structure, very well.

(1) You can tell it to skip automatically adding these jars to the context and instead use your own algorithm to locate them (e.g., in the Zeppelin home or something). You can do this in more than one way:

(1a) Set addMahoutJars = false. The correct behavior should then be to drop the MAHOUT_HOME requirement; the necessary mahout jars can subsequently be supplied from your custom location via the `customJars` parameter.

(1b) Alternatively, set addMahoutJars = false and add the jars via a supplied custom sparkConf (which is the base configuration for everything, before mahout adds its own requirements to the configuration).

(2) Finally, you can take over spark context creation completely and wrap an already existing context into a mahout context via the implicit (or explicit) conversion provided in the same package, `sc2sdc`. E.g., implicitly:

    import o.a.m.sparkbindings._
    val mahoutContext: SparkDistributedContext = sparkContext // sparkContext is of type o.a.spark.SparkContext

That's it. Note that in that case you take on more work than just adjusting the context's JAR classpath.
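To make this a bit more concrete, here is a rough sketch of (1a)/(1b) combined, with (2) commented at the bottom. This is illustrative only and assumes mahout-math, mahout-math-scala and mahout-spark are on the driver classpath; the jar paths and version numbers in `myJarList` are hypothetical, and the exact parameter names and the kryo registrator class should be double-checked against the current sparkbindings sources:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.mahout.sparkbindings._

// (1b): the base configuration; mahout adds its own requirements on top.
// If you take over context creation entirely (2), you must also enforce
// the kryo serialization minimums that mahoutSparkContext() would set.
val conf = new SparkConf()
  .setAppName("zeppelin-mahout")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrator",
       "org.apache.mahout.sparkbindings.io.MahoutKryoRegistrator")

// (1a): skip the MAHOUT_HOME lookup and supply jars from a custom
// location, e.g. somewhere under the Zeppelin home. Paths are made up.
val myJarList = Seq(
  "/opt/zeppelin/local-repo/mahout-math-0.12.0.jar",
  "/opt/zeppelin/local-repo/mahout-math-scala_2.10-0.12.0.jar",
  "/opt/zeppelin/local-repo/mahout-spark_2.10-0.12.0.jar")

implicit val mahoutCtx = mahoutSparkContext(
  masterUrl = "local[*]",
  appName = "zeppelin-mahout",
  customJars = myJarList,
  addMahoutJars = false,
  sparkConf = conf)

// (2): or create the SparkContext yourself and rely on the sc2sdc
// conversion (pulled in by the sparkbindings._ import):
//   val sc = new SparkContext(conf.setJars(myJarList))
//   val mahoutCtx2: SparkDistributedContext = sc
```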
You will have to do all the customizations mahout does to the context, such as ensuring the minimum requirements for kryo serialization (you can see in the code what is currently enforced, but I think it is largely just the kryo serialization requirement).

Now, if you want to do a custom classpath: naturally, you don't need all of the mahout jars. For spark backend execution, you need to filter down to just mahout-math, mahout-math-scala and mahout-spark. I am fairly sure the modern state of the project also requires mahout-spark-[blah]-dependency-reduced.jar to be distributed to the backend as well (these are the minimal shaded 3rd-party dependencies, apparently engaged by some algorithms in the backend too -- they used to be absent from the backend requirements, though).

-d

On Wed, Jun 1, 2016 at 7:47 AM, Trevor Grant <trevor.d.gr...@gmail.com> wrote:
> I'm trying to refactor the Mahout dependency from the pom.xml of the Spark
> interpreter (adding Mahout integration to Zeppelin).
>
> Assuming MAHOUT_HOME is available, I see that the jars in a source build
> live in a different place than the jars in the binary distribution.
>
> I'm to the point where I'm trying to come up with a good place to pick up
> the required jars while allowing for:
> 1. Flexibility in Mahout versions
> 2. Not writing a huge block of code designed to scan several conceivable
> places throughout the file system.
>
> One thought was to put the onus on the user to move the desired jars to a
> local repo within the Zeppelin directory.
>
> Wanted to open this up to input from users and dev as I consider it.
>
> Is documentation specifying which JARs need to be moved to a specific
> directory, and the places you are likely to find them, too much to ask of
> users?
>
> Other approaches?
>
> For background, Zeppelin starts a Spark shell and we need to make sure all
> of the required Mahout jars get loaded into the class path when spark
> starts. The question is where all of these JARs relatively live.
>
> Thanks for any feedback,
> tg
>
> Trevor Grant
> Data Scientist
> https://github.com/rawkintrevo
> http://stackexchange.com/users/3002022/rawkintrevo
> http://trevorgrant.org
>
> *"Fortunate is he, who is able to know the causes of things." -Virgil*