> I'm not an evangelist for the maven-shade-plugin, but my very unscientific impression is that people walk up to Mahout and expect the mahout command to just 'work'. Unless someone can unveil a way to script the exploitation of the distributed cache, that means that the jar file that the mahout command hands to the hadoop command has to use the 'lib/' convention, and have the correct structure of raw and lib-ed classes.
Here is what I think: we require MAHOUT_HOME to be set up; most Hadoop projects require something of the sort. AbstractJob then implements walking the lib tree (based on MAHOUT_HOME or otherwise derived knowledge of the lib location) and throws all the jars it finds there onto the backend classpath. Other Hadoop projects do something similar. Where's the complexity in that?

> Further, any unsophisticated user who goes to incorporate Mahout into a larger structure has to do likewise.

Yes. There are two issues here:

1) Client-side API use. That should be fine as long as MAHOUT_HOME points to the right place; since the user is not involved in writing driver code, we are golden.

2) Backend-side use of Mahout. Not terribly expected, but maybe: e.g. if Mahout allows the user to specify external strategies to do 'stuff', such as an external Lucene analyzer in seq2sparse. In that case we need to figure out how to handle it ad hoc through the command line. Let's look at how other projects deal with the problem. Oh yes, they all implement their own custom mechanisms for these cases too:

-- Pig uses the custom command register(jar);
-- Hive has an auxlib folder in HIVE_HOME where it expects to find user jars.

Something similar should be good for us as part of the ecosystem, should it not?
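To make the AbstractJob idea concrete, here is a minimal sketch of the lib-tree walk, assuming MAHOUT_HOME is set and using Hadoop's "tmpjars" configuration property (the same property GenericOptionsParser fills in for -libjars; at submit time those jars are copied into the distributed cache and placed on the task classpath). The class and method names are hypothetical, not existing Mahout API:

```java
import java.io.File;
import java.io.FilenameFilter;

import org.apache.hadoop.conf.Configuration;

/**
 * Hypothetical helper: walk $MAHOUT_HOME/lib and hand every jar to the
 * backend by appending it to Hadoop's "tmpjars" property. The framework
 * then ships those jars via the distributed cache at job submission.
 */
public final class LibTreeClasspath {

  private LibTreeClasspath() {
  }

  public static void addLibJars(Configuration conf) {
    String mahoutHome = System.getenv("MAHOUT_HOME");
    if (mahoutHome == null) {
      throw new IllegalStateException("MAHOUT_HOME is not set");
    }
    File libDir = new File(mahoutHome, "lib");
    File[] jars = libDir.listFiles(new FilenameFilter() {
      @Override
      public boolean accept(File dir, String name) {
        return name.endsWith(".jar");
      }
    });
    if (jars == null || jars.length == 0) {
      return; // no lib dir or nothing to ship
    }
    StringBuilder paths = new StringBuilder();
    for (File jar : jars) {
      if (paths.length() > 0) {
        paths.append(',');
      }
      paths.append(jar.toURI().toString()); // file:/.../lib/foo.jar
    }
    // Preserve anything already registered (e.g. via -libjars).
    String existing = conf.get("tmpjars");
    conf.set("tmpjars",
        existing == null ? paths.toString() : existing + ',' + paths);
  }
}
```

An auxlib-style directory for user jars (the Hive approach, for things like an external Lucene analyzer) could reuse the same walk pointed at a second directory under MAHOUT_HOME.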
