Once more from the top. There is a Hadoop convention, and as I read the code it has nothing to do with the MANIFEST.MF.
In the Hadoop convention, if someone calls setJar on the job conf, the 'lib/' folder of the indicated jar is unpacked and the jars inside it are added to the classpath on whatever nodes the job runs code on. If no one calls setJar, then the only thing on the classpath is the jar itself, unless you make other arrangements (as with the distributed cache).

I'm not an evangelist for the maven-shade-plugin, but my very unscientific impression is that people walk up to Mahout and expect the mahout command to just 'work'. Unless someone can unveil a way to script the exploitation of the distributed cache, that means the jar file the mahout command hands to the hadoop command has to use the 'lib/' convention, with the correct structure of top-level classes and lib-ed dependency jars. Further, any unsophisticated user who goes to incorporate Mahout into a larger structure has to do likewise.

We could avoid exciting uses of the shade plugin altogether if we didn't have these static methods that initialize jobs and call setJarByClass on themselves. However, I don't see that happening for 0.5 unless we want to push the schedule back and make a concerted effort. Further, I am concerned, based on Jake's remarks, that even following the Hadoop 'lib/' convention correctly doesn't always work, and we have no diagnostic insight into the nature of the failure.

So it seems at the instant as if our choices are to hold our noses and shade, or give up on a trivial command line that runs our jobs without a prerequisite of pushing the dependencies out into the cluster.
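For concreteness, a job jar following the 'lib/' convention can be produced with the maven-assembly-plugin and a descriptor along these lines. This is only a sketch of the general shape, not a proposal for our build; the descriptor id and the exact include/exclude patterns are illustrative:

```xml
<!-- Hypothetical assembly descriptor for a Hadoop-style "job" jar:
     the module's own classes go at the top level of the jar,
     and its runtime dependencies land under lib/ as nested jars. -->
<assembly>
  <id>job</id>
  <formats>
    <format>jar</format>
  </formats>
  <includeBaseDirectory>false</includeBaseDirectory>
  <dependencySets>
    <!-- Runtime dependencies, kept as jars, placed in lib/ -->
    <dependencySet>
      <unpack>false</unpack>
      <scope>runtime</scope>
      <outputDirectory>lib</outputDirectory>
      <excludes>
        <exclude>${groupId}:${artifactId}</exclude>
      </excludes>
    </dependencySet>
    <!-- The module's own classes, unpacked to the jar root -->
    <dependencySet>
      <unpack>true</unpack>
      <includes>
        <include>${groupId}:${artifactId}</include>
      </includes>
    </dependencySet>
  </dependencySets>
</assembly>
```

The resulting jar is what setJar would need to point at for the lib/ unpacking to kick in on the task nodes.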
