On Wed, May 2, 2012 at 11:06 AM, Timothy Potter <[email protected]> wrote:
> We're really keen on Ted's pig-vector project
> (https://github.com/tdunning/pig-vector) as we're building a number of
> classifiers on Mahout's SGD framework, with the bulk of our data being
> in Cassandra processed almost entirely with Pig. We'd love to hear
> about any planned features for the pig-vector project we can help out
> on. Any similar Pig-Mahout projects we should know about?
>

The huge problem with pig-vector is that its dependency on Elephant Bird
makes it almost impossible to build. Elephant Bird has obscure
dependencies on things like yaml-beans. That is a problem because the
yaml-beans maintainer has a forceful way of expressing his distaste for
all things to do with Maven and thus refuses to publish any artifacts in
standard ways. Actually, the maintainer has a rather forceful manner that
he applies to all interactions as far as I can tell.

On the other hand, the capabilities that pig-vector needs from Elephant
Bird are quite minor and could probably be extracted without much
trouble. I am under water, however, and thus cannot finish that right
away. I can and will assist anybody who has the necessary time and
enthusiasm. This might make a very nice pig day effort.

> In general, we're reaching out today to see who else in the community
> is interested in better Pig / Mahout integration and what types of
> challenges they're facing? Any cool UDFs you'd like to share?
>

Praneet at UCI ([email protected]) has been doing some interesting work
here to do with feature sharding in Pig. Perhaps he can speak up.
