Thanks Ted! Removing the elephant-bird dependency / build problems sounds like a good task to include in our plans for the hackday ... what are your thoughts on adding pig-vector to Mahout as a contrib module? Do you want to keep it separate, or have it eventually make its way into the project?
Praneet - thanks for throwing your hat in ;-) Sounds like you're doing some interesting things with Mahout and Pig already. Will definitely keep you in the loop as we work out the details ...

Cheers,
Tim

On Wed, May 2, 2012 at 1:43 PM, praneet mhatre <[email protected]> wrote:

> On Wed, May 2, 2012 at 11:13 AM, Ted Dunning <[email protected]> wrote:
>
>> On Wed, May 2, 2012 at 11:06 AM, Timothy Potter <[email protected]> wrote:
>>
>> > We're really keen on Ted's pig-vector project
>> > (https://github.com/tdunning/pig-vector) as we're building a number of
>> > classifiers on Mahout's SGD framework, with the bulk of our data being
>> > in Cassandra processed almost entirely with Pig. We'd love to hear
>> > about any planned features for the pig-vector project we can help out
>> > on. Any similar Pig-Mahout projects we should know about?
>> >
>>
>> The huge problem with pig-vector is that dependency on elephant bird makes
>> it really almost impossible to build. Elephant bird has obscure
>> dependencies on things like yaml-beans. That is a problem because the
>> yaml-beans maintainer has a forceful way of expressing his distaste for all
>> things to do with Maven and thus refuses to publish any artifacts in
>> standard ways. Actually, the maintainer has a rather forceful manner that
>> he applies to all interactions as far as I can tell.
>>
>> On the other hand, the necessary capabilities that pig-vector needs from
>> Elephant bird are quite minor and could probably be reasonably extracted. I
>> am under-water, however, and thus cannot finish that right away. I can and
>> will assist anybody who has the necessary time and enthusiasm. This might
>> make a very nice pig day effort.
>>
>>
>> > In general, we're reaching out today to see who else in the community
>> > is interested in better Pig / Mahout integration and what types of
>> > challenges they're facing? Any cool UDFs you'd like to share?
>> >
>>
>> Praneet at UCI ([email protected]) has been doing some interesting
>> work here to do with feature sharding in pig. Perhaps he can speak up.
>>
>
> Hello Timothy,
>
> I have tried writing sharded versions of classifiers and they seem to work
> well. But my code requires a pre-processing step before the classification
> and re-aggregation of results (which was easy when I worked with Weka).
> However, to be able to do the same in Mahout, I need something like
> pig-vector to take care of the pre-processing part.
>
> So yes, I am very interested in Pig / Mahout integration! But admittedly I
> only have introductory knowledge of Pig. And as far as the integration part
> goes, my contribution so far has been limited to testing the stuff Ted has
> written.
>
> But the idea of a Pig-Mahout hackday sounds great! And I would definitely
> like to be involved in it.
>
>
> --
> Praneet Mhatre
> Graduate Student
> Donald Bren School of ICS
> University of California, Irvine
