Well, that's not entirely true -- you can in fact train in parallel on different segments of your dataset, thereby creating an ensemble. Pair the outputs with a classifier UDF that knows how to take advantage of that, and suddenly you have a massively parallel ETL engine that can do ML as part of its normal day-to-day.
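To make the idea concrete, here is a rough sketch in plain Python. It is not the Twitter or pig-vector code, just an illustration under a few assumptions: the names `fit_sgd`, `train_ensemble` and `predict` are made up for this example, each segment gets a one-variable linear model trained with plain SGD, and the "UDF" side simply averages the per-segment predictions. In a Pig flow, each segment would roughly correspond to one group handed to its own reducer.

```python
import random


def fit_sgd(rows, epochs=200, lr=0.01):
    """Fit y = w*x + b on one segment with plain stochastic gradient descent."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for x, y in rows:
            err = (w * x + b) - y   # prediction error on this sample
            w -= lr * err * x       # gradient step for the slope
            b -= lr * err           # gradient step for the intercept
    return w, b


def train_ensemble(rows, n_segments):
    """Split the data into segments and train one model per segment.

    The segments are independent, so each fit could run in its own task.
    """
    segments = [rows[i::n_segments] for i in range(n_segments)]
    return [fit_sgd(segment) for segment in segments]


def predict(models, x):
    """Combine the ensemble by averaging the per-model predictions (an assumption)."""
    return sum(w * x + b for w, b in models) / len(models)


# Synthetic data for illustration only: y = 2x + 1 plus a little noise.
random.seed(0)
data = [(i / 10.0, 2.0 * (i / 10.0) + 1.0 + random.gauss(0, 0.05))
        for i in range(100)]

models = train_ensemble(data, n_segments=4)
estimate = predict(models, 5.0)  # interpolate at x = 5.0
print(len(models), round(estimate, 2))
```

Averaging is only the simplest way to combine the models; the point is that the training step itself no longer has to go through a single reducer.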
On Mon, Mar 12, 2012 at 8:06 PM, Dmitriy Lyubimov <[email protected]> wrote:
> yes that's what i meant by "almost none". It would seem to me that pig
> vector it is technically a bridge between pig schema to some (and at
> the moment perhaps quite limited) Mahout functionality rather than
> something fundamentally leaning on Pig's own capability. It would seem
> to me for that workflow there's no fundamental reason to use Pig
> unless the training data is already easily accessible thru Pig schema
> for some reason (and pretty much nothing else OR the author finds it
> excruciatingly difficult to massage the prepped data using something
> else).
>
> Also, afaict it uses single reducer which goes back to the duality of
> the regression training: you can't do it in parallel (not with this
> tool anyway) but then you don't really need parallelism so much
> (provided it converges quickly) and hence why Pig at all as a MR
> tool. Unless massaging phase is somehow much heavier than the training
> phase. it is quite possible i don't understand the proposed flow but
> motivation to use Pig in this case seems a little bit artificial to
> me. It is also possible one of motivations is to piggy-back on
> elephant bird adapters there but then again it feels like going thru a
> lot of trouble to get things into Pig just for the sake of getting
> things into Pig. We could just as easily vectorize the data using
> java job, it would seem to me, and run the training in the guts of the
> reducer.
>
> Pig is a great ETL & prep tool but IMO it lacks so many brevity
> constructs so much desired when working with multidimensional
> datasets. Esp. when quick prototyping and variety of approaches is
> desired.
>
> On Mon, Mar 12, 2012 at 4:31 PM, Dmitriy Ryaboy <[email protected]> wrote:
>> There's some not-so-public work we are doing at Twitter (vote for the
>> Hadoop Summit talk!) and also Ted Dunning's Mahout integration:
>> https://github.com/tdunning/pig-vector
>>
>> On Mon, Mar 12, 2012 at 1:02 PM, Dmitriy Lyubimov <[email protected]> wrote:
>>> No known public good attempts known to me exist to put ML kind of
>>> stuff on top of pig . (well almost none). There are some statistical
>>> packages written at Yahoo but afaik they don't do directly what you
>>> need.
>>>
>>> Pig is somewhat excellent data prep pipeline, but IMO is not as
>>> excellent as something like R-Hadoop.
>>>
>>> Also depending on # of your predictors and training latency required,
>>> you may not need a map reduce at all to train something like
>>> stochastic gradient descent-based schemes. They converge way too fast
>>> to really take advantage of MR based methods (again, in most typical
>>> settings of # of predictors). If you do have a virtually unbounded
>>> number of predictors, you probably will need some techniques to reduce
>>> it anyway (such as feature hashing found in Mahout). So perhaps
>>> there's an easier way to do actual training other than using Pig.
>>>
>>> -d
>>>
>>> On Mon, Mar 12, 2012 at 12:21 AM, chethan <[email protected]> wrote:
>>>> Hi,
>>>>
>>>> We want want to do Linear regression analysis to achieve Interpolation for
>>>> a set of values, using PIG Scripts.
>>>> Do we have any in-built functions to achieve this, if not how to achieve.
>>>>
>>>> Thanks & Regards
>>>> Chethan Prakash.
