Yes, that's what I meant by "almost none". It seems to me that pig-vector is technically a bridge from a Pig schema to some (at the moment perhaps quite limited) Mahout functionality, rather than something that fundamentally leans on Pig's own capabilities. For that workflow I see no fundamental reason to use Pig unless the training data is already easily accessible through a Pig schema for some reason (and through pretty much nothing else), or the author finds it excruciatingly difficult to massage the prepped data with something else.
Also, afaict it uses a single reducer, which goes back to the duality of regression training: you can't do it in parallel (not with this tool, anyway), but then you don't really need parallelism that much (provided it converges quickly), so why use Pig as an MR tool at all? Unless the massaging phase is somehow much heavier than the training phase. It is quite possible I don't understand the proposed flow, but the motivation to use Pig in this case seems a little artificial to me. It is also possible that one of the motivations is to piggy-back on the Elephant Bird adapters, but then again it feels like going through a lot of trouble to get things into Pig just for the sake of getting things into Pig. We could just as easily vectorize the data using a Java job, it seems to me, and run the training in the guts of the reducer.

Pig is a great ETL & prep tool, but IMO it lacks many of the brevity constructs one wants when working with multidimensional datasets, especially when quick prototyping and a variety of approaches are desired.

On Mon, Mar 12, 2012 at 4:31 PM, Dmitriy Ryaboy <[email protected]> wrote:
> There's some not-so-public work we are doing at Twitter (vote for the
> Hadoop Summit talk!) and also Ted Dunning's Mahout integration:
> https://github.com/tdunning/pig-vector
>
> On Mon, Mar 12, 2012 at 1:02 PM, Dmitriy Lyubimov <[email protected]> wrote:
>> No good public attempts known to me exist to put ML kind of
>> stuff on top of Pig (well, almost none). There are some statistical
>> packages written at Yahoo but afaik they don't do directly what you
>> need.
>>
>> Pig is a somewhat excellent data prep pipeline, but IMO is not as
>> excellent as something like R-Hadoop.
>>
>> Also depending on # of your predictors and training latency required,
>> you may not need a map reduce at all to train something like
>> stochastic gradient descent-based schemes. They converge way too fast
>> to really take advantage of MR based methods (again, in most typical
>> settings of # of predictors). If you do have a virtually unbounded
>> number of predictors, you probably will need some techniques to reduce
>> it anyway (such as feature hashing found in Mahout). So perhaps
>> there's an easier way to do actual training other than using Pig.
>>
>> -d
>>
>> On Mon, Mar 12, 2012 at 12:21 AM, chethan <[email protected]> wrote:
>>> Hi,
>>>
>>> We want to do Linear regression analysis to achieve Interpolation for
>>> a set of values, using PIG Scripts.
>>> Do we have any in-built functions to achieve this, if not how to achieve.
>>>
>>> Thanks & Regards
>>> Chethan Prakash.
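P.S. To make the point about not needing MapReduce for the training step a bit more concrete, here is a minimal sketch (plain Python, not Pig or Mahout code) of single-process SGD for linear regression over hashed features. Everything in it is an assumption for illustration only: the names hash_features, train_sgd and predict, the bucket count, the learning rate and the epoch count are all made up, and the hashing is a simplified take on the general feature hashing idea, not Mahout's encoders.

```python
import random
import zlib

NUM_BUCKETS = 32  # hypothetical size of the hashed feature space


def hash_features(record, num_buckets=NUM_BUCKETS):
    """Map {feature name: value} to a sparse {bucket index: value} dict."""
    hashed = {}
    for name, value in record.items():
        # crc32 is stable across runs, unlike Python's built-in hash() for str
        idx = zlib.crc32(name.encode("utf-8")) % num_buckets
        # colliding features simply add up in the same bucket
        hashed[idx] = hashed.get(idx, 0.0) + value
    return hashed


def predict(weights, bias, features):
    """Linear prediction: bias + w . x over the sparse hashed features."""
    return bias + sum(weights[i] * v for i, v in features.items())


def train_sgd(examples, num_buckets=NUM_BUCKETS, learning_rate=0.05,
              epochs=200, seed=0):
    """Fit y ~ w . x + b with plain SGD on squared error, one example at a time."""
    rng = random.Random(seed)
    examples = list(examples)  # copy so the caller's list is not reordered
    weights = [0.0] * num_buckets
    bias = 0.0
    for _ in range(epochs):
        rng.shuffle(examples)
        for features, target in examples:
            error = predict(weights, bias, features) - target
            # gradient step for 0.5 * error**2
            bias -= learning_rate * error
            for i, v in features.items():
                weights[i] -= learning_rate * error * v
    return weights, bias


# Toy usage: noiseless points on y = 2x + 1, then interpolate at x = 1.1
training = [(hash_features({"x": i * 0.25}), 2.0 * (i * 0.25) + 1.0)
            for i in range(9)]
weights, bias = train_sgd(training)
print(predict(weights, bias, hash_features({"x": 1.1})))
```

The whole fit is a single sequential loop over the examples, which is the same shape the work would take inside one reducer; the hashing step is what keeps the weight vector at a fixed size no matter how many distinct predictor names show up.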
