Yes, that's what I meant by "almost none". It seems to me that
pig-vector is technically a bridge from a Pig schema to some (and at
the moment perhaps quite limited) Mahout functionality, rather than
something that fundamentally leans on Pig's own capabilities. For that
workflow I see no fundamental reason to use Pig unless the training
data is already easily accessible through a Pig schema for some reason
(and through pretty much nothing else), or the author finds it
excruciatingly difficult to massage the prepped data using something
else.

Also, afaict it uses a single reducer, which goes back to the duality
of regression training: you can't do it in parallel (not with this
tool anyway), but then you don't really need parallelism that much
(provided it converges quickly). Hence the question: why use Pig as an
MR tool at all, unless the massaging phase is somehow much heavier than
the training phase? It is quite possible I don't understand the
proposed flow, but the motivation to use Pig in this case seems a
little artificial to me. It is also possible that one of the
motivations is to piggy-back on the Elephant Bird adapters, but then
again it feels like going through a lot of trouble to get things into
Pig just for the sake of getting things into Pig. It seems to me we
could just as easily vectorize the data using a Java job and run the
training in the guts of the reducer.
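
To illustrate why a single sequential pass is usually enough, here is a
minimal sketch of linear regression trained with plain stochastic
gradient descent. It is written in Python for brevity rather than as an
actual Hadoop reducer, and it is not Mahout's or pig-vector's API: the
names sgd_linear_regression and predict, and the epochs and lr
parameters, are assumptions made up for this example.

```python
import random


def sgd_linear_regression(rows, n_features, epochs=200, lr=0.1, seed=0):
    """Fit y ~ w.x + b by per-example SGD on squared error.

    rows: iterable of (features, target) pairs, features being a list
    of n_features floats. All names here are hypothetical.
    """
    rng = random.Random(seed)
    data = list(rows)
    w = [0.0] * n_features
    b = 0.0
    for _ in range(epochs):
        rng.shuffle(data)  # visit the examples in a fresh random order
        for x, y in data:
            pred = b + sum(wi * xi for wi, xi in zip(w, x))
            err = pred - y
            # Gradient step for the squared-error loss 0.5 * err**2.
            for i, xi in enumerate(x):
                w[i] -= lr * err * xi
            b -= lr * err
    return w, b


def predict(model, x):
    """Evaluate the fitted line/plane at x, e.g. to interpolate."""
    w, b = model
    return b + sum(wi * xi for wi, xi in zip(w, x))
```

The same loop could sit inside a single reducer that receives the
already vectorized rows; interpolation is then just a call to predict
on the fitted model. The step size and epoch count above are
illustrative and assume features scaled to roughly [0, 1].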

Pig is a great ETL & prep tool, but IMO it lacks many of the brevity
constructs that are so desirable when working with multidimensional
datasets, especially for quick prototyping and trying out a variety of
approaches.

On Mon, Mar 12, 2012 at 4:31 PM, Dmitriy Ryaboy <[email protected]> wrote:
> There's some not-so-public work we are doing at Twitter (vote for the
> Hadoop Summit talk!) and also Ted Dunning's Mahout integration:
> https://github.com/tdunning/pig-vector
>
> On Mon, Mar 12, 2012 at 1:02 PM, Dmitriy Lyubimov <[email protected]> wrote:
>> I know of no good public attempts to put ML kind of stuff on top of
>> Pig (well, almost none). There are some statistical
>> packages written at Yahoo but afaik they don't do directly what you
>> need.
>>
>> Pig is a rather excellent data prep pipeline, but IMO it is not as
>> excellent as something like R-Hadoop.
>>
>> Also depending on # of your predictors and training latency required,
>> you may not need a map reduce at all to train something like
>> stochastic gradient descent-based schemes. They converge way too fast
>> to really take advantage of MR based methods (again, in most typical
>> settings of # of predictors). If you do have a virtually unbounded
>> number of predictors, you probably will need some techniques to reduce
>> it anyway (such as feature hashing found in Mahout). So perhaps
>> there's an easier way to do actual training other than using Pig.
>>
>> -d
>>
>> On Mon, Mar 12, 2012 at 12:21 AM, chethan <[email protected]> wrote:
>>> Hi,
>>>
>>> We want to do linear regression analysis to achieve interpolation for
>>> a set of values, using Pig scripts.
>>> Are there any built-in functions to achieve this? If not, how can we
>>> achieve it?
>>>
>>> Thanks & Regards
>>> Chethan Prakash.
