Well, that's not entirely true: you can in fact train in parallel on
different segments of your dataset, thereby creating an ensemble. Pair
the outputs with a classifier UDF that knows how to take advantage of
that, and suddenly you have a massively parallel ETL engine that can
do ML as part of its normal day-to-day work.
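
To make the idea concrete, here is a rough sketch in plain Python (not Pig,
and not the pig-vector or Mahout API). It assumes a simple one-variable
linear model trained by SGD; the names train_sgd, split_segments and
ensemble_predict, and the synthetic data, are made up for illustration.
On a cluster each segment would go to its own reducer, and the averaging
step is what the classifier UDF would do.

```python
import random


def train_sgd(rows, epochs=200, lr=0.1):
    """Fit y = w*x + b on one segment with plain stochastic gradient descent."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for x, y in rows:
            err = (w * x + b) - y  # prediction error on this sample
            w -= lr * err * x
            b -= lr * err
    return w, b


def split_segments(rows, n):
    """Deal rows round-robin into n disjoint segments."""
    return [rows[i::n] for i in range(n)]


def ensemble_predict(models, x):
    """Combine the per-segment models by averaging their predictions."""
    return sum(w * x + b for w, b in models) / len(models)


# Synthetic data (an assumption for illustration): y = 2x + 1 plus small noise.
rng = random.Random(0)
data = [(i / 100.0, 2.0 * (i / 100.0) + 1.0 + rng.gauss(0.0, 0.05))
        for i in range(100)]

# Each segment could be trained by its own reducer; here it is a plain loop.
models = [train_sgd(segment) for segment in split_segments(data, 4)]
prediction = ensemble_predict(models, 0.5)  # should land near 2.0
```

Averaging is only one way to combine the models; voting or weighting by
held-out error would fit the same shape.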



On Mon, Mar 12, 2012 at 8:06 PM, Dmitriy Lyubimov <[email protected]> wrote:
> Yes, that's what I meant by "almost none". It would seem to me that
> pig-vector is technically a bridge from Pig schema to some (at the
> moment perhaps quite limited) Mahout functionality rather than
> something fundamentally leaning on Pig's own capability. It would seem
> to me that for that workflow there's no fundamental reason to use Pig
> unless the training data is already easily accessible through Pig schema
> for some reason (and pretty much nothing else), OR the author finds it
> excruciatingly difficult to massage the prepped data using something
> else.
>
> Also, afaict it uses a single reducer, which goes back to the duality of
> the regression training: you can't do it in parallel (not with this
> tool anyway), but then you don't really need parallelism that much
> (provided it converges quickly), so why use Pig at all as an MR
> tool? Unless the massaging phase is somehow much heavier than the training
> phase. It is quite possible I don't understand the proposed flow, but
> the motivation to use Pig in this case seems a little bit artificial to
> me. It is also possible that one of the motivations is to piggy-back on
> the Elephant Bird adapters there, but then again it feels like going through
> a lot of trouble to get things into Pig just for the sake of getting
> things into Pig. We could just as easily vectorize the data using a
> Java job, it would seem to me, and run the training in the guts of the
> reducer.
>
> Pig is a great ETL & prep tool, but IMO it lacks many of the brevity
> constructs that are so desirable when working with multidimensional
> datasets, especially when quick prototyping and a variety of approaches
> are desired.
>
> On Mon, Mar 12, 2012 at 4:31 PM, Dmitriy Ryaboy <[email protected]> wrote:
>> There's some not-so-public work we are doing at Twitter (vote for the
>> Hadoop Summit talk!) and also Ted Dunning's Mahout integration:
>> https://github.com/tdunning/pig-vector
>>
>> On Mon, Mar 12, 2012 at 1:02 PM, Dmitriy Lyubimov <[email protected]> wrote:
>>> No good public attempts that I know of exist to put ML kind of
>>> stuff on top of Pig (well, almost none). There are some statistical
>>> packages written at Yahoo, but afaik they don't directly do what you
>>> need.
>>>
>>> Pig is a rather excellent data prep pipeline, but IMO it is not as
>>> excellent as something like R-Hadoop.
>>>
>>> Also, depending on the # of your predictors and the training latency
>>> required, you may not need MapReduce at all to train something like
>>> stochastic gradient descent-based schemes. They converge way too fast
>>> to really take advantage of MR based methods (again, in most typical
>>> settings of # of predictors). If you do have a virtually unbounded
>>> number of predictors, you probably will need some techniques to reduce
>>> it anyway (such as feature hashing found in Mahout). So perhaps
>>> there's an easier way to do actual training other than using Pig.
>>>
>>> -d
>>>
>>> On Mon, Mar 12, 2012 at 12:21 AM, chethan <[email protected]> wrote:
>>>> Hi,
>>>>
>>>> We want to do linear regression analysis to achieve interpolation for
>>>> a set of values, using Pig scripts.
>>>> Do we have any built-in functions to achieve this? If not, how can we
>>>> achieve it?
>>>>
>>>> Thanks & Regards
>>>> Chethan Prakash.
