Re: Does the Feature Hashing and Collision in the SGD will harm the performance of the algorithm?

Ted Dunning Mon, 25 Apr 2011 12:14:57 -0700

On Mon, Apr 25, 2011 at 12:04 PM, Dmitriy Lyubimov <[email protected]> wrote:


> I don't think stuff like pre-clustering, dimensionality reduction
> should be included. Just the summarization, hashing trick and common
> strategies for parsing non-quantitative inputs included in the book.
>

So you prefer the limited-function option.


> ...
> But if there's pre-clustering and/or dimensionality reduction (PCA
> like stuff), that would be a pipeline, not just input processing? I
> don't think about input processing as being a pipelined processing.
>

It isn't usually a pipeline in the map-reduce sense.  It is a set of pure
functions applied to the input variables to produce the actual predictor
variables.  And yes, these functions can be composed.
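
To make the composition point concrete, here is a minimal sketch in Python
(not Mahout code). The names compose, log_scale and hash_words, the field
names "price" and "text", and the choice of CRC32 as the hash are all
assumptions made up for illustration. Each transform is a pure function from
one record to a new record, and the encoder is just their composition.

```python
from functools import reduce
import math
import zlib


def compose(*fns):
    """Compose functions left to right: compose(f, g)(x) == g(f(x))."""
    return lambda x: reduce(lambda acc, fn: fn(acc), fns, x)


def log_scale(record):
    # Hypothetical transform: replace the raw 'price' with log(1 + price).
    # A copy is returned so the input record is never mutated.
    out = dict(record)
    out["price"] = math.log1p(record["price"])
    return out


def hash_words(num_buckets):
    # Hashing trick: each word in 'text' increments one of num_buckets
    # counters. Colliding words simply share a counter.
    def transform(record):
        out = {k: v for k, v in record.items() if k != "text"}
        for word in record["text"].split():
            bucket = zlib.crc32(word.encode("utf-8")) % num_buckets
            key = f"h{bucket}"
            out[key] = out.get(key, 0.0) + 1.0
        return out

    return transform


# The whole input-processing step is one composed pure function.
encode = compose(log_scale, hash_words(16))

raw = {"price": 0.0, "text": "cheap cheap offer"}
features = encode(raw)
```

Because each step only maps a record to a new record, adding pre-clustering
or a projection would just mean adding another function to the compose call.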

If you are trying to do what Grant says (provide Mahout-as-a-service) then
you need to provide some mechanism for adding these things.
