I see. I guess you mean nested preprocessors vs. pipelined jobs.
There are some efforts, e.g. Rapid Miner, that allow doing more than just input normalization in a formal model -- although I did not play with it enough. But they do *something*; perhaps it could be a source of inspiration. Is Rapid Miner's modelling closer to what you mean?

On Mon, Apr 25, 2011 at 12:14 PM, Ted Dunning <[email protected]> wrote:

> On Mon, Apr 25, 2011 at 12:04 PM, Dmitriy Lyubimov <[email protected]> wrote:
>
>> I don't think stuff like pre-clustering, dimensionality reduction
>> should be included. Just the summarization, hashing trick and common
>> strategies for parsing non-quantitative inputs included in the book.
>
> So you prefer the limited function option.
>
>> ...
>> But if there's pre-clustering and/or dimensionality reduction (PCA
>> like stuff), that would be a pipeline, not just input processing? I
>> don't think about input processing as being a pipelined processing.
>
> It isn't usually a pipeline as in map-reduce. Yes, it is a set of pure
> functions applied to the input variables to produce the actual predictor
> variables. Yes, these functions can be composed.
>
> If you are trying to do what Grant says (provide Mahout-as-a-service) then
> you need to provide some mechanism for adding these things.
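
To make the "composed pure functions" point concrete, here is a minimal sketch in plain Java. It is not Mahout's actual API; all class and method names (Preprocessor, Compose, Standardize, hashEncode) are hypothetical. It shows per-record input processing as function composition rather than a map-reduce pipeline, plus a hashing-trick encoding for a non-quantitative input:

import java.util.Arrays;

// Hypothetical sketch, not Mahout API.
// A pure function from one feature vector to another.
interface Preprocessor {
    double[] apply(double[] input);
}

// Composition: apply f, then g. The "pipeline" is just function composition.
class Compose implements Preprocessor {
    private final Preprocessor f, g;
    Compose(Preprocessor f, Preprocessor g) { this.f = f; this.g = g; }
    public double[] apply(double[] x) { return g.apply(f.apply(x)); }
}

// One stage: mean/stdev normalization using pre-summarized statistics,
// so applying it to a single record stays a pure function.
class Standardize implements Preprocessor {
    private final double[] mean, stdev;
    Standardize(double[] mean, double[] stdev) { this.mean = mean; this.stdev = stdev; }
    public double[] apply(double[] x) {
        double[] out = new double[x.length];
        for (int i = 0; i < x.length; i++) out[i] = (x[i] - mean[i]) / stdev[i];
        return out;
    }
}

public class PreprocessDemo {
    // Hashing trick for a non-quantitative input: bucket a raw string into d slots.
    static double[] hashEncode(String raw, int d) {
        double[] v = new double[d];
        v[Math.floorMod(raw.hashCode(), d)] += 1.0;
        return v;
    }

    public static void main(String[] args) {
        Preprocessor std = new Standardize(new double[] {1.0, 2.0}, new double[] {0.5, 4.0});
        Preprocessor clip = new Preprocessor() { // another pure stage: clip to [-3, 3]
            public double[] apply(double[] x) {
                double[] out = new double[x.length];
                for (int i = 0; i < x.length; i++) out[i] = Math.max(-3.0, Math.min(3.0, x[i]));
                return out;
            }
        };
        Preprocessor pipeline = new Compose(std, clip); // composed, not map-reduce
        System.out.println(Arrays.toString(pipeline.apply(new double[] {2.0, 10.0})));
        System.out.println(Arrays.toString(hashEncode("color=red", 8)));
    }
}

Note the summarization step (computing means and variances) runs once up front and is baked into the Standardize stage; after that, applying the composed function to any single record is pure, which is why this isn't a pipeline in the map-reduce sense.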
