I don't think stuff like pre-clustering or dimensionality reduction
should be included; just the summarization, the hashing trick, and the
common strategies for parsing non-quantitative inputs belong in the
book. In addition, we might leave room for writing custom strategies
(pretty much like Hadoop leaves room for writing custom input formats);
see the sketch below.
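
For illustration, a rough sketch of what a pluggable hashed-encoding
strategy could look like. The interface and class names here are made up,
and a plain double[] stands in for a real sparse Vector:

import java.util.Map;

// Hypothetical pluggable strategy: turns one raw record into a numeric vector.
interface RecordEncoder {
  double[] encode(Map<String, String> record, int cardinality);
}

// Hashing-trick encoder: hashes "field=value" pairs into a fixed-width vector,
// so categorical (qualitative) inputs need no dictionary-building pass.
class HashingEncoder implements RecordEncoder {
  @Override
  public double[] encode(Map<String, String> record, int cardinality) {
    double[] v = new double[cardinality];
    for (Map.Entry<String, String> field : record.entrySet()) {
      String feature = field.getKey() + "=" + field.getValue();
      int index = Math.abs(feature.hashCode() % cardinality);
      v[index] += 1.0;  // accumulate on hash collisions rather than overwrite
    }
    return v;
  }
}

A custom strategy for some exotic input would then just be another
RecordEncoder implementation.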

But if there's pre-clustering and/or dimensionality reduction (PCA-like
stuff) involved, that would be a pipeline, not just input processing,
wouldn't it? I don't think of input processing as a pipelined process.

On Mon, Apr 25, 2011 at 11:16 AM, Ted Dunning <[email protected]> wrote:
> The difficulty is that vectorization often incorporates various kinds of
> interpretation of the original data.  This can involve nested field
> access, parsing, and textual analysis, as well as the basic vector
> encoding.  It may involve running a classifier (possibly derived by
> clustering) on some inputs to produce an input variable.
>
> How to specify this in full generality is a difficult problem.
>
> The complementary problem is how to restrict what you can do, but allow
> sufficient generality to meet most needs.  That is a hard problem as well.
>
> It may be that the solution is to just provide simple examples and tell
> people to write some Java (implements DataEncoder).  That isn't all bad.
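>
> Purely as a sketch (DataEncoder is just the name floated above, not an
> existing class, and the record-as-String input is an assumption too),
> the contract could be as small as:
>
>   import org.apache.mahout.math.Vector;
>
>   public interface DataEncoder {
>     // Turn one raw input record into a vector of the given width.
>     // Field access, parsing, text analysis, or running a pre-trained
>     // classifier on some fields would all live behind this one method.
>     Vector encode(String record, int cardinality);
>   }
>
> The command line would then only need a class name plus whatever settings
> the implementation wants to pull out of a Configuration-like bag.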
>
> On Mon, Apr 25, 2011 at 10:56 AM, Dmitriy Lyubimov <[email protected]> wrote:
>
>> I am not sure I see the difficulty, but it is possible we are talking
>> about slightly different things.
>> Hadoop solves this kind of thing through pluggable strategies, such as
>> InputFormat.
>>
>> Those strategies are parameterized (and perhaps also persisted) through
>> some form of declarative definition (to keep the analogy with Hadoop, it
>> uses the Configuration machinery to serialize that sort of thing, but of
>> course property-based definitions are probably quite underwhelming for
>> this case). Similarly, Lucene defines Analyzer preprocessing strategies.
>> Surely we could define some strategies that take rows of raw inputs and
>> produce vectorized and standardized inputs as a result.
>>
>> A somewhat bigger question is what to use for pre-vectorized inputs,
>> since Vector obviously won't handle various data types, especially
>> qualitative inputs.
>>
>> But perhaps we already have some of this, I am not sure. I saw a fair
>> number of classes that adapt various formats (what was it? TSV? ARFF?);
>> perhaps we could turn those into pluggable strategies as well.
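>>
>> To make the "pre-vectorized input" question concrete, I mean roughly a
>> row type like the following (the name and shape are made up, just to show
>> the idea of keeping typed fields around until encoding):
>>
>>   import java.util.HashMap;
>>   import java.util.Map;
>>
>>   // Hypothetical holder for one row before vectorization: quantitative
>>   // and qualitative fields keep their types instead of being forced into
>>   // doubles up front.
>>   public class RawRow {
>>     private final Map<String, Double> numeric =
>>         new HashMap<String, Double>();
>>     private final Map<String, String> categorical =
>>         new HashMap<String, String>();
>>
>>     public void setNumeric(String field, double value) {
>>       numeric.put(field, value);
>>     }
>>
>>     public void setCategorical(String field, String value) {
>>       categorical.put(field, value);
>>     }
>>
>>     public Map<String, Double> numericFields() {
>>       return numeric;
>>     }
>>
>>     public Map<String, String> categoricalFields() {
>>       return categorical;
>>     }
>>   }
>>
>> An Analyzer-style vectorization strategy would take something like RawRow
>> in and produce a standardized Vector out, and the format adapters (TSV,
>> ARFF, ...) would only be responsible for producing such rows.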
>>
>> On Fri, Apr 22, 2011 at 9:10 AM, Ted Dunning <[email protected]>
>> wrote:
>> > Yes.
>> >
>> > But how do we specify the input?  And how do we specify the encodings?
>> >
>> > This is what has always held me back in the past.  Should we just allow
>> > classes to be specified on the command line?
>> >
>> > On Fri, Apr 22, 2011 at 8:47 AM, Dmitriy Lyubimov <[email protected]>
>> > wrote:
>> >
>> >> Maybe there is indeed a place for an MR-based input conversion job as
>> >> a command-line routine? I was kind of thinking about the same. Maybe
>> >> even along with standardization of the values, and some formal
>> >> definition of the inputs being fed to it.
>> >>
>> >
>>
>
