I don't think stuff like pre-clustering or dimensionality reduction should be included. Just the summarization, the hashing trick, and common strategies for parsing non-quantitative inputs should be covered in the book. In addition, we might leave some space for writing custom strategies (much like Hadoop leaves room for writing custom input formats).
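To make that concrete, here's a rough sketch of what such a custom strategy could look like for a qualitative field using the hashing trick. This is illustration only -- the class name and the multi-probe scheme are mine, not existing Mahout classes:

import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;

// Illustrative only: hashes "field:value" pairs straight into a fixed-width
// sparse vector, so qualitative inputs need no dictionary pass over the data.
public class HashedCategoricalEncoder {

  private final String fieldName;
  private final int numProbes;   // several probes soften hash collisions

  public HashedCategoricalEncoder(String fieldName, int numProbes) {
    this.fieldName = fieldName;
    this.numProbes = numProbes;
  }

  // Adds weight 1.0 at each hashed position for the given categorical value.
  public void addToVector(String value, Vector v) {
    for (int probe = 0; probe < numProbes; probe++) {
      int h = (fieldName + ':' + value + ':' + probe).hashCode();
      int index = Math.abs(h % v.size());
      v.set(index, v.get(index) + 1.0);
    }
  }

  public static void main(String[] args) {
    Vector v = new RandomAccessSparseVector(1000);
    new HashedCategoricalEncoder("color", 2).addToVector("red", v);
    System.out.println(v);   // two non-zero cells for the "red" category
  }
}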
But if there's pre-clustering and/or dimensionality reduction (PCA-like stuff), that would be a pipeline, not just input processing, wouldn't it? I don't think of input processing as pipelined processing.

On Mon, Apr 25, 2011 at 11:16 AM, Ted Dunning <[email protected]> wrote:
> The difficulty is that vectorization often incorporates various kinds of
> interpretation of the original data. This can involve the nesting of field
> access, parsing, and textual analysis, as well as the basic vector encoding.
> It may involve running a classifier (possibly derived by clustering) on some
> inputs to produce an input variable.
>
> How to specify this in full generality is a difficult problem.
>
> The complementary problem is how to restrict what you can do while still
> allowing sufficient generality to meet most needs. That is a hard problem as
> well.
>
> It may be that the solution is to just provide simple examples and tell
> people to write some Java (implements DataEncoder). That isn't all bad.
>
> On Mon, Apr 25, 2011 at 10:56 AM, Dmitriy Lyubimov <[email protected]> wrote:
>
>> I am not sure I see the difficulty, but it is possible we are talking
>> about slightly different things.
>> Hadoop solves this through pluggable strategies, such as InputFormat.
>>
>> Those strategies are parameterized (and perhaps also persisted) through
>> some form of declarative definitions (to keep the analogy with Hadoop,
>> it uses the Configuration mechanism for serializing something like that --
>> though of course property-based definitions are probably quite
>> underwhelming for this case). Similarly, Lucene defines Analyzer
>> preprocessing strategies. Surely we could define some strategies that
>> handle rows of re-standardized inputs and produce vectorized and
>> standardized inputs as a result.
>>
>> A slightly bigger question is what to use for pre-vectorized inputs, since
>> Vector obviously won't handle various datatypes, especially qualitative
>> inputs.
>>
>> But perhaps we already have some of this; I am not sure. I saw a fair
>> amount of classes that adapt various formats (what was it? TSV? ARFF?);
>> perhaps we could turn those into strategies as well.
>>
>> On Fri, Apr 22, 2011 at 9:10 AM, Ted Dunning <[email protected]> wrote:
>> > Yes.
>> >
>> > But how do we specify the input? And how do we specify the encodings?
>> >
>> > This is what has always held me back in the past. Should we just allow
>> > classes to be specified on the command line?
>> >
>> > On Fri, Apr 22, 2011 at 8:47 AM, Dmitriy Lyubimov <[email protected]> wrote:
>> >
>> >> Maybe there's indeed a place for an MR-based input conversion job as a
>> >> command-line routine? I was kind of thinking about the same. Maybe even
>> >> along with standardization of the values, and some formal definition of
>> >> the inputs being fed to it.
>> >>
>> >
>> >
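For the "implements DataEncoder" route Ted mentions above, I imagine the contract could be as small as this. The interface name comes from Ted's message; the method names and the Configurable wiring are just my guess, mirroring how Hadoop plugs in a custom InputFormat:

import org.apache.hadoop.conf.Configurable;
import org.apache.mahout.math.Vector;

// Hypothetical contract for a pluggable vectorization strategy. Extending
// Configurable lets a driver pass declarative parameters to the encoder the
// same way Hadoop configures an InputFormat.
public interface DataEncoder extends Configurable {

  // Cardinality of the vectors this encoder produces.
  int outputCardinality();

  // Parse one raw record (a text line, a TSV row, ...) into a vector.
  Vector encode(String rawRecord);
}

A driver could then instantiate whatever class name is given on the command line (Ted's earlier question about specifying classes there), configure it from the job Configuration, and hand each input record to encode().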
