The difficulty is that vectorization often incorporates various kinds of interpretation of the original data. This can involve nested field access, parsing, and textual analysis, as well as the basic vector encoding. It may also involve running a classifier (possibly derived by clustering) on some inputs to produce an input variable.
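To make that layering concrete, here is a minimal sketch in plain Java. It is only an illustration under stated assumptions: the interface name DataEncoder is borrowed from this message, but its signature, the ExampleEncoder class, the "age" and "text" field names, and the hashing scheme are all hypothetical and are not an existing Mahout API.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical interface: the name comes from this thread, the signature is assumed.
interface DataEncoder {
    // Adds the encoded form of one raw record into the given vector.
    void encode(Map<String, String> record, double[] vector);
}

// Illustrative encoder combining field access, parsing, text analysis, and hashed encoding.
// Assumes the vector has at least two slots.
final class ExampleEncoder implements DataEncoder {
    @Override
    public void encode(Map<String, String> record, double[] vector) {
        // Field access plus numeric parsing: a continuous variable goes into slot 0.
        String age = record.get("age");
        if (age != null) {
            vector[0] += Double.parseDouble(age.trim());
        }
        // Textual analysis: lower-case, split on non-letters, hash each token into a slot.
        String text = record.get("text");
        if (text != null) {
            for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
                if (!token.isEmpty()) {
                    vector[slot("text:" + token, vector.length)] += 1.0;
                }
            }
        }
    }

    // Maps a feature name to an index in [1, size), keeping slot 0 for the numeric field.
    static int slot(String feature, int size) {
        return 1 + Math.floorMod(feature.hashCode(), size - 1);
    }
}
```

Even this toy case hard-codes which fields exist, how each one is parsed, and how tokens are produced, which is exactly the information a declarative specification would have to carry.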
How to specify this in full generality is a difficult problem. The complementary problem is how to restrict what you can do while still allowing sufficient generality to meet most needs. That is a hard problem as well.

It may be that the solution is to just provide simple examples and tell people to write some Java (implements DataEncoder). That isn't all bad.

On Mon, Apr 25, 2011 at 10:56 AM, Dmitriy Lyubimov <[email protected]> wrote:

> I am not sure I see the difficulty, but it is possible we are talking
> about slightly different things.
> Hadoop solves this stuff thru some pluggable strategies, such as
> InputFormat.
>
> Those strategies are parameterized (and also perhaps persisted) thru
> some form of declarative definitions (if we keep the analogy with Hadoop,
> they use Configuration stuff for serializing something like that --
> but of course property-based definitions are probably quite
> underwhelming for this case). Similarly, Lucene defines Analyzer
> preprocessing strategies. Surely, we could probably define some
> strategies handling rows of re-standardized inputs producing
> vectorized and standardized inputs as a result.
>
> A little bit bigger Q is what to use for pre-vectorized inputs, as
> Vector obviously won't handle various datatypes, esp. qualitative
> inputs.
>
> But perhaps we already have some of this, I am not sure. I saw a fair
> amount of classes that adapt various formats (what was it? TSV?
> ARFF?), perhaps we could strategize that as well.
>
> On Fri, Apr 22, 2011 at 9:10 AM, Ted Dunning <[email protected]>
> wrote:
>
> > Yes.
> >
> > But how do we specify the input? And how do we specify the encodings?
> >
> > This is what has always held me back in the past. Should we just allow
> > classes to be specified on the command line?
> >
> > On Fri, Apr 22, 2011 at 8:47 AM, Dmitriy Lyubimov <[email protected]>
> > wrote:
> >
> >> Maybe there's a space for an MR-based input conversion job indeed, as a
> >> command line routine? I was kind of thinking about the same. Maybe even
> >> along with standardization of the values. Some formal definition of
> >> inputs being fed to it.
