I am not sure I see the difficulty, but it is possible we are talking about slightly different things. Hadoop solves this kind of thing through pluggable strategies, such as InputFormat.
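To make the analogy concrete, here is a minimal sketch of what such a plug point could look like on our side. All names here (RowVectorizer, configure, vectorize) are made up for illustration, not anything that exists in Mahout today:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.mahout.math.Vector;

  /**
   * Hypothetical plug point, analogous to InputFormat / Analyzer:
   * turns one raw, possibly mixed-type row into a standardized Vector.
   */
  public interface RowVectorizer {

    /** Read declarative strategy parameters (column types, encodings, ...) from the job conf. */
    void configure(Configuration conf);

    /** Vectorize one row; Object[] because a plain Vector can't carry qualitative values. */
    Vector vectorize(Object[] row);
  }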
Such strategies are parameterized (and perhaps also persisted) through some form of declarative definitions (to keep the Hadoop analogy, they use the Configuration machinery for serializing that sort of thing -- though of course property-based definitions are probably quite underwhelming for this case). Similarly, Lucene defines Analyzer preprocessing strategies.

Surely we could define strategies that take rows of raw (not yet standardized) inputs and produce vectorized, standardized inputs as a result. A slightly bigger question is what to use for the pre-vectorized inputs, since Vector obviously won't handle various datatypes, especially qualitative inputs. But perhaps we already have some of this, I am not sure. I saw a fair amount of classes that adapt various formats (what was it? TSV? ARFF?); perhaps we could strategize that as well.

On Fri, Apr 22, 2011 at 9:10 AM, Ted Dunning <[email protected]> wrote:
> Yes.
>
> But how do we specify the input? And how do we specify the encodings?
>
> This is what has always held me back in the past. Should we just allow
> classes to be specified on the command line?
>
> On Fri, Apr 22, 2011 at 8:47 AM, Dmitriy Lyubimov <[email protected]> wrote:
>
>> Maybe there's a space for an MR-based input conversion job indeed as a
>> command line routine? I was kind of thinking about the same. Maybe even
>> along with standardization of the values. Some formal definition of
>> inputs being fed to it.
>>
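Re Ted's question above about specifying classes on the command line: here is a rough, hedged sketch of how that might look with plain Hadoop machinery. The property key and the factory are hypothetical; the class name would come from a -D flag or a CLI option and be resolved via reflection:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.util.ReflectionUtils;

  public class VectorizerFactory {
    // Hypothetical property key, set from the command line.
    public static final String VECTORIZER_CLASS_KEY = "mahout.vectorizer.class";

    /** Instantiate whatever RowVectorizer implementation the conf names. */
    public static RowVectorizer create(Configuration conf) {
      Class<? extends RowVectorizer> cls =
          conf.getClass(VECTORIZER_CLASS_KEY, null, RowVectorizer.class);
      if (cls == null) {
        throw new IllegalStateException(VECTORIZER_CLASS_KEY + " not set");
      }
      RowVectorizer v = ReflectionUtils.newInstance(cls, conf);
      v.configure(conf);
      return v;
    }
  }

That at least keeps the driver generic and pushes the input/encoding specifics into the declarative definition plus the strategy class itself.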
