On Dec 23, 2011, at 6:21 PM, Donald A. Smith wrote:

> More on conversion from ARFF files:
>
> Looking at the code in MapBackedARFFModel.java (below), each string in the
> document is assigned a separate double (converted from an integer value).
> Nominals are treated similarly: each possible nominal/symbolic value is
> assigned an integer-valued double.
>
> When strings (or nominals) are converted to doubles, it seems to me that the
> conversion adds additional irrelevant structure that I don't want.
> Depending on the order in which the strings are added, the assigned doubles
> will vary. Adjacent strings in the ordering will be close together in the
> metric space/distance measure. For example, if "john" is 1, "bob" is 2, and
> "nancy" is 3, then john is closer to bob than to nancy. For nominals, that
> seems wrong. Most users will probably really want three binary attributes:
> one for john, one for bob, and one for nancy.
>
We could perhaps use the SGD vector encoding stuff here?

> Am I correct that representing nominals and strings as doubles (in a single
> attribute) introduces distracting structure (distance relations)? Maybe I'm
> missing something.
>
> What I may want is to create a different attribute for each possible value of
> each component of the URL (counting from the left). Attribute component1_1
> through component1_k would be binary attributes representing the k possible
> values in the first component of the URL. Similarly for component2_1, ...
> Weka has its own utility class for converting string attributes
> to nominal attributes. That might give me what I want, for path based
> data. I'd need to preprocess the data.

Or implement your own ARFFModel.

>
> For URLs I have additional structure: ordering on the URL components. But if
> I just wanted to represent a document as an unordered bag-of-words, then each
> possible string or nominal should become a separate binary attribute.
> MapBackedARFFModel.java doesn't seem to do the right thing.

We can patch this if you have an alternate implementation.

>
> Seems like a compressed binary format would be useful for representing such
> attributes, unless you also needed a count.
>
> Thanks, Don
>
> --- On Wed, 12/21/11, Grant Ingersoll <[email protected]> wrote:
>
>
> From: Grant Ingersoll <[email protected]>
> Subject: Re: Will "mahout arff.vector" correctly convert string
> attributes?
> To: [email protected]
> Date: Wednesday, December 21, 2011, 10:09 AM
>
> The javadocs on ARFFVectorIterable say:
> * Attribute type handling:
> * <ul>
> * <li>Numeric -> As is</li>
> * <li>Nominal -> ordinal(value) i.e. @attribute lumber
> {'\'(-inf-0.5]\'','\'(0.5-inf)\''}
> * will convert -inf-0.5 -> 0, and 0.5-inf -> 1</li>
> * <li>Dates -> Convert to time as a long</li>
> * <li>Strings -> Create a map of String -> long</li>
> * </ul>
>
> The code for this is in MapBackedARFFModel which implements ARFFModel, so
> I suspect if it doesn't do exactly as you wish, it can be overridden.
>
> On Dec 21, 2011, at 12:37 PM, Donald A. Smith wrote:
>
> > Weka's ARFF format allows string attributes.
> >
> > @ATTRIBUTE userName string
> >
> > Will "mahout arff.vector" correctly handle conversion from such strings
> > to vectors in such a way that the attribute will, effectively, be treated
> > the same as a nominal attribute? That is, will the set of strings be
> > converted into a set of nominal attributes (one for each possible string
> > value)?
> >
> > @ATTRIBUTE userName {bob, fred, harry, jill, betsy, george, bill}
> >
> > In general, will I lose any information by using arff.vector?
> >
> > For date attributes, will mahout insert derived attributes (hour of
> > day, day of week)? I presume not and I presume I have to add them myself.
> >
> > Thanks, Don
>
> --------------------------------------------
> Grant Ingersoll
> http://www.lucidimagination.com
>
>
>

--------------------------------------------
Grant Ingersoll
http://www.lucidimagination.com
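
For anyone following the thread, here is a minimal sketch of the distance concern Don describes, using his john/bob/nancy example. It is not Mahout code: the class NominalEncodingSketch and its methods buildIndex, ordinal, oneHot and euclidean are hypothetical names made up for illustration, and the ordinal numbering (first value seen gets 1) is an assumption that only loosely mirrors the String -> long map described in the javadoc above.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; none of these names exist in Mahout.
class NominalEncodingSketch {

    // Give each distinct value the next integer (starting at 1) in order of first appearance.
    static Map<String, Integer> buildIndex(List<String> values) {
        Map<String, Integer> index = new LinkedHashMap<>();
        for (String v : values) {
            index.putIfAbsent(v, index.size() + 1);
        }
        return index;
    }

    // Ordinal encoding: one attribute holding the integer code as a double.
    static double[] ordinal(String value, Map<String, Integer> index) {
        return new double[] {index.get(value).doubleValue()};
    }

    // One-hot encoding: one binary attribute per distinct value.
    static double[] oneHot(String value, Map<String, Integer> index) {
        double[] v = new double[index.size()];
        v[index.get(value) - 1] = 1.0;
        return v;
    }

    // Plain Euclidean distance between two equal-length vectors.
    static double euclidean(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    public static void main(String[] args) {
        Map<String, Integer> index = buildIndex(Arrays.asList("john", "bob", "nancy"));

        // Ordinal codes make john look closer to bob than to nancy.
        System.out.println("ordinal john-bob:   "
            + euclidean(ordinal("john", index), ordinal("bob", index)));
        System.out.println("ordinal john-nancy: "
            + euclidean(ordinal("john", index), ordinal("nancy", index)));

        // One-hot vectors put every pair of distinct names at the same distance.
        System.out.println("one-hot john-bob:   "
            + euclidean(oneHot("john", index), oneHot("bob", index)));
        System.out.println("one-hot john-nancy: "
            + euclidean(oneHot("john", index), oneHot("nancy", index)));
    }
}
```

With the ordinal codes the john-nancy distance is twice the john-bob distance, purely because of insertion order. With one binary attribute per value, the two distances come out equal. The cost is one dimension per distinct value, which is presumably where a sparse vector or a hashed encoding would help.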
