More on conversion from ARFF files: Looking at the code in MapBackedARFFModel.java (below), each distinct string in the document is assigned its own double (converted from an integer value). Nominals are treated similarly: each possible nominal/symbolic value is assigned an integer-valued double.
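To check that I'm reading that correctly, here is a minimal sketch of the behaviour as I understand it. It is not the Mahout source: the class name OrdinalStringCodes and the method encode are made up for illustration, and starting the codes at 0 is my assumption.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for the string handling described above; NOT Mahout code.
class OrdinalStringCodes {
    // Maps each distinct string to the next unused integer, in first-seen order.
    private final Map<String, Long> codes = new LinkedHashMap<>();

    // Returns the code for the value as a double, assigning a new code on first sight.
    // Assumption: codes start at 0 and increase by 1.
    double encode(String value) {
        Long code = codes.get(value);
        if (code == null) {
            code = (long) codes.size();
            codes.put(value, code);
        }
        return code.doubleValue();
    }

    public static void main(String[] args) {
        OrdinalStringCodes model = new OrdinalStringCodes();
        String[] names = {"john", "bob", "nancy", "bob"};
        for (String name : names) {
            // A repeated string ("bob") gets the code it was given the first time.
            System.out.println(name + " -> " + model.encode(name));
        }
    }
}
```

With this scheme a given string always maps to the same double, but which double it gets depends only on the order in which the strings are first seen.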
When strings (or nominals) are converted to doubles, it seems to me that the conversion adds irrelevant structure that I don't want. The assigned doubles vary with the order in which the strings are added, and strings that are adjacent in that ordering end up close together under the distance measure. For example, if "john" is 1, "bob" is 2, and "nancy" is 3, then john is closer to bob than to nancy. For nominals, that seems wrong. Most users will probably want three binary attributes instead: one for john, one for bob, and one for nancy.

Am I correct that representing nominals and strings as doubles (in a single attribute) introduces distracting structure (distance relations)? Maybe I'm missing something.

What I may want is a separate attribute for each possible value of each component of the URL (counting from the left). Attributes component1_1 through component1_k would be binary attributes representing the k possible values of the first component of the URL. Similarly for component2_1, ...

Weka has its own utility class for converting string attributes to nominal attributes. That might give me what I want for path-based data, but I'd need to preprocess the data.

For URLs I have additional structure: an ordering on the URL components. But if I just wanted to represent a document as an unordered bag of words, then each possible string or nominal value should become a separate binary attribute, and MapBackedARFFModel.java doesn't seem to do that. A compressed binary format seems like it would be useful for representing such attributes, unless you also needed a count.

Thanks,
Don

--- On Wed, 12/21/11, Grant Ingersoll <[email protected]> wrote:

From: Grant Ingersoll <[email protected]>
Subject: Re: Will "mahout arff.vector" correctly convert string attributes?
To: [email protected]
Date: Wednesday, December 21, 2011, 10:09 AM

The javadocs on ARFFVectorIterable say:

 * Attribute type handling:
 * <ul>
 * <li>Numeric -> As is</li>
 * <li>Nominal -> ordinal(value) i.e. @attribute lumber {'\'(-inf-0.5]\'','\'(0.5-inf)\''}
 * will convert -inf-0.5 -> 0, and 0.5-inf -> 1</li>
 * <li>Dates -> Convert to time as a long</li>
 * <li>Strings -> Create a map of String -> long</li>
 * </ul>

The code for this is in MapBackedARFFModel, which implements ARFFModel, so I suspect that if it doesn't do exactly what you wish, it can be overridden.

On Dec 21, 2011, at 12:37 PM, Donald A. Smith wrote:

> Weka's ARFF format allows string attributes.
>
> @ATTRIBUTE userName string
>
> Will "mahout arff.vector" correctly handle conversion from such strings to vectors in such a way that the attribute will, effectively, be treated the same as a nominal attribute? That is, will the set of strings be converted into a set of nominal attributes (one for each possible string value)?
>
> @ATTRIBUTE userName {bob, fred, harry, jill, betsy, george, bill}
>
> In general, will I lose any information by using arff.vector?
>
> For date attributes, will mahout insert derived attributes (hour of day, day of week)? I presume not and I presume I have to add them myself.
>
> Thanks,
> Don

--------------------------------------------
Grant Ingersoll
http://www.lucidimagination.com
