On Dec 23, 2011, at 6:21 PM, Donald A. Smith wrote:

> More on conversion from ARFF files:
> 
> Looking at the code in MapBackedARFFModel.java (below), each string in the 
> document is assigned a separate double (converted from an integer value).  
> Nominals are treated similarly: each possible nominal/symbolic value is 
> assigned an integer-valued double. 
> 
> When strings (or nominals) are converted to doubles, it seems to me that the 
> conversion adds additional irrelevant structure that I don't want.   
> Depending on the order in which the strings are added, the assigned doubles 
> will vary. Adjacent strings in the ordering will be close together in the 
> metric space/distance measure. For example, if "john" is 1, "bob" is 2, and 
> "nancy" is 3, then john is closer to bob than to nancy. For nominals, that 
> seems wrong. Most users will probably really want three binary attributes: 
> one for john, one for bob, and one for nancy.
> 

We could perhaps use the SGD vector encoding stuff here?  
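To make the distance concern concrete, here is a minimal sketch in plain Java. It does not use Mahout's encoders or the ARFF classes; the class and method names (NominalEncodingSketch, ordinalCodes, oneHot) are made up for illustration. It compares the ordinal codes described above with one binary attribute per value.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class NominalEncodingSketch {

    // Ordinal encoding: each distinct value gets the next integer (as a double),
    // in the order the values are first seen.
    public static Map<String, Double> ordinalCodes(List<String> values) {
        Map<String, Double> codes = new LinkedHashMap<>();
        for (String v : values) {
            codes.putIfAbsent(v, (double) (codes.size() + 1));
        }
        return codes;
    }

    // One-hot encoding: one binary attribute per value in the vocabulary.
    public static double[] oneHot(List<String> vocabulary, String value) {
        double[] v = new double[vocabulary.size()];
        int idx = vocabulary.indexOf(value);
        if (idx >= 0) {
            v[idx] = 1.0;
        }
        return v;
    }

    // Plain Euclidean distance between two equal-length vectors.
    public static double euclidean(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList("john", "bob", "nancy");
        Map<String, Double> codes = ordinalCodes(names);

        // Ordinal codes: john is closer to bob (1) than to nancy (2).
        System.out.println(Math.abs(codes.get("john") - codes.get("bob")));
        System.out.println(Math.abs(codes.get("john") - codes.get("nancy")));

        // One-hot vectors: every pair of distinct names is the same distance apart.
        System.out.println(euclidean(oneHot(names, "john"), oneHot(names, "bob")));
        System.out.println(euclidean(oneHot(names, "john"), oneHot(names, "nancy")));
    }
}
```

With the ordinal codes the distances depend on insertion order; with the one-hot vectors all distinct values are equidistant, which is usually what is wanted for nominals.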

> Am I correct that representing nominals and strings as doubles (in a single 
> attribute) introduces distracting structure (distance relations)?  Maybe I'm 
> missing something.
> 
> What I may want is to create a different attribute for each possible value of 
> each component of the URL (counting from the left). Attributes component1_1 
> through component1_k would be binary attributes representing the k possible 
> values in the first component of the URL, and similarly for component2_1, ... 
> Weka has its own utility class for converting string attributes to nominal 
> attributes. That might give me what I want for path-based data. I'd need to 
> preprocess the data.

Or implement your own ARFFModel.
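As a rough idea of what the preprocessing could look like, here is a sketch in plain Java. It is not written against the ARFFModel interface; the class name UrlComponentFeatures and the "componentN=value" naming are assumptions for illustration (they stand in for the component1_1 ... component1_k numbering). It lists the binary attributes that would be switched on for one URL path.

```java
import java.util.LinkedHashSet;
import java.util.Set;

public class UrlComponentFeatures {

    // Returns the names of the binary attributes that are "on" for this path.
    // Each attribute pairs a position (counting from the left) with the value
    // found at that position, so the same string in two positions yields two
    // different attributes.
    public static Set<String> features(String path) {
        Set<String> on = new LinkedHashSet<>();
        int position = 0;
        for (String part : path.split("/")) {
            if (part.isEmpty()) {
                continue; // skip the empty piece before a leading slash
            }
            position++;
            on.add("component" + position + "=" + part);
        }
        return on;
    }

    public static void main(String[] args) {
        // The path here is a made-up example.
        System.out.println(features("/docs/api/index.html"));
    }
}
```

Each distinct attribute name would then get its own column in the vector, so no ordering is imposed between different values at the same position.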

> 
> For URLs I have additional structure: ordering on the URL components. But if 
> I just wanted to represent a document as an unordered bag-of-words, then each 
> possible string or nominal should become a separate binary attribute. 
> MapBackedARFFModel.java doesn't seem to do the right thing.

We can patch this if you have an alternate implementation.
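For reference, the behaviour being asked for could be sketched like this in plain Java. This is only an illustration, not a proposed patch and not the ARFFModel API; the class name BinaryBagOfWords and its methods are hypothetical. Each distinct token gets its own column, and a document is stored as a set of presence bits with no counts.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

public class BinaryBagOfWords {

    // Maps each distinct token to its own column index.
    private final Map<String, Integer> dictionary = new HashMap<>();

    // Returns the column for a token, assigning the next free index on first sight.
    public int indexOf(String token) {
        Integer idx = dictionary.get(token);
        if (idx == null) {
            idx = dictionary.size();
            dictionary.put(token, idx);
        }
        return idx;
    }

    // Encodes a whitespace-separated document as presence/absence bits.
    // Repeated tokens set the same bit, so counts are deliberately dropped.
    public BitSet encode(String document) {
        BitSet bits = new BitSet();
        for (String token : document.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                bits.set(indexOf(token));
            }
        }
        return bits;
    }
}
```

The column index here is only a label: two documents are compared by which bits they share, not by how close the indices are. If counts matter, a map from column index to count would replace the BitSet.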

> 
> Seems like a compressed binary format would be useful for representing such 
> attributes, unless you also needed a count.
> 
>  Thanks, Don
> 
> --- On Wed, 12/21/11, Grant Ingersoll <[email protected]> wrote:
> 
> 
>     From: Grant Ingersoll <[email protected]>
>     Subject: Re: Will "mahout arff.vector" correctly convert string attributes?
>     To: [email protected]
>     Date: Wednesday, December 21, 2011, 10:09 AM
> 
>     The javadocs on ARFFVectorIterable say:
>     * Attribute type handling:
>     * <ul>
>     * <li>Numeric -> As is</li>
>     * <li>Nominal -> ordinal(value) i.e. @attribute lumber {'\'(-inf-0.5]\'','\'(0.5-inf)\''}
>     * will convert -inf-0.5 -> 0, and 0.5-inf -> 1</li>
>     * <li>Dates -> Convert to time as a long</li>
>     * <li>Strings -> Create a map of String -> long</li>
>     * </ul>
> 
>     The code for this is in MapBackedARFFModel, which implements ARFFModel,
>     so I suspect that if it doesn't do exactly as you wish, it can be overridden.
> 
>     On Dec 21, 2011, at 12:37 PM, Donald A. Smith wrote:
> 
>     > Weka's ARFF format allows string attributes.
>     >
>     >   @ATTRIBUTE userName string
>     >
>     > Will "mahout arff.vector" correctly handle conversion from such strings
>     > to vectors in such a way that the attribute will, effectively, be
>     > treated the same as a nominal attribute? That is, will the set of
>     > strings be converted into a set of nominal attributes (one for each
>     > possible string value)?
>     >
>     >   @ATTRIBUTE userName {bob, fred, harry, jill, betsy, george, bill}
>     >
>     > In general, will I lose any information by using arff.vector?
>     >
>     > For date attributes, will mahout insert derived attributes (hour of
>     > day, day of week)? I presume not and I presume I have to add them myself.
>     >
>     >  Thanks, Don
> 
>     --------------------------------------------
>     Grant Ingersoll
>     http://www.lucidimagination.com
> 
> 
> 
> 

--------------------------------------------
Grant Ingersoll
http://www.lucidimagination.com


