Mahout should do fine on these data.  Most of the clustering algorithms,
however, need more than just a distance function because they need to do
something like build a centroid.  I will insert inline comments about how
you can get a space that has roughly the properties you need for clustering.

Also, can you say a bit about how much data you have? If it is modest in
size, then you might do well using a more interactive tool like R.  If you
have more than millions of items, then Mahout's scalable implementations
start to pay off.

On Fri, Dec 16, 2011 at 4:43 PM, Donald A. Smith <[email protected]> wrote:

> Mahout's examples for clustering involve documents and bags of words.  I
> want to cluster items that include tree-structured and temporal attributes.
>
> The tree structured attributes are similar to Java package names (such as
> org.apache.hadoop.hdfs.security.token.delegation). Two packages should
> be considered close together if they share (long) prefixes.
>

OK.  The easy way to handle this is to consider each such attribute to
actually be a bag of all of the prefixes of the attribute.  Thus,
java.util.Random would be translated to the bag containing

java.
java.util.
java.util.Random


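Something like the following would do the expansion.  This is just a
sketch in plain Java; the class and method names are mine:

    import java.util.ArrayList;
    import java.util.List;

    public class Prefixes {
      // Expand "java.util.Random" into the bag
      // ["java.", "java.util.", "java.util.Random"].
      public static List<String> prefixBag(String attribute) {
        List<String> bag = new ArrayList<String>();
        int dot = attribute.indexOf('.');
        while (dot >= 0) {
          bag.add(attribute.substring(0, dot + 1));   // keep the trailing dot
          dot = attribute.indexOf('.', dot + 1);
        }
        bag.add(attribute);                           // the full name itself
        return bag;
      }
    }
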
I would also weight each of these prefixes inversely by how common it is.
That would give java. very low weight and com.tdunning. very high weight.
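
That weighting might look like this (adding to the sketch above; docFreq
and totalDocs would come from a counting pass over your own data):

    import java.util.Map;

    // Inverse document frequency: prefixes that occur in nearly every
    // item (like "java.") get weights near zero.
    public static double idfWeight(String prefix,
                                   Map<String, Integer> docFreq,
                                   int totalDocs) {
      Integer df = docFreq.get(prefix);
      return Math.log((double) totalDocs / (df == null ? 1 : df));
    }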

> The temporal attributes have a periodic structure:    Tuesday, Dec 12 at
> 2:13 PM   is close to  Tuesday, Dec 19 at 2:18 PM because they're both
> on Tuesdays, they're both on workdays, and they're both around 2:15 PM.
>

These can be encoded as multiple categorical variables:

    day-of-week
    workdayQ
    hour-of-day-round-down
    hour-of-day-nearest

I would typically add some continuous cyclic time variables as well (see
the sketch after these):

    sin(2 * pi * timeOfDay / lengthOfDay)
    cos(2 * pi * timeOfDay / lengthOfDay)
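
Concretely, pulling all of these out of a timestamp might look something
like this (plain java.util.Calendar; the names are mine, and I am using
seconds since midnight for timeOfDay):

    import java.util.Calendar;

    public class TimeFeatures {
      public static void extract(long timestampMillis) {
        Calendar cal = Calendar.getInstance();
        cal.setTimeInMillis(timestampMillis);

        int dayOfWeek = cal.get(Calendar.DAY_OF_WEEK);         // categorical
        boolean workday = dayOfWeek >= Calendar.MONDAY
            && dayOfWeek <= Calendar.FRIDAY;                   // workdayQ
        int hourFloor = cal.get(Calendar.HOUR_OF_DAY);         // round down
        int minute = cal.get(Calendar.MINUTE);
        int hourNearest = (hourFloor + (minute >= 30 ? 1 : 0)) % 24;

        // continuous cyclic pair: one full period per day
        double timeOfDay = hourFloor * 3600.0 + minute * 60
            + cal.get(Calendar.SECOND);
        double lengthOfDay = 24 * 3600.0;
        double timeSin = Math.sin(2 * Math.PI * timeOfDay / lengthOfDay);
        double timeCos = Math.cos(2 * Math.PI * timeOfDay / lengthOfDay);
        // hand all of these to the vector encoder (see below)
      }
    }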

This leaves you with one text-like variable (bagOfPrefixes), four
categorical variables, and two continuous variables.  This is reasonably
close to what you suggested except that I would use prefix bags rather than
component bags and I would recommend weighting the components of the bags
by inverse document frequency.  Another difference is the use of continuous
time quadrature variables.

If you can encode these as a vector, then you can simply depend on
Euclidean distance for clustering.  The hashed feature encoders in Mahout
should fit the bill pretty nicely.  I would try encoded vector sizes of
100, 1000, and 10,000 and pick the smallest that works well for you.
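
A sketch of that encoding, using the encoders in
org.apache.mahout.vectorizer.encoders (check the javadoc for the exact
overloads; the prefix weights here are the idf weights from above):

    import java.util.Map;
    import org.apache.mahout.math.RandomAccessSparseVector;
    import org.apache.mahout.math.Vector;
    import org.apache.mahout.vectorizer.encoders.ContinuousValueEncoder;
    import org.apache.mahout.vectorizer.encoders.StaticWordValueEncoder;

    public class ItemEncoder {
      // Hash every feature into one sparse vector.  Try cardinalities
      // of 100, 1000 and 10,000 and keep the smallest that works.
      public static Vector encode(Map<String, Double> weightedPrefixes,
                                  String dayOfWeek,
                                  double timeSin, double timeCos) {
        Vector v = new RandomAccessSparseVector(1000);

        StaticWordValueEncoder prefixes = new StaticWordValueEncoder("prefix");
        for (Map.Entry<String, Double> e : weightedPrefixes.entrySet()) {
          prefixes.addToVector(e.getKey(), e.getValue(), v);
        }

        StaticWordValueEncoder day = new StaticWordValueEncoder("day-of-week");
        day.addToVector(dayOfWeek, 1, v);

        ContinuousValueEncoder sin = new ContinuousValueEncoder("time-sin");
        sin.addToVector(Double.toString(timeSin), 1, v);
        ContinuousValueEncoder cos = new ContinuousValueEncoder("time-cos");
        cos.addToVector(Double.toString(timeCos), 1, v);
        return v;
      }
    }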

After encoding these items, try clustering a bit and examine the clusters.
For each variable, experiment a bit by adjusting that variable's scaling
downward until you start to see clusters that have too much of a mixture
of that variable.  This will help you avoid the situation where one
coordinate dominates all the others.
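
With the hashed encoders, the natural place to put that per-variable
scale is the weight argument.  Continuing the sketch above (the 0.5 is
made up; it is the knob you would tune):

    // fold a tunable per-variable scale into the encoder weight
    double dayScale = 0.5;
    day.addToVector(dayOfWeek, dayScale, v);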

Another nice way to experiment with these data is to pour them into a Solr
instance.  Then do more-like-this queries to see if the nearby elements are
like what you want.  You may need to use a function query to get the
continuous time variables to work, or you may be able to fake out the
geospatial search capability by combining the two time variables into a
longitude (which wraps around the way time of day does) and setting the
latitude to a constant value (how you do that depends on how geospatial
search is broken ... all geospatial searches are broken in different
ways).  With Solr you wouldn't need to weight the prefix terms ... it
would do the right thing for you.
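
For the more-like-this part, the standard MoreLikeThis parameters should
be enough.  Something like this, where prefixBag is whatever field name
you use for the prefix terms:

    http://localhost:8983/solr/select?q=id:1234
        &mlt=true&mlt.fl=prefixBag&mlt.mintf=1&mlt.mindf=1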
