Thanks for the suggestions, Ted. I want to cluster HTTP requests (including web site visits). Each one has a URL, a time, and some header fields. Assume there will be hundreds of millions of requests per week, so I can't store them all; I have to store prototypes or examples. Does Mahout have support for that?
If each prefix of a URL is turned into a "word", and if the request becomes a bag of such words, plus a time and some header fields, then one problem I see is that there will be millions of columns per vector: one for each possible URL prefix: google, google.com, news.google.com, amazon, amazon.com, yahoo, yahoo.com, mail.yahoo.com, etc., etc. In other words, clustering structured data like URLs doesn't seem to fit the use case of clustering documents. But maybe the "hashed feature encoders" that you mentioned solve this problem. Could you please say more about how they work?

You suggest using different weights for the various fields and items, depending on how important/common they are. Yes, but it seems that the clustering algorithm should derive those weights itself, especially for the numerous URL prefixes.

Thanks

--- On Fri, 12/16/11, Ted Dunning <[email protected]> wrote:

From: Ted Dunning <[email protected]>
Subject: Re: Clustering with path-structured attributes and periodic temporal attributes
To: [email protected]
Cc: [email protected]
Date: Friday, December 16, 2011, 5:22 PM

Mahout should do fine on these data. Most of the clustering algorithms, however, need more than just a distance function because they need to do something like build a centroid. I will insert inline comments about how you can get a space that has roughly the properties you need for clustering.

Also, can you say a bit about how much data you have? If it is modest in size, then you might do well using a more interactive tool like R. If you have more than millions of items, then Mahout makes more sense.

On Fri, Dec 16, 2011 at 4:43 PM, Donald A. Smith <[email protected]> wrote:

> Mahout's examples for clustering involve documents and bags of words. I
> want to cluster items that include tree-structured and temporal attributes.
>
> The tree-structured attributes are similar to Java package names (such as
> org.apache.hadoop.hdfs.security.token.delegation). Two packages should
> be considered close together if they share (long) prefixes.

OK. The easy way to handle this is to consider each such attribute to actually be a bag of all of the prefixes of the attribute. Thus, java.math.Random would be translated to the bag containing

  java.
  java.math.
  java.math.Random

I would also weight each of these by how common it is. That would give java. very low weight and com.tdunning. very high weight.

> The temporal attributes have a periodic structure: Tuesday, Dec 12 at
> 2:13 PM is close to Tuesday, Dec 19 at 2:18 PM because they're both
> on Tuesdays, they're both on workdays, and they're both around 2:00 PM.

These can be encoded as multiple categorical variables:

  day-of-week
  workdayQ
  hour-of-day-round-down
  hour-of-day-nearest

I would typically add some continuous cyclic time variables as well:

  sin(2 * pi * timeOfDay / lengthOfDay)
  cos(2 * pi * timeOfDay / lengthOfDay)

This leaves you with one text-like variable (bagOfPrefixes), four categorical variables, and two continuous variables. This is reasonably close to what you suggested except that I would use prefix bags rather than component bags, and I would recommend weighting the components of the bags by inverse document frequency. Another difference is the use of continuous time quadrature variables.

If you can encode these as a vector, then you can simply depend on the Euclidean distance for clustering. The hashed feature encoders in Mahout should fit the bill pretty nicely. I would try encoded vector sizes of 100, 1,000 and 10,000 and pick the smallest that works well for you.
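To make the hashed encoders concrete, here is a rough, untested sketch of what encoding a single request might look like. The encoder classes (StaticWordValueEncoder, ContinuousValueEncoder, RandomAccessSparseVector) are the ones Mahout ships in org.apache.mahout.vectorizer.encoders and org.apache.mahout.math; the prefixes() and idfWeight() helpers are placeholders you would replace with your own logic. The point is that every feature is hashed into a fixed-width vector, so you never have to build the millions-of-columns dictionary that worries you:

  import java.util.ArrayList;
  import java.util.List;

  import org.apache.mahout.math.RandomAccessSparseVector;
  import org.apache.mahout.math.Vector;
  import org.apache.mahout.vectorizer.encoders.ContinuousValueEncoder;
  import org.apache.mahout.vectorizer.encoders.StaticWordValueEncoder;

  public class RequestEncoder {
    private static final int CARDINALITY = 1000;  // try 100, 1,000 and 10,000

    private final StaticWordValueEncoder prefixEncoder = new StaticWordValueEncoder("urlPrefix");
    private final StaticWordValueEncoder dayEncoder = new StaticWordValueEncoder("dayOfWeek");
    private final StaticWordValueEncoder workdayEncoder = new StaticWordValueEncoder("workdayQ");
    private final StaticWordValueEncoder hourEncoder = new StaticWordValueEncoder("hourOfDay");
    private final ContinuousValueEncoder sinEncoder = new ContinuousValueEncoder("sinTime");
    private final ContinuousValueEncoder cosEncoder = new ContinuousValueEncoder("cosTime");

    /** Encode one request into a fixed-width vector; no global dictionary is needed. */
    public Vector encode(String host, String dayOfWeek, boolean workday,
                         int hourOfDay, double secondsIntoDay) {
      Vector v = new RandomAccessSparseVector(CARDINALITY);

      // URL prefix bag, weighted so that very common prefixes count for little
      for (String prefix : prefixes(host)) {
        prefixEncoder.addToVector(prefix, idfWeight(prefix), v);
      }

      // categorical time variables
      dayEncoder.addToVector(dayOfWeek, v);
      workdayEncoder.addToVector(Boolean.toString(workday), v);
      hourEncoder.addToVector(Integer.toString(hourOfDay), v);

      // continuous cyclic time variables
      double angle = 2 * Math.PI * secondsIntoDay / 86400.0;
      sinEncoder.addToVector(Double.toString(Math.sin(angle)), v);
      cosEncoder.addToVector(Double.toString(Math.cos(angle)), v);

      return v;
    }

    /** news.google.com -> [com., com.google., com.google.news.] (host read right to left). */
    private static List<String> prefixes(String host) {
      String[] parts = host.split("\\.");
      List<String> result = new ArrayList<String>();
      StringBuilder prefix = new StringBuilder();
      for (int i = parts.length - 1; i >= 0; i--) {
        prefix.append(parts[i]).append('.');
        result.add(prefix.toString());
      }
      return result;
    }

    /** Placeholder: plug in something like -log(df/N) from counts you keep yourself. */
    private static double idfWeight(String prefix) {
      return 1.0;
    }
  }

With something like this you can re-run the clustering at different vector sizes just by changing CARDINALITY, and the Euclidean distance between two such vectors approximates the weighted similarity described above.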
After encoding these items, try clustering a bit and examine the clusters. For each variable, experiment by adjusting its scaling downward until you see clusters that have too much of a mixture of that variable. This will help you avoid the situation where one coordinate dominates all the others.

Another nice way to experiment with these data is to pour them into a Solr instance and then run more-like-this queries to see whether the nearby elements are like what you want. You may need to use a function query to get the continuous time variables to work, or you may be able to fake out the geospatial search capability by using the two variables to define longitude and setting the latitude to a constant value (how you do that depends on how geospatial search is broken ... all geospatial searches are broken in different ways). With Solr you wouldn't need to weight the prefix terms ... it would do the right thing for you.
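If you do try the geospatial trick, the mapping can be as simple as putting time-of-day on the equator so the great-circle distance wraps around midnight the way you want. A small sketch using SolrJ; the time_location field name and the "lat,lon" string format are just assumptions about how the schema would be set up:

  import org.apache.solr.common.SolrInputDocument;

  public class TimeAsGeo {
    /**
     * Map time-of-day onto a point on the equator so that geodesic distance
     * behaves like cyclic time distance (23:59 ends up next to 00:01).
     * Assumes a field "time_location" declared with a lat,lon field type.
     */
    public static void addTimeLocation(SolrInputDocument doc, double secondsIntoDay) {
      double fractionOfDay = secondsIntoDay / 86400.0;   // 0.0 .. 1.0
      double longitude = fractionOfDay * 360.0 - 180.0;  // -180 .. +180 degrees
      double latitude = 0.0;                             // constant, as suggested above
      doc.addField("time_location", latitude + "," + longitude);
    }
  }

Whether the wrap at the 180/-180 seam is handled correctly depends on which geospatial implementation you end up with, which is part of why the faking is fragile.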
