Thanks for the suggestions, Ted. I want to cluster HTTP requests (including web site visits). Each one has a URL, a time, and some header fields. Assume there will be hundreds of millions of requests per week, so I can't store them all; I have to store prototypes or examples. Does Mahout have support for that?
If each prefix of a URL is turned into a "word", and if the request becomes a bag of such words, plus a time and some header fields, then one problem I see is that there will be millions of columns per vector: one for each possible URL prefix: google, google.com, news.google.com, amazon, amazon.com, yahoo, yahoo.com, mail.yahoo.com, etc., etc. In other words, clustering structured data like URLs doesn't seem to fit the use case of clustering documents. But maybe the "hashed feature encoders" that you mentioned solve this problem. Could you please say more about how they work?

You suggest using different weights for the various fields and items, depending on how important/common they are. Yes, but it seems that the clustering algorithm should derive those weights itself, especially for the numerous URL prefixes.

Thanks

--- On Fri, 12/16/11, Ted Dunning <[email protected]> wrote:

From: Ted Dunning <[email protected]>
Subject: Re: Clustering with path-structured attributes and periodic temporal attributes
To: [email protected]
Cc: [email protected]
Date: Friday, December 16, 2011, 5:22 PM

Mahout should do fine on these data. Most of the clustering algorithms, however, need more than just a distance function because they need to do something like build a centroid. I will insert inline comments about how you can get a space that has roughly the properties you need for clustering.

Also, can you say a bit about how much data you have? If it is modest in size, then you might do well using a more interactive tool like R. If you have more than millions of items, then Mahout makes more sense.

On Fri, Dec 16, 2011 at 4:43 PM, Donald A. Smith <[email protected]> wrote:

> Mahout's examples for clustering involve documents and bags of words. I
> want to cluster items that include tree-structured and temporal attributes.
>
> The tree-structured attributes are similar to Java package names (such as
> org.apache.hadoop.hdfs.security.token.delegation). Two packages should
> be considered close together if they share (long) prefixes.

OK. The easy way to handle this is to consider each such attribute to actually be a bag of all of the prefixes of the attribute. Thus, java.math.Random would be translated to the bag containing

  java.
  java.math.
  java.math.Random

I would also weight each of these by how common it is. That would give java. very low weight and com.tdunning. very high weight.

> The temporal attributes have a periodic structure: Tuesday, Dec 12 at
> 2:13 PM is close to Tuesday, Dec 19 at 2:18 PM because they're both
> on Tuesdays, they're both on workdays, and they're both around 2:00 PM.

These can be encoded as multiple categorical variables:

  day-of-week
  workdayQ
  hour-of-day-round-down
  hour-of-day-nearest

I would typically add some continuous cyclic time variables as well:

  sin(2 * pi * timeOfDay / lengthOfDay)
  cos(2 * pi * timeOfDay / lengthOfDay)

This leaves you with one text-like variable (bagOfPrefixes), four categorical variables, and two continuous variables. This is reasonably close to what you suggested except that I would use prefix bags rather than component bags, and I would recommend weighting the components of the bags by inverse document frequency. Another difference is the use of continuous time quadrature variables.

If you can encode these as a vector, then you can simply depend on the Euclidean distance for clustering. The hashed feature encoders in Mahout should fit the bill pretty nicely. I would try encoded vector sizes of 100, 1,000 and 10,000 and pick the smallest that works well for you.
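To make the hashed encoders concrete, here is a rough, untested sketch of what encoding a single request might look like. The encoder classes (StaticWordValueEncoder, ContinuousValueEncoder, RandomAccessSparseVector) are the ones Mahout ships in org.apache.mahout.vectorizer.encoders and org.apache.mahout.math; the prefixes() and idfWeight() helpers are placeholders you would replace with your own logic. The point is that every feature is hashed into a fixed-width vector, so you never have to build the millions-of-columns dictionary that worries you:

  import java.util.ArrayList;
  import java.util.List;

  import org.apache.mahout.math.RandomAccessSparseVector;
  import org.apache.mahout.math.Vector;
  import org.apache.mahout.vectorizer.encoders.ContinuousValueEncoder;
  import org.apache.mahout.vectorizer.encoders.StaticWordValueEncoder;

  public class RequestEncoder {
    private static final int CARDINALITY = 1000;  // try 100, 1,000 and 10,000

    private final StaticWordValueEncoder prefixEncoder = new StaticWordValueEncoder("urlPrefix");
    private final StaticWordValueEncoder dayEncoder = new StaticWordValueEncoder("dayOfWeek");
    private final StaticWordValueEncoder workdayEncoder = new StaticWordValueEncoder("workdayQ");
    private final StaticWordValueEncoder hourEncoder = new StaticWordValueEncoder("hourOfDay");
    private final ContinuousValueEncoder sinEncoder = new ContinuousValueEncoder("sinTime");
    private final ContinuousValueEncoder cosEncoder = new ContinuousValueEncoder("cosTime");

    /** Encode one request into a fixed-width vector; no global dictionary is needed. */
    public Vector encode(String host, String dayOfWeek, boolean workday,
                         int hourOfDay, double secondsIntoDay) {
      Vector v = new RandomAccessSparseVector(CARDINALITY);

      // URL prefix bag, weighted so that very common prefixes count for little
      for (String prefix : prefixes(host)) {
        prefixEncoder.addToVector(prefix, idfWeight(prefix), v);
      }

      // categorical time variables
      dayEncoder.addToVector(dayOfWeek, v);
      workdayEncoder.addToVector(Boolean.toString(workday), v);
      hourEncoder.addToVector(Integer.toString(hourOfDay), v);

      // continuous cyclic time variables
      double angle = 2 * Math.PI * secondsIntoDay / 86400.0;
      sinEncoder.addToVector(Double.toString(Math.sin(angle)), v);
      cosEncoder.addToVector(Double.toString(Math.cos(angle)), v);

      return v;
    }

    /** news.google.com -> [com., com.google., com.google.news.] (host read right to left). */
    private static List<String> prefixes(String host) {
      String[] parts = host.split("\\.");
      List<String> result = new ArrayList<String>();
      StringBuilder prefix = new StringBuilder();
      for (int i = parts.length - 1; i >= 0; i--) {
        prefix.append(parts[i]).append('.');
        result.add(prefix.toString());
      }
      return result;
    }

    /** Placeholder: plug in something like -log(df/N) from counts you keep yourself. */
    private static double idfWeight(String prefix) {
      return 1.0;
    }
  }

With something like this you can re-run the clustering at different vector sizes just by changing CARDINALITY, and the Euclidean distance between two such vectors approximates the weighted similarity described above.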
After encoding these items, try clustering a bit and examine the clusters. For each variable, experiment by adjusting its scaling downward until you see clusters that have too much of a mixture of that variable. This will help you avoid the situation where one coordinate dominates all the others.

Another nice way to experiment with these data is to pour them into a Solr instance and then run more-like-this queries to see whether the nearby elements are like what you want. You may need to use a function query to get the continuous time variables to work, or you may be able to fake out the geospatial search capability by using the two variables to define longitude and setting the latitude to a constant value (how you do that depends on how geospatial search is broken ... all geospatial searches are broken in different ways). With Solr you wouldn't need to weight the prefix terms ... it would do the right thing for you.
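If you do try the geospatial trick, the mapping can be as simple as putting time-of-day on the equator so the great-circle distance wraps around midnight the way you want. A small sketch using SolrJ; the time_location field name and the "lat,lon" string format are just assumptions about how the schema would be set up:

  import org.apache.solr.common.SolrInputDocument;

  public class TimeAsGeo {
    /**
     * Map time-of-day onto a point on the equator so that geodesic distance
     * behaves like cyclic time distance (23:59 ends up next to 00:01).
     * Assumes a field "time_location" declared with a lat,lon field type.
     */
    public static void addTimeLocation(SolrInputDocument doc, double secondsIntoDay) {
      double fractionOfDay = secondsIntoDay / 86400.0;   // 0.0 .. 1.0
      double longitude = fractionOfDay * 360.0 - 180.0;  // -180 .. +180 degrees
      double latitude = 0.0;                             // constant, as suggested above
      doc.addField("time_location", latitude + "," + longitude);
    }
  }

Whether the wrap at the 180/-180 seam is handled correctly depends on which geospatial implementation you end up with, which is part of why the faking is fragile.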
