On Sun, Dec 18, 2011 at 7:26 AM, Donald A. Smith <[email protected]> wrote:
> I want to cluster HTTP requests (including web site visits). Each one has a URL, a time, and some header fields. Assume there will be hundreds of millions of requests per week, so I can't store them all; I have to store prototypes or examples. Does Mahout have support for that?

This just means you need to down-sample.

> If each prefix of a URL is turned into a "word", and if the request becomes a bag of such words, plus a time and some header fields, then one problem I see is that there will be millions of columns per vector: one for each possible URL prefix: google, google.com, news.google.com, amazon, amazon.com, yahoo, yahoo.com, mail.yahoo.com, etc, etc. In other words, clustering structured data like URLs doesn't seem to fit into the use case of clustering documents.
>
> But maybe the "hashed feature encoders" that you mentioned solve this problem. Could you please say more about how they work?

See chapters 14 and 15 of Mahout in Action. See also:
https://builds.apache.org/job/Mahout-Quality/javadoc/org/apache/mahout/vectorizer/encoders/StaticWordValueEncoder.html
https://cwiki.apache.org/MAHOUT/logistic-regression.html

> You suggest using different weights for the various fields and items, depending on how important/common they are. Yes, but it seems that the clustering algorithm should derive those weights itself, especially for the numerous URL prefixes.
>
> Thanks
>
> --- On Fri, 12/16/11, Ted Dunning <[email protected]> wrote:
>
> From: Ted Dunning <[email protected]>
> Subject: Re: Clustering with path-structured attributes and periodic temporal attributes
> To: [email protected]
> Cc: [email protected]
> Date: Friday, December 16, 2011, 5:22 PM
>
> Mahout should do fine on these data. Most of the clustering algorithms, however, need more than just a distance function because they need to do something like build a centroid. I will insert inline comments about how you can get a space that has roughly the properties you need for clustering.
>
> Also, can you say a bit about how much data you have? If it is modest in size, then you might do well using a more interactive tool like R. If you have more than millions of items, Mahout is a better fit.
>
> On Fri, Dec 16, 2011 at 4:43 PM, Donald A. Smith <[email protected]> wrote:
>
> > Mahout's examples for clustering involve documents and bags of words. I want to cluster items that include tree-structured and temporal attributes.
> >
> > The tree-structured attributes are similar to Java package names (such as org.apache.hadoop.hdfs.security.token.delegation). Two packages should be considered close together if they share (long) prefixes.
>
> OK. The easy way to handle this is to consider each such attribute to actually be a bag of all of the prefixes of the attribute. Thus, java.math.Random would be translated to the bag containing
>
> java.
> java.math.
> java.math.Random
>
> I would also weight each of these by how common it is. That would give java. very low weight and com.tdunning. very high weight.
>
> > The temporal attributes have a periodic structure: Tuesday, Dec 12 at 2:13 PM is close to Tuesday, Dec 19 at 2:18 PM because they're both on Tuesdays, they're both on workdays, and they're both around 2:15 PM.
> These can be encoded as multiple categorical variables:
>
> day-of-week
> workdayQ
> hour-of-day-round-down
> hour-of-day-nearest
>
> I would typically add some continuous cyclic time variables as well:
>
> sin(pi * timeOfDay / lengthOfDay)
> cos(pi * timeOfDay / lengthOfDay)
>
> This leaves you with one text-like variable (bagOfPrefixes), four categorical variables, and two continuous variables. This is reasonably close to what you suggested, except that I would use prefix bags rather than component bags and I would recommend weighting the components of the bags by inverse document frequency. Another difference is the use of continuous time quadrature variables.
>
> If you can encode these as a vector, then you can simply depend on the Euclidean distance for clustering. The hashed feature encoders in Mahout should fill this bill pretty nicely. I would try encoded vector sizes of 100, 1000 and 10,000 and pick the smallest that works well for you.
>
> After encoding these items, try clustering a bit and examine the clusters. For each variable, you should experiment a bit by adjusting the scaling on each variable downward until you see clusters that have too much of a mixture of that variable. This will help you avoid the situation where one coordinate dominates all the others.
>
> Another nice way to experiment with these data is to pour them into a Solr instance. Then do more-like-this queries to see if the nearby elements are like what you want. You may need to use the function query to get the continuous time variables to work, or you may be able to fake out the geospatial search capability by using the two variables to define longitude and set the latitude to a constant value (how you do that depends on how geospatial search is broken ... all geospatial searches are broken in different ways). With Solr you wouldn't need to weight the prefix terms ... it would do the right thing for you.
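[Editor's sketch] A rough, illustrative sketch of the prefix-bag plus hashed-encoding idea discussed above. It assumes Mahout's StaticWordValueEncoder from the javadoc linked earlier, with the addToVector(term, weight, vector) call; the idf map, the feature names, and the vector size of 1000 are placeholders, not anything prescribed in the thread.

import java.util.ArrayList;
import java.util.List;
import java.util.Map;

import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.vectorizer.encoders.StaticWordValueEncoder;

public class RequestEncoder {

  // Encoded (hashed) vector size; try 100, 1000 and 10,000 as suggested above.
  private static final int CARDINALITY = 1000;

  private final StaticWordValueEncoder prefixEncoder =
      new StaticWordValueEncoder("url-prefix");
  private final StaticWordValueEncoder dayEncoder =
      new StaticWordValueEncoder("day-of-week");

  // Expand "news.google.com" into the bag {com, google.com, news.google.com}.
  // Host names nest right to left, so prefixes are built from the last component back.
  static List<String> hostPrefixes(String host) {
    String[] parts = host.split("\\.");
    List<String> prefixes = new ArrayList<String>();
    String suffix = "";
    for (int i = parts.length - 1; i >= 0; i--) {
      suffix = suffix.isEmpty() ? parts[i] : parts[i] + "." + suffix;
      prefixes.add(suffix);
    }
    return prefixes;
  }

  // idf maps each prefix to a precomputed inverse-document-frequency weight, so
  // that very common prefixes such as "com" contribute almost nothing to distance.
  Vector encode(String host, String dayOfWeek, Map<String, Double> idf) {
    Vector v = new RandomAccessSparseVector(CARDINALITY);
    for (String prefix : hostPrefixes(host)) {
      Double w = idf.get(prefix);
      prefixEncoder.addToVector(prefix, w == null ? 1.0 : w, v);
    }
    dayEncoder.addToVector(dayOfWeek, 1.0, v);
    return v;
  }
}

The weight argument to addToVector is also the natural place to apply the per-variable downward scaling suggested in the quoted message.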

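[Editor's sketch] A small self-contained sketch of the temporal features, covering the four categorical variables listed above plus the continuous quadrature pair, using java.time. The string forms of the categorical values are made up for illustration, and the continuous pair here uses the full 2*pi period so that times just before and just after midnight encode to nearly the same point.

import java.time.DayOfWeek;
import java.time.LocalDateTime;

public class TimeFeatures {

  private static final double SECONDS_PER_DAY = 24 * 60 * 60;

  // Continuous cyclic pair: with the full 2*pi period, 23:59 and 00:01 land close together.
  static double[] timeOfDayQuadrature(LocalDateTime t) {
    double angle = 2 * Math.PI * t.toLocalTime().toSecondOfDay() / SECONDS_PER_DAY;
    return new double[] { Math.sin(angle), Math.cos(angle) };
  }

  // Categorical variables from the list in the quoted message.
  static String dayOfWeek(LocalDateTime t) {
    return t.getDayOfWeek().toString();   // e.g. "TUESDAY"
  }

  static String workdayQ(LocalDateTime t) {
    DayOfWeek d = t.getDayOfWeek();
    return (d == DayOfWeek.SATURDAY || d == DayOfWeek.SUNDAY) ? "weekend" : "workday";
  }

  static String hourOfDayRoundDown(LocalDateTime t) {
    return "hour-" + t.getHour();
  }

  static String hourOfDayNearest(LocalDateTime t) {
    return "hour-" + ((t.getHour() + (t.getMinute() >= 30 ? 1 : 0)) % 24);
  }
}

Each categorical string could then be pushed through its own StaticWordValueEncoder in the same way as the URL prefixes in the previous sketch.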