On Sun, Dec 18, 2011 at 7:26 AM, Donald A. Smith <[email protected]> wrote:

>
> I want to cluster HTTP requests (including web site visits). Each one has
> a URL, a time, and some header fields.  Assume there will be hundreds of
> millions of requests per week, so I can't store them all; I have to store
> prototypes or examples.  Does Mahout have support for that?
>

This just means you need to down-sample.
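
For example, you can keep a fixed fraction of the stream by hashing a stable
request key, so you never hold more than the sample.  A minimal sketch (plain
Java, nothing Mahout-specific; keepRequest is just my name for it):

    // Sketch: deterministic ~0.1% sample of the request stream.  The same
    // request key always gets the same keep/drop decision.
    static boolean keepRequest(String requestKey) {
        int buckets = 1000;                               // keep roughly 1 in 1000
        return Math.floorMod(requestKey.hashCode(), buckets) == 0;
    }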


> If each prefix of a URL is turned into a "word", and if the request
> becomes a bag of such words, plus a time and some header fields, then one
> problem I see is that there will be millions of columns per vector:  one
> for each possible URL prefix:   google, google.com, news.google.com,
> amazon, amazon.com, yahoo, yahoo.com, mail.yahoo.com, etc, etc.  In other
> words, clustering structured data like URLs doesn't seem to fit into the
> use case of clustering documents.
>
> But maybe the "hashed feature encoders" that you mentioned solve this
> problem. Could you please say more how they work?
>

See chapters 14 and 15 of Mahout in Action.

See also

https://builds.apache.org/job/Mahout-Quality/javadoc/org/apache/mahout/vectorizer/encoders/StaticWordValueEncoder.html

https://cwiki.apache.org/MAHOUT/logistic-regression.html
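
The short version of how they work: instead of giving every distinct word
(here, every URL prefix) its own column, the encoder hashes the variable name
and value into one or a few positions of a vector whose size you fix up front,
so millions of distinct prefixes still land in, say, 1000 dimensions.
Collisions happen, but with a couple of probes per feature they rarely matter.
A stripped-down sketch of the idea (this is not Mahout's actual
implementation, just an illustration, and addWord is my own name):

    // Conceptual sketch of feature hashing: every distinct value maps into a
    // vector of fixed size, so dimensionality never grows with the vocabulary.
    static void addWord(double[] vector, String variableName, String word, double weight) {
        int probes = 2;                                   // a few probes soften collisions
        for (int p = 0; p < probes; p++) {
            int h = (variableName + ":" + word + "#" + p).hashCode();
            vector[Math.floorMod(h, vector.length)] += weight;
        }
    }

A 1000-dimensional vector built this way easily absorbs the google,
google.com, news.google.com, ... vocabulary without you ever enumerating it.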




>
> You suggest using different weights for the various fields and items,
> depending on how important/common they are.  Yes, but it seems that the
> clustering algorithm should derive those weights itself, especially for the
> numerous URL prefixes.
>
>   Thanks
>
> --- On Fri, 12/16/11, Ted Dunning <[email protected]> wrote:
>
> From: Ted Dunning <[email protected]>
> Subject: Re: Clustering with path-structured attributes and periodic
> temporal attributes
> To: [email protected]
> Cc: [email protected]
> Date: Friday, December 16, 2011, 5:22 PM
>
> Mahout should do fine on these data.  Most of the clustering algorithms,
> however, need more than just a distance function because they need to do
> something like build a centroid.  I will insert inline comments about how
> you can get a space that has roughly the properties you need for clustering.
>
> Also, can you say a bit about how much data you have? If it is modest in
> size, then you might do well using a more interactive tool like R.  If you
> have more than millions of items, then Mahout's scalable algorithms are a
> better fit.
>
> On Fri, Dec 16, 2011 at 4:43 PM, Donald A. Smith <[email protected]
> > wrote:
>
> > Mahout's examples for clustering involve documents and bags of words.  I
> > want to cluster items that include tree-structured and temporal
> attributes.
> >
> > The tree structured attributes are similar to Java package names (such as
> > org.apache.hadoop.hdfs.security.token.delegation). Two packages should
> > be considered close together if they share (long) prefixes.
> >
>
> OK.  The easy way to handle this is to consider each such attribute to
> actually be a bag of all of the prefixes of the attribute.  Thus,
> java.math.Random would be translated to the bag containing
>
> java.
> java.math.
> java.math.Random
>
>
> I would also weight each of these by how common it is.  That would give
> java. very low weight and com.tdunning. very high weight.
>
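
To make that concrete, here is a small sketch of the prefix expansion
(prefixBag is just my name for it, not a Mahout class):

    import java.util.ArrayList;
    import java.util.List;

    // Sketch: expand a dotted name into the bag of its prefixes, e.g.
    // prefixBag("java.math.Random") -> ["java.", "java.math.", "java.math.Random"]
    static List<String> prefixBag(String name) {
        List<String> bag = new ArrayList<>();
        int dot = -1;
        while ((dot = name.indexOf('.', dot + 1)) >= 0) {
            bag.add(name.substring(0, dot + 1));
        }
        bag.add(name);
        return bag;
    }

For the weights, an inverse-document-frequency style value such as
log(totalItems / itemsContainingPrefix), computed over a sample of the data,
gives the common prefixes low weight and the rare ones high weight.
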
> > The temporal attributes have a periodic structure:  Tuesday, Dec 12 at
> > 2:13 PM is close to Tuesday, Dec 19 at 2:18 PM because they're both
> > on Tuesdays, they're both on workdays, and they're both around 2:15 PM.
> >
>
> These can be encoded as multiple categorical variables:
>
>     day-of-week
>     workdayQ
>     hour-of-day-round-down
>     hour-of-day-nearest
>
> I would typically add some continuous cyclic time variables as well:
>
>     sin(2 * pi * timeOfDay / lengthOfDay)
>     cos(2 * pi * timeOfDay / lengthOfDay)
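
In code, those six time features might be computed like this (a sketch using
java.time; I've used the full period 2 * pi so that times just before and just
after midnight land close together):

    import java.time.DayOfWeek;
    import java.time.LocalDateTime;

    // Sketch: the four categorical time features plus the two cyclic ones
    // for a single timestamp.
    static void timeFeatures(LocalDateTime t) {
        DayOfWeek dayOfWeek = t.getDayOfWeek();                        // day-of-week
        boolean workday = dayOfWeek != DayOfWeek.SATURDAY
            && dayOfWeek != DayOfWeek.SUNDAY;                          // workdayQ
        int hourFloor = t.getHour();                                   // hour-of-day-round-down
        int hourNearest = (t.getHour() + (t.getMinute() >= 30 ? 1 : 0)) % 24;  // hour-of-day-nearest

        double secondOfDay = t.toLocalTime().toSecondOfDay();
        double lengthOfDay = 24 * 3600;
        double timeSin = Math.sin(2 * Math.PI * secondOfDay / lengthOfDay);
        double timeCos = Math.cos(2 * Math.PI * secondOfDay / lengthOfDay);
        // feed the categorical values into word-style encoders and the sin/cos
        // pair into continuous encoders when building the vector (see below)
    }
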
>
> This leaves you with one text-like variable (bagOfPrefixes), four
> categorical variables, and two continuous variables.  This is reasonably
> close to what you suggested except that I would use prefix bags rather than
> component bags and I would recommend weighting the components of the bags
> by inverse document frequency.  Another difference is the use of continuous
> time quadrature variables.
>
> If you can encode these as a vector, then you can simply depend on the
> Euclidean distance for clustering.  The hashed feature encoders in Mahout
> should fill this bill pretty nicely.  I would try encoded vector sizes of
> 100, 1000 and 10,000 and pick the smallest that works well for you.
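
Putting the pieces together with the encoders might look roughly like the
sketch below.  This assumes the encoder API as described in Mahout in Action
ch. 14 (StaticWordValueEncoder, ContinuousValueEncoder,
RandomAccessSparseVector); check the exact signatures against your Mahout
version.  The prefix weights are placeholders for the idf-style weights
discussed above, and encodeRequest is my own name:

    import java.util.List;
    import java.util.Map;

    import org.apache.mahout.math.RandomAccessSparseVector;
    import org.apache.mahout.math.Vector;
    import org.apache.mahout.vectorizer.encoders.ContinuousValueEncoder;
    import org.apache.mahout.vectorizer.encoders.StaticWordValueEncoder;

    // Sketch: hash all of the variables for one item into a single
    // 1000-dimensional vector; those vectors can then be clustered with
    // plain Euclidean distance.
    static Vector encodeRequest(List<String> prefixes,           // the prefix bag
                                Map<String, Double> prefixIdf,   // idf-style weights
                                String dayOfWeek, boolean workday,
                                int hourFloor, int hourNearest,
                                double timeSin, double timeCos) {
        Vector v = new RandomAccessSparseVector(1000);

        StaticWordValueEncoder prefixEncoder = new StaticWordValueEncoder("urlPrefix");
        for (String p : prefixes) {
            prefixEncoder.addToVector(p, prefixIdf.getOrDefault(p, 1.0), v);
        }

        new StaticWordValueEncoder("day-of-week").addToVector(dayOfWeek, v);
        new StaticWordValueEncoder("workdayQ").addToVector(Boolean.toString(workday), v);
        new StaticWordValueEncoder("hour-floor").addToVector(Integer.toString(hourFloor), v);
        new StaticWordValueEncoder("hour-nearest").addToVector(Integer.toString(hourNearest), v);

        // For continuous variables a null "original form" lets the weight act as
        // the value; the position depends only on the variable name.
        new ContinuousValueEncoder("timeSin").addToVector((String) null, timeSin, v);
        new ContinuousValueEncoder("timeCos").addToVector((String) null, timeCos, v);

        return v;
    }

Changing the 1000 to 100 or 10,000 is the only thing you need to touch to try
the other encoded vector sizes, and the weight argument on addToVector is also
where the per-variable scaling discussed below would go.
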
>
> After encoding these items, try clustering a bit and examine the clusters.
>  For each variable, experiment by adjusting its scaling downward until you
> see clusters that have too much of a mixture of that variable.  This will
> help you avoid the situation where one coordinate dominates all the others.
>
> Another nice way to experiment with these data is to pour them into a Solr
> instance.  Then do more-like-this queries to see if the nearby elements are
> like what you want.  You may need to use a function query to get the
> continuous time variables to work, or you may be able to fake out the
> geospatial search capability by using the two variables to define longitude
> and setting the latitude to a constant value (how you do that depends on how
> geospatial search is broken ... all geospatial searches are broken in
> different ways).  With Solr you wouldn't need to weight the prefix terms
> ... it would do the right thing for you.
>
