Mahout's examples for clustering involve documents and bags of words. I want to cluster items that include tree-structured and temporal attributes.
The tree-structured attributes are similar to Java package names (such as org.apache.hadoop.hdfs.security.token.delegation). Two packages should be considered close together if they share a long prefix. The temporal attributes have a periodic structure: Tuesday, Dec 12 at 2:13 PM is close to Tuesday, Dec 19 at 2:18 PM because they're both on Tuesdays, they're both on workdays, and they're both around 2:15 PM.

Is Mahout the right tool for clustering such items? I was thinking that I could convert package paths to a sequence of symbolic attributes, one for each position, but that would seem to lose information. I could also add derived attributes for times: day of week, time of day, etc.

I can easily define a distance function on items. Can Mahout cluster based on a distance function alone?

Thanks,
Don
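
P.S. To make the question concrete, here is a rough sketch of the kind of distance function I have in mind. Everything in it is a placeholder of my own (the class name ItemDistance, the method names, the normalization to [0, 1], and the equal weighting of the three time components); none of it is Mahout API, and the weights would certainly need tuning.

```java
import java.time.LocalDateTime;

// Hypothetical helper, not part of Mahout: distances for package-like paths
// and for periodic time attributes, each scaled to the range [0, 1].
class ItemDistance {

    // Tree distance: 1 - (shared prefix length / longer path length).
    // 0 means identical paths, 1 means no shared leading segment.
    static double packageDistance(String a, String b) {
        String[] pa = a.split("\\.");
        String[] pb = b.split("\\.");
        int limit = Math.min(pa.length, pb.length);
        int shared = 0;
        while (shared < limit && pa[shared].equals(pb[shared])) {
            shared++;
        }
        return 1.0 - (double) shared / Math.max(pa.length, pb.length);
    }

    // Distance between two positions on a cycle of the given period,
    // measured the short way around and divided by half the period.
    static double cyclicDistance(double x, double y, double period) {
        double d = Math.abs(x - y) % period;
        return Math.min(d, period - d) / (period / 2.0);
    }

    // Assumed combination: plain average of time-of-day, day-of-week,
    // and a 0/1 penalty when one time is a workday and the other is not.
    static double timeDistance(LocalDateTime s, LocalDateTime t) {
        double minutesS = s.getHour() * 60 + s.getMinute();
        double minutesT = t.getHour() * 60 + t.getMinute();
        double timeOfDay = cyclicDistance(minutesS, minutesT, 24 * 60);

        int dayS = s.getDayOfWeek().getValue(); // Monday = 1 ... Sunday = 7
        int dayT = t.getDayOfWeek().getValue();
        double dayOfWeek = cyclicDistance(dayS, dayT, 7);

        boolean workdayS = dayS <= 5;
        boolean workdayT = dayT <= 5;
        double workday = (workdayS == workdayT) ? 0.0 : 1.0;

        return (timeOfDay + dayOfWeek + workday) / 3.0;
    }
}
```

With this sketch, org.apache.hadoop.hdfs.security.token.delegation and a sibling package under org.apache.hadoop.hdfs.security.token come out much closer to each other than either is to a package with a different root, and two Tuesday-afternoon timestamps a week apart come out nearly identical. What I don't know is whether Mahout can use a pairwise function like this directly.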
