So this isn't really categorical data. But that is good news.
You can still use the binary representation, and there is a good chance that these data will cluster reasonably, especially with spectral techniques. What I would recommend, however, is co-occurrence analysis, which might give you a better view of things.

On Mon, May 6, 2013 at 11:20 AM, Florents Tselai <[email protected]> wrote:

> I'm working on Market Basket Analysis.
> The "small" data set consists of 40000 transactions (baskets) and 35
> categories, while the large data set is about 30 million baskets and
> 400 categories.
>
> On Mon, May 6, 2013 at 9:17 PM, Ted Dunning <[email protected]> wrote:
>
> > It really depends on your data, but anything that works on text has at
> > least a potential for working on categorical data.
> >
> > It is common to use a 1-of-n encoding for categorical data and then
> > simply use Euclidean distance with something like k-means.
> >
> > Can you say something about how many variables and how many categories
> > the variables have?
> >
> > On Mon, May 6, 2013 at 9:49 AM, Florents Tselai <[email protected]> wrote:
> >
> > > Hello,
> > >
> > > Are there any suggestions on which Mahout algorithms to use
> > > for clustering categorical data?
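
To make the two ideas in the reply at the top concrete, here is a minimal sketch in plain Python (not Mahout code) of the binary basket representation and of simple pairwise co-occurrence counting. The toy baskets, the category names, and the helper names `to_binary` and `cooccurrence_counts` are all hypothetical and chosen only for illustration. A real co-occurrence analysis would normally go on to score these raw counts for significance, which this sketch does not attempt.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy data: each basket is the set of categories bought together.
baskets = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"beer", "chips"},
    {"bread", "butter"},
]

# Fixed column order for the binary representation.
categories = sorted(set().union(*baskets))


def to_binary(basket):
    """Encode one basket as a 0/1 vector over `categories`."""
    return [1 if c in basket else 0 for c in categories]


def cooccurrence_counts(all_baskets):
    """Count, for each unordered category pair, the baskets containing both."""
    counts = Counter()
    for basket in all_baskets:
        # Sorting makes each pair appear in one canonical order.
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return counts


# One binary row per basket; these rows are what a clusterer would consume.
binary_rows = [to_binary(b) for b in baskets]

# Raw pair counts; these are the input to a co-occurrence analysis.
pair_counts = cooccurrence_counts(baskets)
```

With 35 or 400 categories the binary rows stay small, and the pair table has at most a few tens of thousands of entries, so the counting step is dominated by the number of baskets rather than the number of categories.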
