So this isn't really categorical data.

But that is good news.

You can still use the binary representation and there is a good possibility
that these data will cluster reasonably, especially with spectral
techniques.
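
As a rough sketch of the binary representation mentioned above: each basket becomes one row with a 0/1 column per category. The baskets and category names here are made up for illustration; they are not from the actual data.

```python
# Hypothetical toy data: each basket is the set of categories bought together.
baskets = [
    {"bread", "milk"},
    {"bread", "beer", "chips"},
    {"milk", "chips"},
]

# Fix a column order so every basket maps to a vector of the same length.
categories = sorted(set().union(*baskets))

# One row per basket, one 0/1 column per category.
rows = [[1 if c in basket else 0 for c in categories] for basket in baskets]

for basket, row in zip(baskets, rows):
    print(sorted(basket), row)
```

With 35 or 400 categories the rows are mostly zeros, so a sparse vector format would likely be a better fit than dense lists at the 30 million basket scale.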

What I would recommend, however, is cooccurrence analysis, which might give
you a better view of things.
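
To make the idea concrete, the simplest form of cooccurrence analysis is counting how often each pair of categories shows up in the same basket. This is only a sketch on invented toy baskets, using plain Python rather than any Mahout job; the variable names are illustrative.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy baskets; real data would be read from the transaction log.
baskets = [
    {"bread", "milk"},
    {"bread", "beer", "chips"},
    {"milk", "chips"},
    {"bread", "milk", "chips"},
]

# Count every unordered pair of categories that appears in the same basket.
# Sorting each basket first gives every pair one canonical (a, b) ordering.
pair_counts = Counter()
for basket in baskets:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

for pair, count in pair_counts.most_common():
    print(pair, count)
```

Raw counts favor categories that are simply popular, so a real analysis would normally score each pair against how often its two categories occur on their own.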




On Mon, May 6, 2013 at 11:20 AM, Florents Tselai <[email protected]> wrote:

> I'm working on Market Basket Analysis.
> The "small" data set consists of 40000 transactions (baskets) and 35
> categories.
> The large data set has about 30 million baskets and 400 categories.
>
>
> On Mon, May 6, 2013 at 9:17 PM, Ted Dunning <[email protected]> wrote:
>
> > It really depends on your data, but anything that works on text has at
> > least a potential for working on categorical data.
> >
> > It is common to use a 1-of-n encoding for categorical data and then
> > simply use Euclidean distance with something like k-means.
> >
> > Can you say something about how many variables and how many categories
> > the variables have?
> >
> >
> > On Mon, May 6, 2013 at 9:49 AM, Florents Tselai
> > <[email protected]> wrote:
> >
> > > Hello,
> > >
> > > Are there any suggestions on which Mahout algorithms to use
> > > for clustering categorical data?
> > >
> >
>
