Thanks! One thing I am not clear on is whether each customer review, which
might be just a few bytes, needs to be in a separate file. I am planning to
use Hadoop, so I was thinking of dumping all the raw comments into a single
SequenceFile, but I am not sure whether that would mess up TF-IDF or
anything like that. Could someone help me clarify?
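
To make the question concrete, here is a small sketch in plain Python (not
Mahout or Hadoop code). The hypothetical `records` list stands in for a
SequenceFile holding one key/value record per review, and `tfidf` is a
textbook tf * log(N/df) weighting that I made up for illustration. It
assumes the vectorizer treats each record, rather than each file, as one
document, which is the part I would like confirmed.

```python
import math
from collections import Counter

# Hypothetical stand-in for a SequenceFile: one (key, value) record per review.
records = [
    ("review-1", "battery life is great"),
    ("review-2", "battery died after a week"),
    ("review-3", "great price"),
]

def tfidf(records):
    """Return {key: {term: weight}}, treating each record as one document."""
    docs = {key: Counter(text.lower().split()) for key, text in records}
    n_docs = len(docs)
    # Document frequency: the number of records that contain each term.
    df = Counter(term for counts in docs.values() for term in counts)
    return {
        key: {t: tf * math.log(n_docs / df[t]) for t, tf in counts.items()}
        for key, counts in docs.items()
    }

# One record per review: rare terms get higher weights than common ones.
weights = tfidf(records)

# Everything concatenated into a single record: every term is in "all"
# documents, so every IDF factor is log(1) = 0.
merged = tfidf([("all", " ".join(text for _, text in records))])
```

If that assumption holds, what matters is that each review stays its own
record, not that it lives in its own file.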

On Sun, Apr 8, 2012 at 11:00 PM, Sean Owen <[email protected]> wrote:

> I think you would cluster these like any other text document. The
> centroid of each cluster tells you where the cluster is in
> feature-space, but the features are just words. If you find the
> features (words) with the largest absolute value, those ought to be the
> words that appear frequently in the cluster and indicate what it is
> "about".
>
> As to ratings, not sure how you might want to involve them?
>
> On Sun, Apr 8, 2012 at 11:44 PM, Mohit Anchlia <[email protected]>
> wrote:
> > I am new to Mahout and just going through some tutorials. One of the
> > requirements I am working on involves extracting customer reviews from
> > Amazon for a given item and then clustering those into similar topics to
> > see what in general users have been talking about. For example, reviews
> > rated above 3 might talk about good user experience or quality, while
> > reviews rated 3 or below might talk about price, bugs, etc.
> >
> > Could anyone suggest what would be the best way to approach this?
>
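
The centroid idea described in the quoted reply can be sketched in a few
lines of plain Python. This is only an illustration, not Mahout code: the
`cluster` vectors and their weights are made-up values, and `centroid_of`
and `top_terms` are hypothetical helper names.

```python
def centroid_of(vectors):
    """Average sparse {term: weight} vectors into one centroid."""
    total = {}
    for vec in vectors:
        for term, weight in vec.items():
            total[term] = total.get(term, 0.0) + weight
    return {term: weight / len(vectors) for term, weight in total.items()}

def top_terms(centroid, n=3):
    """Return the n terms with the largest absolute weight in a centroid."""
    return sorted(centroid, key=lambda t: abs(centroid[t]), reverse=True)[:n]

# Hypothetical TF-IDF-style vectors for the reviews assigned to one cluster.
cluster = [
    {"battery": 0.9, "life": 0.4, "great": 0.2},
    {"battery": 0.7, "charge": 0.5},
    {"battery": 0.8, "life": 0.6},
]

print(top_terms(centroid_of(cluster), n=2))  # ['battery', 'life']
```

The highest-weighted terms serve as a rough label for the cluster; here the
reviews are mostly about battery life.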
