Thanks! One thing I am not clear on is whether each customer review, which might be just a few bytes, needs to be in a separate file. I am planning to use Hadoop, so I was thinking of dumping all the raw comments into a single SequenceFile, but I am not sure whether that would mess up TF-IDF or anything like that. Could someone help me clarify?
On Sun, Apr 8, 2012 at 11:00 PM, Sean Owen <[email protected]> wrote:
> I think you would cluster these like any other text document. The
> centroid of each cluster tells you where the cluster is in
> feature-space, but the features are just words. If you find the
> features (words) with largest absolute value, those ought to be the
> words that appear frequently in the cluster and are what they are
> "about".
>
> As to ratings, not sure how you might want to involve them?
>
> On Sun, Apr 8, 2012 at 11:44 PM, Mohit Anchlia <[email protected]>
> wrote:
> > I am new to Mahout and just going through some tutorials. One of the
> > requirements I am working on involves extracting customer reviews from
> > Amazon for a given item and then clustering those into similar topics to
> > see what in general users have been talking about. So for eg: Rating of
> > > 3 could say user experience is good, quality or rating of <=3 could say
> > price, buggy etc.
> >
> > Could anyone suggest what would be the best way to approach this?
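
To make the centroid suggestion quoted above concrete, here is a minimal sketch in plain Python. It is not Mahout code: the vocabulary, the centroid weights, and the helper name top_terms are all made up for illustration. It only shows the idea of ranking a cluster's features (words) by absolute weight to see what the cluster is "about".

```python
def top_terms(centroid, vocabulary, k=3):
    """Return the k (term, weight) pairs with the largest absolute weight.

    centroid   -- list of feature weights, one per vocabulary term
    vocabulary -- list of terms, aligned index-for-index with centroid
    """
    # Pair each word with its weight, then sort by magnitude, largest first.
    ranked = sorted(zip(vocabulary, centroid),
                    key=lambda pair: abs(pair[1]),
                    reverse=True)
    return ranked[:k]


# Hypothetical data: a five-word vocabulary and one cluster centroid.
vocabulary = ["price", "battery", "screen", "buggy", "quality"]
centroid = [0.05, 0.62, 0.10, 0.48, 0.31]

for term, weight in top_terms(centroid, vocabulary):
    print(term, weight)
```

With real clustering output, the centroid and the term dictionary would come from the clustering job rather than being typed in by hand; the ranking step itself would stay the same.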
