I've spent a few weeks tuning Mahout to cluster news articles, and the results are decent but still not perfect. While thinking of ways to improve them, I had the idea of running Mahout on the output of Stanford's Named Entity Recognizer (NER) instead of on the articles themselves, and seeing how the two compare. Has anyone tried this? Did it generate more cohesive clusters?
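To make the idea concrete, here is a minimal sketch of the preprocessing step I have in mind. It assumes the NER output is in the inline `word/TAG` form (Stanford NER's slash-tags style, with `O` marking non-entity tokens). The function name `entities_only` and the underscore-joining of multi-word entities are my own illustrative choices, not anything Mahout or Stanford NER prescribes. Each article is reduced to its entity phrases, and those reduced documents would then be vectorized and clustered in place of the raw text.

```python
from itertools import groupby


def entities_only(slash_tagged: str, background: str = "O") -> str:
    """Reduce slash-tagged NER output to a string of entity phrases.

    Assumes tokens look like "word/TAG" and that `background` is the
    tag used for non-entity tokens (hypothetical helper, for illustration).
    """
    pairs = []
    for item in slash_tagged.split():
        # Split on the LAST slash so words containing "/" stay intact.
        word, _, tag = item.rpartition("/")
        if word:  # skip items that carry no tag at all
            pairs.append((word, tag))

    phrases = []
    # Consecutive tokens with the same tag are treated as one entity.
    for tag, group in groupby(pairs, key=lambda pair: pair[1]):
        if tag != background:
            phrases.append("_".join(word for word, _ in group))
    return " ".join(phrases)


if __name__ == "__main__":
    sample = "Barack/PERSON Obama/PERSON visited/O Paris/LOCATION on/O Monday/O"
    print(entities_only(sample))
```

One known limitation of this sketch: two different entities of the same type that sit directly next to each other are merged into a single phrase, since the slash-tag form carries no entity boundaries.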
- Clustering raw articles vs clustering (Stanford's) NER output David Noel
