Just a hint: if you're using Solr/Lucene, you should (probably) also disable field norms (e.g. omitNorms="true" on the field in the Solr schema), so that each category is scored equally regardless of the length of its content. You can also add boosts to individual terms at query time, so that when a document mentions "selling" more frequently you can query for: selling^1.5 payment^0.5, etc.
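A rough sketch of how such a boosted query string could be built from a document's term counts. The function name boosted_query and the weighting (a term's count divided by the mean count of all matched terms) are assumptions made for illustration, not something Solr prescribes; any other weighting would work just as well.

```python
import re
from collections import Counter


def boosted_query(text, vocabulary):
    """Build a Lucene/Solr-style query string with per-term boosts.

    Hypothetical helper. Assumption: a term's boost is its count in
    `text` divided by the mean count over all matched vocabulary terms.
    `vocabulary` is expected to hold lowercase words.
    """
    # Lowercase and keep alphanumeric runs only, so no query-syntax
    # characters can leak into the generated query.
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    counts = Counter(t for t in tokens if t in vocabulary)
    if not counts:
        return ""
    mean = sum(counts.values()) / len(counts)
    # most_common() lists the most frequent terms first.
    return " ".join(
        f"{term}^{count / mean:.2f}" for term, count in counts.most_common()
    )


if __name__ == "__main__":
    vocab = {"selling", "buying", "sales", "payment"}
    print(boosted_query("Selling, selling, selling! Payment due.", vocab))
```

With "selling" occurring three times and "payment" once, the mean count is 2, so this yields the same boosts as in the example query above (1.5 and 0.5, printed with two decimals).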
There are virtually no limits on how you can score/boost terms; experiment all you like :)

Dawid

On Fri, Oct 11, 2013 at 4:42 PM, Jens Bonerz <[email protected]> wrote:
> what a nice idea :-) really like that approach
>
>
> 2013/10/11 Ted Dunning <[email protected]>
>
>> You don't need Mahout for this.
>>
>> A very easy way to do this is to gather all the words for each category
>> into a document. Thus:
>>
>> CatA:selling buying sales payment
>> CatB:gathering collecting
>> CatC:information data info
>>
>> Then put these into a text retrieval engine so that you have one document
>> per category.
>>
>> When you get a new document to categorize, just use the document as a query
>> and you will get a list of possible categories back. Make sure you set the
>> default query mode to OR for this.
>>
>> See http://wiki.apache.org/solr/SolrQuerySyntax for more on the syntax.
>>
>>
>>
>> On Fri, Oct 11, 2013 at 5:04 AM, Kasi Subrahmanyam
>> <[email protected]> wrote:
>>
>> > Hi,
>> >
>> > I have a problem that I would like to implement in Mahout clustering.
>> >
>> > I have input text documents with data like below.
>> >
>> > Document1: This is the first document of selling information.
>> > Document2: This is the second document of gathering information.
>> >
>> > I also have another lookup file with data like below
>> > selling:CatA
>> > gathering:CatB.
>> > information:CatC
>> >
>> > Now I would like to cluster the documents, with the output being generated as
>> > Document1:CatA,CatC
>> > Document2:CatB,CatC
>> >
>> > Please let me know how to achieve this.
>> >
>> > Thanks,
>> > Subbu
>> >
>>
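
For anyone who wants to try the idea from the quoted thread before setting up a search engine, here is a toy in-memory stand-in. It is only a sketch: the function name categorize is made up, and the score is a plain count of shared words rather than Lucene's actual scoring. The category documents and test sentences are the ones from the thread.

```python
import re

# One "document" per category, copied from the quoted message.
CATEGORIES = {
    "CatA": "selling buying sales payment",
    "CatB": "gathering collecting",
    "CatC": "information data info",
}


def categorize(text, categories=CATEGORIES):
    """Return the categories sharing at least one word with `text`.

    Hypothetical helper. Results are ordered by number of shared words
    (highest first), with ties broken by category name.
    """
    words = set(re.findall(r"[a-z0-9]+", text.lower()))
    scores = {}
    for name, terms in categories.items():
        overlap = len(words & set(terms.split()))
        if overlap:  # OR semantics: a single shared term is enough
            scores[name] = overlap
    return sorted(scores, key=lambda name: (-scores[name], name))


if __name__ == "__main__":
    print(categorize("This is the first document of selling information."))
    print(categorize("This is the second document of gathering information."))
```

A real engine adds tokenization, stemming and proper relevance scoring on top of this, which is where the norms and boosts mentioned above come in.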
