LDA is not going to easily capture hundreds of sensible topics from 8000 documents. It can typically resolve a number of topics in the range of a constant times a logarithmic function of the number of unique terms in the corpus. If you try it with 500 topics, I guarantee you'll find very weird "topics" like ["bob", "dave", "fred", ...], ["blue", "magenta", "orange", ...], ["7am", "12:30", "4pm", "midnight", ...], ["hi", "hello", "salutations", "greetings", "whattup", ...].
But go ahead and try it with 50, 100, 200, 300, 400, and 500 topics, and see what they look like. I doubt you'll have much use for the topics once you get past 200 or so. In general, there is a principled way to do this: look at the held-out perplexity as a function of numTopics, and stop when it plateaus. The "D" in LDA (the Dirichlet prior) means that this will happen at a much lower range than 500 or so. If you want many more topics, you need a different prior, but that's way out of scope for this thread if you're trying to do this "out of the box".

On Wed, Jun 12, 2013 at 11:02 AM, Ankur Desai -X (ankurdes - SATYAM COMPUTER SERVICES LIMITED at Cisco) <[email protected]> wrote:

> Hi Ted,
>
> My assumption is that there are lot of concepts (keywords/tags for the
> document) usually present in a single document and in 8K documents, you
> might find many unique concepts. We have also done some analysis by
> manually going over about 100 documents and have identified more than 50
> concepts.
>
> Thanks,
> Ankur
>
> -----Original Message-----
> From: Ted Dunning [mailto:[email protected]]
> Sent: Wednesday, June 12, 2013 10:57 AM
> To: [email protected]
> Subject: Re: Mahout CVB parameters
>
> Why does document concept require such a large K?
>
> On Wed, Jun 12, 2013 at 7:08 PM, Ankur Desai -X (ankurdes - SATYAM
> COMPUTER SERVICES LIMITED at Cisco) <[email protected]> wrote:
>
> > Thanks Jake for your response. I am trying to get concepts out of the
> > documents and for this I want the K to be large around 500. I will
> > run CVB based on your suggestions and see what I get. Appreciate your
> > prompt response.
> >
> > -Ankur
> >
> > -----Original Message-----
> > From: Jake Mannix [mailto:[email protected]]
> > Sent: Wednesday, June 12, 2013 9:22 AM
> > To: [email protected]
> > Subject: Re: Mahout CVB parameters
> >
> > What is the number of terms in your dictionary, after tokenization and
> > vectorization? Typically, for english, you'll get reasonable topics
> > with anywhere from 20-200 topics, tending toward the lower end if
> > you've not got very many documents (like in your case) 20 topics will
> > yield very generic things, 100 is pretty nice, a lot of the time, but
> > 200 or more can lead to really niche things (I've found things like
> > getting one topic to be basically all female first names, for example).
> >
> > Maximum # of iterations I'd say that 20-30 tends to always be enough,
> > but while you're running it, it should be spitting out the perplexity
> > as it goes (you can tell it to calculate this every N iterations, and
> > set N to 1 to check after each iteration, while you're trying to see
> > how it goes).
> > When this perplexity plateaus, you're done. But in practice, I've
> > never needed more than 30 iterations (less the larger your corpus is).
> >
> > As for the smoothing parameters, we really should have an
> > implementation of one of the various ways of finding it automatically,
> > but for now, doing a grid search over values in the range of 0.001 to
> > 0.1 while first testing things out tends to be helpful (so try (alpha,
> > beta) = {0.001, 0.01, 0.1} x {0.001, 0.01, 0.1})
> >
> > Hope that helps.
> >
> > On Wed, Jun 12, 2013 at 9:01 AM, Ankur Desai -X (ankurdes - SATYAM
> > COMPUTER SERVICES LIMITED at Cisco) <[email protected]> wrote:
> >
> > > Hi,
> > >
> > > I am using mahout CVB to generate topics from about 8K documents. I
> > > am struggling to determine what are some of the best parameters
> > > values to use?
> > > Please help, if you know best way to determine the parameter values
> > > like topic and term smoothing, max number of iterations, or total
> > > number of topics to generate.
> > >
> > > Thanks,
> > > Ankur
> >
> > --
> >
> > -jake

--

-jake
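
P.S. For anyone who wants to script the "stop when held-out perplexity plateaus" rule and the (alpha, beta) grid from this thread, here is a rough sketch. It is not part of Mahout: pick_num_topics, smoothing_grid, and the 1% rel_tol threshold are hypothetical names and an assumed cutoff, and the perplexity numbers have to come from your own runs.

```python
from itertools import product


def pick_num_topics(perplexity_by_k, rel_tol=0.01):
    """Return the K at which held-out perplexity stops improving.

    perplexity_by_k maps a candidate number of topics to the held-out
    perplexity you measured for it (lower is better). rel_tol is an
    assumed threshold: a relative improvement below it counts as a plateau.
    """
    ks = sorted(perplexity_by_k)
    if not ks:
        raise ValueError("need at least one (numTopics, perplexity) pair")
    for prev_k, next_k in zip(ks, ks[1:]):
        prev_p = perplexity_by_k[prev_k]
        next_p = perplexity_by_k[next_k]
        # Relative gain from moving up to the next larger K.
        gain = (prev_p - next_p) / prev_p
        if gain < rel_tol:
            return prev_k
    # No plateau within the range tried: report the largest K.
    return ks[-1]


def smoothing_grid(values=(0.001, 0.01, 0.1)):
    """All (alpha, beta) pairs from the 3 x 3 grid suggested in the thread."""
    return list(product(values, repeat=2))
```

Run one model per candidate numTopics (and, if you like, per grid point), record the held-out perplexity each run reports, and pass the resulting dict to pick_num_topics. If it returns the largest K you tried, you have not reached the plateau yet.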
