You don't need one latent variable per concept. Concepts are not completely disjoint; mathematically, this means that the concept vectors need not be completely orthogonal. If you allow even a little non-orthogonality (say, a few degrees off a right angle), you can store literally billions of concepts in a 300-dimensional space.

As a practical example, if you pick vectors at random in a 300-dimensional space, almost all of them will be within 10 degrees of orthogonal (99.7% of them, in fact).
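A minimal numpy sketch of that claim (the seed and sample size here are arbitrary):

    # Sample random unit vectors in 300 dimensions and check how many are
    # within 10 degrees of orthogonal to a reference vector.
    import numpy as np

    rng = np.random.default_rng(0)                   # arbitrary seed
    d, n = 300, 100_000
    v = rng.standard_normal((n, d))
    v /= np.linalg.norm(v, axis=1, keepdims=True)    # project onto unit sphere

    cos = v[1:] @ v[0]                               # cosines vs. a reference
    angles = np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))
    frac = np.mean(np.abs(angles - 90.0) <= 10.0)
    print(f"{frac:.1%} within 10 degrees of orthogonal")  # prints ~99.7%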
The practical import of this is that you absolutely do not need a gazillion topics with LDA. The nomenclature is confusing here, so your interpretation is reasonable, but the "topics" in LDA are not concepts in the sense you are using; they have much more to do with dimensions or latent variables.

On Wed, Jun 12, 2013 at 8:15 PM, Jake Mannix <[email protected]> wrote:

> LDA is not going to easily capture 500 sensible topics from 8000 documents. It is typically sensitive to topics in the range of a constant times a logarithmic function of the number of unique terms in the corpus. If you try it with 500 topics, I will guarantee that you'll find very weird "topics" like ["bob", "dave", "fred", ...], ["blue", "magenta", "orange", ...], ["7am", "12:30", "4pm", "midnight", ...], ["hi", "hello", "salutations", "greetings", "whattup", ...].
>
> But go ahead and try it with 50, 100, 200, 300, 400, and 500 topics, and see what they look like. I doubt you'll have much use for the topics once you get up past 200 or so.
>
> In general, there is a principled way to do this, where you look at the held-out perplexity as a function of numTopics and stop when it plateaus. The "D" in LDA means that this will happen at a much lower range than 500 or so.
>
> If you want many more topics, you need a different prior, but that's way out of scope for this thread if you're trying to do this "out of the box".
>
>
> On Wed, Jun 12, 2013 at 11:02 AM, Ankur Desai -X (ankurdes - SATYAM COMPUTER SERVICES LIMITED at Cisco) <[email protected]> wrote:
>
> > Hi Ted,
> >
> > My assumption is that there are a lot of concepts (keywords/tags for the document) usually present in a single document, and in 8K documents you might find many unique concepts. We have also done some analysis by manually going over about 100 documents and have identified more than 50 concepts.
> >
> > Thanks,
> > Ankur
> >
> > -----Original Message-----
> > From: Ted Dunning [mailto:[email protected]]
> > Sent: Wednesday, June 12, 2013 10:57 AM
> > To: [email protected]
> > Subject: Re: Mahout CVB parameters
> >
> > Why does document concept extraction require such a large K?
> >
> > On Wed, Jun 12, 2013 at 7:08 PM, Ankur Desai -X (ankurdes - SATYAM COMPUTER SERVICES LIMITED at Cisco) <[email protected]> wrote:
> >
> > > Thanks, Jake, for your response. I am trying to get concepts out of the documents, and for this I want K to be large, around 500. I will run CVB based on your suggestions and see what I get. Appreciate your prompt response.
> > >
> > > -Ankur
> > >
> > > -----Original Message-----
> > > From: Jake Mannix [mailto:[email protected]]
> > > Sent: Wednesday, June 12, 2013 9:22 AM
> > > To: [email protected]
> > > Subject: Re: Mahout CVB parameters
> > >
> > > What is the number of terms in your dictionary, after tokenization and vectorization? Typically, for English, you'll get reasonable topics with anywhere from 20-200 topics, tending toward the lower end if you haven't got very many documents (as in your case). 20 topics will yield very generic things; 100 is pretty nice a lot of the time, but 200 or more can lead to really niche things (I've found things like one topic being basically all female first names, for example).
> > > Maximum # of iterations: I'd say that 20-30 tends to always be enough, but while you're running it, it should be spitting out the perplexity as it goes (you can tell it to calculate this every N iterations; set N to 1 to check after each iteration while you're trying to see how it goes). When this perplexity plateaus, you're done. But in practice, I've never needed more than 30 iterations (fewer, the larger your corpus is).
> > >
> > > As for the smoothing parameters, we really should have an implementation of one of the various ways of finding them automatically, but for now, doing a grid search over values in the range of 0.001 to 0.1 while first testing things out tends to be helpful (so try (alpha, beta) = {0.001, 0.01, 0.1} x {0.001, 0.01, 0.1}).
> > >
> > > Hope that helps.
> > >
> > > On Wed, Jun 12, 2013 at 9:01 AM, Ankur Desai -X (ankurdes - SATYAM COMPUTER SERVICES LIMITED at Cisco) <[email protected]> wrote:
> > >
> > > > Hi,
> > > >
> > > > I am using Mahout CVB to generate topics from about 8K documents. I am struggling to determine the best parameter values to use. Please help if you know the best way to determine parameter values like topic and term smoothing, max number of iterations, or total number of topics to generate.
> > > >
> > > > Thanks,
> > > > Ankur
> > >
> > > --
> > > -jake
>
> --
> -jake
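Pulling together Jake's advice above (sweep numTopics, grid-search the smoothing parameters, and watch held-out perplexity until it plateaus), here is a rough sketch of that tuning loop. It is not Mahout code: gensim's LdaModel stands in for CVB (gensim's eta plays the role of the term smoothing beta), and the synthetic corpus is a placeholder for real tokenized documents.

    # Sketch of the tuning loop: sweep numTopics, watch held-out perplexity.
    # gensim's LdaModel stands in for Mahout CVB; the corpus is synthetic.
    import numpy as np
    from gensim.corpora import Dictionary
    from gensim.models import LdaModel

    rng = np.random.default_rng(0)
    vocab = [f"w{i}" for i in range(2000)]                 # fake vocabulary
    docs = [[str(w) for w in rng.choice(vocab, size=100)]  # fake tokenized docs
            for _ in range(1000)]

    dictionary = Dictionary(docs)
    bow = [dictionary.doc2bow(doc) for doc in docs]
    train, heldout = bow[:900], bow[900:]                  # hold out 10%

    for k in (50, 100, 200, 300, 400, 500):
        lda = LdaModel(corpus=train, id2word=dictionary, num_topics=k,
                       alpha=0.01, eta=0.01,  # smoothing; grid-search 0.001-0.1
                       passes=30,             # 20-30 passes usually suffice
                       random_state=0)
        # log_perplexity returns a per-word bound in base 2
        perplexity = np.exp2(-lda.log_perplexity(heldout))
        print(f"k={k:3d}  held-out perplexity={perplexity:10.1f}")
    # Pick the smallest k at which the held-out perplexity stops improving;
    # repeat with (alpha, eta) in {0.001, 0.01, 0.1} x {0.001, 0.01, 0.1}.

With a uniform random corpus like this one the perplexity curve is flat by construction; on real documents you should see it drop and then level off as k grows.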
