Another correction: CVB = *collapsed* variational Bayes.

On Fri, Nov 9, 2012 at 1:37 PM, Dmitriy Lyubimov <[email protected]> wrote:

> Correction: with LSA you probably want to use rows of U or U*sqrt(Sigma)
> (ssvd --uHalfSigma option), not U*Sigma.
>
>
> On Fri, Nov 9, 2012 at 1:34 PM, Dmitriy Lyubimov <[email protected]> wrote:
>
>> No, SGD (stochastic gradient descent) and factorization are two different
>> things. More strictly, those are two different classes of problems --
>> factorization and regression. SGD is one implementation for regression
>> classification. Factorization is finding virtual factors in a user/item
>> space (ALS-WR is one of the methods to find such factors).
>>
>> Yes, SGD is in the book, but not with your example specifically, since I
>> meant to apply it after you find latent variables (factors, whatever).
>>
>> You will get more help on the ALS-WR method by staying on the list, and
>> also perhaps create an archive entry for others to follow in a similar
>> situation. The idea is that we all learn together and effectively :) (and
>> I score more points for support :)
>>
>> CVB (if I am not totally off) is something called continuous variational
>> Bayes implementation of LDA (Latent Dirichlet Allocation), which may help
>> you to analyze the content of your web pages IF you manage to grab the
>> text off of them. In Mahout, it is facilitated by a package here:
>> https://builds.apache.org/job/Mahout-Quality/javadoc/org/apache/mahout/clustering/lda/cvb/package-summary.html
>> I don't know where exactly the wiki help on CVB is, but searching the
>> Mahout archive and Stack Overflow may help. Again, by staying on the list
>> you may get more help with that.
>>
>> LSA (latent semantic analysis) is another way to analyze the content of
>> your web pages. See the Wikipedia article for a refresher, but basically
>> it is a run of SVD over tf-idf of unigrams, bigrams etc. Mahout has a
>> general pipeline to prepare that context data with the seqdirectory and
>> seq2sparse commands (again, you can find details in the book). Then you
>> just run 'mahout ssvd <options>' on the output of seq2sparse and use rows
>> of the U*Sigma output for the topical allocation values. Somebody will
>> probably correct me on this, but I think you can use topical allocation
>> values to further build your classification with regressions (SGD).
>>
>> -d
>>
>>
>> On Fri, Nov 9, 2012 at 1:11 PM, qiaoresearcher <[email protected]> wrote:
>>
>>> Hi Dmitriy,
>>>
>>> Many thanks for your comments. I really appreciate them, although I
>>> think I may not have fully understood you.
>>>
>>> As I understand it, SGD means stochastic gradient descent, is that
>>> right? What I need now is some example code to: read the files,
>>> construct the web page set, then form the vectors. Such steps are
>>> called 'factorization' in Mahout, right?
>>>
>>> Do you mean Mahout in Action has examples similar to what I described?
>>> What are CVB and LSA, and SSVD (singular value decomposition?)
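
For anyone following this thread in the archive, the LSA recipe described above (SVD over tf-idf, then rows of U or U*sqrt(Sigma) as per-document topic values) can be sketched in a few lines of NumPy. This is only an illustration under assumptions, not Mahout code: it uses a dense in-memory SVD instead of the distributed 'mahout ssvd' job, the function names `tfidf` and `lsa_doc_vectors` are made up for this sketch, the idf formula is just one common variant, and the tiny count matrix is toy data.

```python
import numpy as np


def tfidf(counts):
    """Weight a (n_docs, n_terms) matrix of raw term counts by idf.

    Assumption: idf = log(n_docs / df) + 1, one of several common variants;
    not necessarily the weighting that seq2sparse applies.
    """
    counts = np.asarray(counts, dtype=float)
    n_docs = counts.shape[0]
    # Document frequency: in how many documents each term occurs.
    df = np.count_nonzero(counts, axis=0)
    idf = np.log(n_docs / np.maximum(df, 1)) + 1.0
    return counts * idf


def lsa_doc_vectors(matrix, k, half_sigma=True):
    """Return k-dimensional document vectors from a truncated SVD.

    half_sigma=True gives rows of U*sqrt(Sigma), the scaling the thread
    associates with the ssvd --uHalfSigma option; False gives rows of U.
    """
    u, s, _ = np.linalg.svd(matrix, full_matrices=False)
    u_k, s_k = u[:, :k], s[:k]
    if half_sigma:
        # Broadcasting scales column j of U by sqrt(sigma_j).
        return u_k * np.sqrt(s_k)
    return u_k


# Toy data: documents 0-1 share one vocabulary, documents 2-3 another.
counts = np.array([
    [2, 1, 0, 0],
    [1, 2, 0, 0],
    [0, 0, 3, 1],
    [0, 0, 1, 2],
])

doc_vectors = lsa_doc_vectors(tfidf(counts), k=2)
print(doc_vectors.shape)
```

Each row of `doc_vectors` could then serve as the feature vector for a regression-style classifier such as one trained with SGD, which is the follow-up step suggested in the thread.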
