On Tue, Nov 6, 2012 at 6:39 AM, Arni Sumarlidason <[email protected]> wrote:
> Dan,
>
> Thank you for your time, patience, and detailed response.
>
> Another question; about the results I’m receiving, I don’t understand them :(
>
> I’ve run this command:
> ./mahout cvb -i /user/root/sparse-vectors-cvb/matrix -o text_lda_sr -k 100 -x 1 -dict text_vec/dictionary.file-0 -dt text_cvb_document_sr -mt text_states_sr
> Followed by:
> ./mahout vectordump -i /user/root/text_cvb_document_sr -d text_vec/dictionary.file-0 -dt sequencefile -o lda-cvb-topics.txt
>
> I get a text file with term frequencies, but I get one line per document I
> originally created vectors from, not the 100 topics. Am I doing something
> wrong?
>

./mahout vectordump wants to take in vector files: you can give it the text
inputs you started with (text_cvb_document_sr, in your case), and you'll just
see the "bag-of-words" representation of your input docs. If you give it one
of the "model" files (in text_lda_sr), then you'll get the term distributions
for the topics.

>
> Thank you for your help,
>
>
> From: DAN HELM [mailto:[email protected]]
> Sent: Sunday, November 04, 2012 6:43 PM
> To: Arni Sumarlidason
> Cc: [email protected]
> Subject: Re: Mahout: CVB: Error
>
> Arni,
>
> I had not formally contributed that code but it was posted before via
> email.
>
> Here is an initial approach developed where rowid will output one "part"
> file for each input "part" file processed:
>
> http://mail-archives.apache.org/mod_mbox/mahout-user/201208.mbox/%3CCAOeeJfiuPMv=vs8rm4co0mjr-bewgecayv3mhbb+yebq_o9...@mail.gmail.com%3E
>
> And this code will enable one to split the data up more via an optional "m"
> parameter that enables one to specify how many vectors (max) to write to a
> part file:
>
> http://permalink.gmane.org/gmane.comp.apache.mahout.devel/21821
>
> These were just some quickly developed utilities written some months ago
> when working with CVB. Obviously there are other ways to split the data
> up. You could also write software to post-process rowid's Matrix output
> file and split it up so more mappers run.
>
> Lately I have been doing more with the Mahout k-means algorithm since I
> wanted to be able to cluster lots of documents in a timely manner.
>
> As specified in the thread you posted below, the run-time of LDA/CVB is
> very sensitive to the size of the dictionary processed. This also
> affects mapper heap space requirements where each mapper needs to store
> (dictionary size * k * 8 * 2) in memory. We also ran into trouble before
> with running out of mapper heap space when "dictionary size" and/or "k"
> increased a lot, so we had to reconfigure Hadoop for more mapper heap space
> (changed to 1 GB; no big deal to do).
>
> So yes, depending on how much data you are clustering and dictionary size,
> it could take a long time to run.
>
> Dan
>
> From: Arni Sumarlidason <[email protected]>
> To: DAN HELM <[email protected]>
> Cc: [email protected]
> Sent: Sunday, November 4, 2012 5:44 PM
> Subject: Re: Mahout: CVB: Error
>
> Dan,
>
> Regarding this thread,
> http://comments.gmane.org/gmane.comp.apache.mahout.user/13641
>
> Did you publish your modification to the rowid function enabling the
> splitting of Matrix files? A single pass on my data takes 9 hours. Does
> this sound reasonable to you? Please advise.
>
> Best,
>
> Arni
>
> On Nov 3, 2012, at 8:38 PM, DAN HELM <[email protected]> wrote:
>
> Arni,
>
> I believe you are running with the wrong input for the cvb command:
> ./mahout cvb -i /user/root/sparse-vectors-cvb/docIndex .....
>
> It should be:
> ./mahout cvb -i /user/root/sparse-vectors-cvb/matrix .....
>
> docIndex is a file generated by rowid that provides a mapping from the
> original sparse vector keys (in Text format) to the Integer keys assigned
> by rowid.
>
> Dan
>
> From: Arni Sumarlidason <[email protected]>
> To: [email protected]
> Sent: Saturday, November 3, 2012 6:35 PM
> Subject: Mahout: CVB: Error
>
> Good Evening, Thank you for reading.. I am trying to run CVB on mahout
> 0.8...
>
> I have successfully executed the following steps:
> ./mahout seqdirectory --input /user/root/lda --output text_seq -c UTF-8 -ow -chunk 8
> Resulting in 20 chunk files.
>
> ./mahout seq2sparse -i text_seq -o text_vec -wt tf -a org.apache.lucene.analysis.WhitespaceAnalyzer -ow
> Resulting in 109MB vector, "part-r-00000", "dictionary.file-0", and more.
>
> ./mahout rowid -i text_vec/tf-vectors -o sparse-vectors-cvb
> Resulting in "docIndex" & "matrix"
>
> Now... When attempting to run the following command,
> ./mahout cvb -i /user/root/sparse-vectors-cvb/docIndex -o text_lda -k 100 -x 20 -dict text_vec/dictionary.file-0 -dt text_cvb_document -mt text_states
> Resulting in an error: No part files found in model path
> 'text_states/model-1'
>
> Can someone please point me in the right direction?
>
> Best regards,
>
> Arni
>

--

  -jake
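
Dan's rule of thumb quoted in this thread, (dictionary size * k * 8 * 2) per mapper, can be turned into a quick back-of-the-envelope estimate. The following is a minimal sketch, assuming the 8 stands for bytes per double-precision value and the 2 for two copies of the topic-term model; the thread states neither. The function name and the dictionary sizes are hypothetical; k = 100 is the value used in the cvb commands in the thread.

```python
def cvb_mapper_model_bytes(dictionary_size: int, num_topics: int,
                           bytes_per_value: int = 8, copies: int = 2) -> int:
    """Estimate per-mapper model memory as dictionary size * k * 8 * 2.

    bytes_per_value=8 and copies=2 are assumed interpretations of the
    constants in the rule of thumb; they are not confirmed by the thread.
    """
    return dictionary_size * num_topics * bytes_per_value * copies


def to_mib(num_bytes: int) -> float:
    """Convert a byte count to mebibytes."""
    return num_bytes / (1024 * 1024)


if __name__ == "__main__":
    k = 100  # number of topics, as in the cvb commands in the thread
    # Hypothetical dictionary sizes, chosen only for illustration.
    for dictionary_size in (50_000, 200_000, 1_000_000):
        estimate = cvb_mapper_model_bytes(dictionary_size, k)
        print(f"{dictionary_size:>9} terms, k={k}: {to_mib(estimate):8.1f} MiB")
```

Under these assumptions, a one-million-term dictionary with k = 100 comes to roughly 1.5 GiB, which would not fit in the 1 GB mapper heap Dan mentions, so pruning the dictionary or raising the heap would be the levers to look at.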
