Dan,

Thank you for your time, patience, and detailed response.

Another question, about the results I’m receiving: I don’t understand them :( I’ve run this command:

    ./mahout cvb -i /user/root/sparse-vectors-cvb/matrix -o text_lda_sr -k 100 -x 1 -dict text_vec/dictionary.file-0 -dt text_cvb_document_sr -mt text_states_sr

Followed by:

    ./mahout vectordump -i /user/root/text_cvb_document_sr -d text_vec/dictionary.file-0 -dt sequencefile -o lda-cvb-topics.txt

I get a text file with term frequencies, but I get one line per document I originally created vectors from, not the 100 topics. Am I doing something wrong?

Thank you for your help,

From: DAN HELM [mailto:[email protected]]
Sent: Sunday, November 04, 2012 6:43 PM
To: Arni Sumarlidason
Cc: [email protected]
Subject: Re: Mahout: CVB: Error

Arni,

I had not formally contributed that code but it was posted before via email. Here is an initial approach developed where rowid will output one "part" file for each input "part" file processed:

http://mail-archives.apache.org/mod_mbox/mahout-user/201208.mbox/%3CCAOeeJfiuPMv=vs8rm4co0mjr-bewgecayv3mhbb+yebq_o9...@mail.gmail.com%3E

And this code will enable one to split the data up more via an optional "m" parameter that enables one to specify how many vectors (max) to write to a part file:

http://permalink.gmane.org/gmane.comp.apache.mahout.devel/21821

These were just some quickly developed utilities written some months ago when working with CVB. Obviously there are other ways to split the data up.
You could also write software to post-process rowid's Matrix output file and split it up so more mappers run.

Lately I have been doing more with the Mahout k-means algorithm since I wanted to be able to cluster lots of documents in a timely manner.

As specified in the thread you posted below, the run-time of LDA/CVB is very sensitive to the size of the dictionary processed. This also affects mapper heap space requirements, where each mapper needs to store (dictionary size * k * 8 * 2) in memory. We also ran into trouble before with running out of mapper heap space when "dictionary size" and/or "k" increased a lot, so we had to reconfigure Hadoop for more mapper heap space (changed to 1 GB; no big deal to do).

So yes, depending on how much data you are clustering and the dictionary size, it could take a long time to run.

Dan

From: Arni Sumarlidason <[email protected]>
To: DAN HELM <[email protected]>
Cc: "[email protected]" <[email protected]>
Sent: Sunday, November 4, 2012 5:44 PM
Subject: Re: Mahout: CVB: Error

Dan,

Regarding this thread, http://comments.gmane.org/gmane.comp.apache.mahout.user/13641

Did you publish your modification to the rowid function enabling the splitting of Matrix files? A single pass on my data takes 9 hours. Does this sound reasonable to you? Please advise.

Best,
Arni

On Nov 3, 2012, at 8:38 PM, DAN HELM <[email protected]> wrote:

Arni,

I believe you are running with the wrong input for the cvb command:

    ./mahout cvb -i /user/root/sparse-vectors-cvb/docIndex .....

It should be:

    ./mahout cvb -i /user/root/sparse-vectors-cvb/Matrix .....

docIndex is a file generated by rowid that provides a mapping between the original sparse vector keys (in Text format) and the Integer keys assigned by rowid.
Dan

From: Arni Sumarlidason <[email protected]>
To: "[email protected]" <[email protected]>
Sent: Saturday, November 3, 2012 6:35 PM
Subject: Mahout: CVB: Error

Good Evening,

Thank you for reading. I am trying to run CVB on Mahout 0.8. I have successfully executed the following steps:

    ./mahout seqdirectory --input /user/root/lda --output text_seq -c UTF-8 -ow -chunk 8

Resulting in 20 chunk files.

    ./mahout seq2sparse -i text_seq -o text_vec -wt tf -a org.apache.lucene.analysis.WhitespaceAnalyzer -ow

Resulting in 109MB vector, "part-r-00000", "dictionary.file-0", and more.

    ./mahout rowid -i text_vec/tf-vectors -o sparse-vectors-cvb

Resulting in "docIndex" & "matrix".

Now, attempting to run the following command

    ./mahout cvb -i /user/root/sparse-vectors-cvb/docIndex -o text_lda -k 100 -x 20 -dict text_vec/dictionary.file-0 -dt text_cvb_document -mt text_states

results in an error:

    No part files found in model path 'text_states/model-1'

Can someone please point me in the right direction?

Best regards,
Arni
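
Editorial note on the heap estimate quoted in Dan's November 4 reply (dictionary size * k * 8 * 2): the arithmetic can be worked through with a small Python sketch. Reading the product as a byte count (8 bytes per double, two copies of the topic-term matrix) is an assumption, as are the function name and the 500,000-term dictionary size, which is purely illustrative and does not appear in the thread. Only k = 100 and the 1 GB heap setting come from the messages above.

```python
# Rough per-mapper heap estimate for CVB, following the rule of thumb
# quoted in the thread: dictionary size * k * 8 * 2.
# Assumption: the result is in bytes (8 bytes per double, 2 model copies).

BYTES_PER_DOUBLE = 8
MODEL_COPIES = 2


def mapper_heap_bytes(dictionary_size: int, num_topics: int) -> int:
    """Return the estimated bytes each mapper holds for the topic-term model."""
    return dictionary_size * num_topics * BYTES_PER_DOUBLE * MODEL_COPIES


if __name__ == "__main__":
    k = 100                    # number of topics, as in the thread (-k 100)
    dictionary_size = 500_000  # hypothetical; not stated in the thread
    estimate = mapper_heap_bytes(dictionary_size, k)
    heap_limit = 1024 ** 3     # 1 GB mapper heap, as mentioned in the thread

    print(f"estimated model size: {estimate / 2**20:.0f} MiB")
    print(f"fits in 1 GB heap: {estimate < heap_limit}")
```

Under these assumptions the model alone takes roughly 763 MiB, which leaves little headroom in a 1 GB heap; doubling either the dictionary size or k would exceed it, which is consistent with the out-of-heap failures described in the thread.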
