RE: Mahout: CVB: Error

Arni Sumarlidason Tue, 06 Nov 2012 06:46:52 -0800

Dan,

Thank you for your time, patience, and detailed response.


Another question: I don’t understand the results I’m receiving :(

I’ve run this command:
./mahout cvb -i /user/root/sparse-vectors-cvb/matrix -o text_lda_sr -k 100 -x 1 -dict text_vec/dictionary.file-0 -dt text_cvb_document_sr -mt text_states_sr
Followed by:
./mahout vectordump -i /user/root/text_cvb_document_sr -d text_vec/dictionary.file-0 -dt sequencefile -o lda-cvb-topics.txt

I get a text file with term frequencies, but it contains one line per document 
I originally created vectors from, not one line for each of the 100 topics. 
Am I doing something wrong?

Thank you for your help,


From: DAN HELM [mailto:[email protected]]
Sent: Sunday, November 04, 2012 6:43 PM
To: Arni Sumarlidason
Cc: [email protected]
Subject: Re: Mahout: CVB: Error

Arni,

I had not formally contributed that code but it was posted before via email.

Here is an initial approach developed where rowid will output one "part" file 
for each input "part" file processed:

http://mail-archives.apache.org/mod_mbox/mahout-user/201208.mbox/%3CCAOeeJfiuPMv=vs8rm4co0mjr-bewgecayv3mhbb+yebq_o9...@mail.gmail.com%3E

And this code lets you split the data up further via an optional "m" 
parameter that specifies the maximum number of vectors to write to each part 
file:

http://permalink.gmane.org/gmane.comp.apache.mahout.devel/21821

These were just some quickly developed utilities written some months ago when 
working with CVB.   Obviously there are other ways to split the data up.  You 
could also write software to post-process rowid's Matrix output file and split 
it up so more mappers run.
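
To get a feel for how a per-file cap like the "m" parameter translates into part files, here is a minimal arithmetic sketch. The vector count and cap are hypothetical values, and the idea that each part file ends up with its own mapper is an assumption that depends on file sizes and the Hadoop split settings.

```shell
#!/bin/sh
# Hypothetical values for illustration only; substitute your own.
total_vectors=250000   # number of document vectors in the matrix (assumed)
max_per_part=10000     # cap on vectors per part file, i.e. the "m" value (assumed)

# Ceiling division: the last part file may hold fewer than max_per_part vectors.
parts=$(( (total_vectors + max_per_part - 1) / max_per_part ))

echo "part files: ${parts}"
```

A smaller cap gives more part files, which is what allows more mappers to run in parallel.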

Lately I have been doing more with the Mahout k-means algorithm since I wanted 
to be able to cluster lots of documents in a timely manner.

As specified in the thread you posted below, the run-time of LDA/CVB is very 
sensitive to the size of the dictionary processed.  This also affects mapper 
heap space requirements, since each mapper needs to store (dictionary size * k 
* 8 * 2) in memory.  We also ran into trouble before with running out of mapper 
heap space when "dictionary size" and/or "k" increased a lot, so we had to 
reconfigure Hadoop for more mapper heap space (changed to 1 GB; no big deal to 
do).
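
A rough back-of-the-envelope sketch of that estimate follows. It assumes the result is in bytes (8 bytes per double, two copies of the model), which the formula above does not state explicitly, and it uses a hypothetical dictionary size of 500,000 terms with k = 100.

```shell
#!/bin/sh
# Hypothetical values for illustration only; substitute your own.
dict_size=500000   # number of terms in the dictionary (assumed)
k=100              # number of topics, as passed to cvb with -k

# dictionary size * k * 8 * 2, interpreted here as bytes (an assumption).
bytes=$(( dict_size * k * 8 * 2 ))
mb=$(( bytes / 1024 / 1024 ))   # integer division, rounds down

echo "approx. model memory per mapper: ${bytes} bytes (~${mb} MB)"
```

With these made-up numbers the model alone comes to several hundred MB, which shows how quickly a larger dictionary or a larger k can use up the mapper heap.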

So yes, depending on how much data you are clustering and the dictionary size, 
it could take a long time to run.

Dan

From: Arni Sumarlidason <[email protected]>
To: DAN HELM <[email protected]>
Cc: [email protected]
Sent: Sunday, November 4, 2012 5:44 PM
Subject: Re: Mahout: CVB: Error

Dan,

Regarding this thread,
http://comments.gmane.org/gmane.comp.apache.mahout.user/13641

Did you publish your modification to the rowid function that enables splitting 
of Matrix files? A single pass on my data takes 9 hours. Does this sound 
reasonable to you? Please advise.

Best,

Arni

On Nov 3, 2012, at 8:38 PM, DAN HELM <[email protected]> wrote:


Arni,

I believe you are running with the wrong input for the cvb command:
./mahout cvb -i /user/root/sparse-vectors-cvb/docIndex .....

It should be:
./mahout cvb -i /user/root/sparse-vectors-cvb/matrix .....

docIndex is a file generated by rowid that provides a mapping from the 
original sparse vector keys (in Text format) to the Integer keys assigned by 
rowid.

Dan

From: Arni Sumarlidason <[email protected]>
To: [email protected]
Sent: Saturday, November 3, 2012 6:35 PM
Subject: Mahout: CVB: Error

Good evening, and thank you for reading. I am trying to run CVB on Mahout 0.8.

I have successfully executed the following steps:
./mahout seqdirectory --input /user/root/lda --output text_seq -c UTF-8 -ow -chunk 8
Resulting in 20 chunk files.

./mahout seq2sparse -i text_seq -o text_vec -wt tf -a org.apache.lucene.analysis.WhitespaceAnalyzer -ow
Resulting in 109MB vector, "part-r-00000", "dictionary.file-0", and more.

./mahout rowid -i text_vec/tf-vectors -o sparse-vectors-cvb
Resulting in "docIndex" & "matrix"

Now, when attempting to run the following command:
./mahout cvb -i /user/root/sparse-vectors-cvb/docIndex -o text_lda -k 100 -x 20 -dict text_vec/dictionary.file-0 -dt text_cvb_document -mt text_states
I get an error: No part files found in model path 'text_states/model-1'

Can someone please point me in the right direction?

Best regards,

Arni
