Re: lda + vector dump

Liz Merkhofer Fri, 23 Aug 2013 08:38:43 -0700

Hi Charly,

I've been playing around with cvb, too. I have a few thoughts on b,
vectordump:

What are you doing vectordump on? If you're doing it on your cvb output,
you're getting something like a dictionary per topic, with
<input-word-key>:<probability-it's-in-this-cluster>. If you're doing it on
cvb-topics output, for each document, you're getting the likelihood that it
belongs to each of your topics.
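
To make the difference concrete, here is a toy Python sketch. It is not Mahout code, and the vocabulary, weights, and helper names are all made up for illustration. It only shows that a topic vector is indexed by term id, while a per-document vector is indexed by topic id.

```python
# Toy illustration only: made-up data, not actual Mahout output.

# Hypothetical vocabulary: term id -> word.
vocab = {0: "hadoop", 1: "cluster", 2: "topic", 3: "model"}

# Topic-term output: one vector per topic, keyed by TERM id.
topic_term = {
    0: {0: 0.50, 1: 0.40, 2: 0.05, 3: 0.05},
    1: {0: 0.05, 1: 0.05, 2: 0.50, 3: 0.40},
}

# Doc-topic output: one vector per document, keyed by TOPIC id.
doc_topic = {
    "doc-a": {0: 0.9, 1: 0.1},
    "doc-b": {0: 0.2, 1: 0.8},
}


def top_terms(vector, vocab, n=2):
    """Return the n highest-weighted words of a topic-term vector."""
    ranked = sorted(vector.items(), key=lambda kv: kv[1], reverse=True)
    return [vocab[term_id] for term_id, _ in ranked[:n]]


def dominant_topic(vector):
    """Return the topic id with the largest weight in a doc-topic vector."""
    return max(vector, key=vector.get)
```

Only the first kind of vector has keys that mean anything in terms of the vocabulary.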

I wonder if your problem is that you read the same book I did, "Hadoop
MapReduce Cookbook," which advises running vectordump with the dictionary
flag set to your dictionary from s2s. Don't do that on document vectors:
it translates your document or topic keys as if they were your vocab keys,
and the result is completely nonsensical.
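
A toy Python sketch of why that goes wrong, again with made-up data and a hypothetical helper rather than anything from Mahout itself: relabeling a per-document vector through the vocabulary pairs topic weights with unrelated words.

```python
# Toy illustration only: made-up data, not actual Mahout output.

# Hypothetical vocabulary: term id -> word.
vocab = {0: "hadoop", 1: "cluster", 2: "topic", 3: "model"}

# A doc-topic vector: the keys are TOPIC ids, not term ids.
doc_topic_vector = {0: 0.9, 1: 0.1}


def label_with_dictionary(vector, dictionary):
    """Relabel vector keys through a dictionary, leaving unknown keys as-is."""
    return {dictionary.get(key, key): weight for key, weight in vector.items()}


# Topic 0 gets printed as "hadoop" and topic 1 as "cluster",
# although neither word has anything to do with those weights.
mislabeled = label_with_dictionary(doc_topic_vector, vocab)
```

The output looks plausible because every key resolves to a real word, which is what makes the mistake easy to miss.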

Best,
Liz Merkhofer

On Fri, Aug 23, 2013 at 11:18 AM, Charly Lizarralde <
[email protected]> wrote:

> Hi everyone, I am experimenting with cvb algorithm and I have a few
> questions....
>
> a) Is there any updated documentation? I have been collecting info from
> mailing lists, blogs, etc. I have been writing a small beginner's tutorial;
> if you like I'll send it.
>
> b) Should I remove "stop-words" before building the feature vectors ? I am
> having some trouble "reading" the results....
>
> c) Vectordump is not sorting well...is this a reported bug ? ( I am
> building mahout from trunk now )
>
> d) Any considerations on performance? It took 10 hours on a 5 node cluster,
> and I've set 20 iterations on fewer than 10.000 docs.
>
> Thanks!
> Charly
>
