Latent Dirichlet Allocation (cvb)

2013-07-31 Thread Marco
Hi, I'm new here so forgive my little experience with Mahout. We're trying to use Mahout (on our hadoop cluster) for calculating topics on almost 14000 documents. I've been following this wiki page (http://goo.gl/DcPVjB) but still getting errors. Here's what I'm doing: 1) creating sequence

Modify number of mappers for a mahout process?

2013-07-31 Thread Fuhrmann Alpert, Galit
Hi, It sounds to me like this could be related to one of the Qs I've posted several days ago (is it?): My mahout clustering processes seem to be running very slow (several good hours on just ~1M items), and I'm wondering if there's anything that needs to be changed in setting/configuration.

Re: Latent Dirichlet Allocation (cvb)

2013-07-31 Thread Suneel Marthi
The RowId job creates a matrix (IntWritable, VectorWritable) and a docIndex (IntWritable, Text). So you should be seeing 2 files generated: jojoba/matrix/matrix and jojoba/matrix/docIndex. It seems you have been feeding docIndex as input to cvb, which would cause this exception; it's the

RE: mahout kmeans not generating clusteredPoint dir?

2013-07-31 Thread Fuhrmann Alpert, Galit
Thanks for your response. I'm still confused as I'm trying to run this on real data rather than the reuters example: If I run kmeans on my data: mahout kmeans -k 5 -i inputSeq.dat -o outputPath --maxIter 2 --clusters outputSeeds It creates a directory containing clusters-*, including the

How to SSVD output to generate Clusters

2013-07-31 Thread Stuti Awasthi
Hi All, I want to group together documents that share the same context but belong to one single domain. I have tried KMeans and LDA provided in Mahout to perform the clustering, but the groups which are generated are not very good. Hence I thought to use LSA to identify the context related

Re: Latent Dirichlet Allocation (cvb)

2013-07-31 Thread Marco
oops! that did the trick. nonetheless i think the fact that you have to do rowid and generate the matrix should be added to the wiki. after waiting for more than an hour i got an error on Writing final document/topic inference from lda/matrix/matrix to jojoba/do-output. the error is:

Re: Latent Dirichlet Allocation (cvb)

2013-07-31 Thread Jake Mannix
If you're supplying a dictionary file (as you are), I'd suggest not specifying the -nt 9 option - you're apparently specifying a numTerms less than the actual number of terms in some of your vectors. If you supply the -dict option, it'll infer the number of terms from reading the dictionary,

Re: Latent Dirichlet Allocation (cvb)

2013-07-31 Thread Marco
ok. i'll re-run it without that nt (which i supposed was NOT optional). meanwhile i've re-run it on a smaller dataset, and though it ran successfully (and faster!), when i run vectordump i always get a Heap space issue even though we've updated MAHOUT_HEAPSIZE to 1m

Re: Latent Dirichlet Allocation (cvb)

2013-07-31 Thread Jake Mannix
On Wed, Jul 31, 2013 at 7:44 AM, Marco zentrop...@yahoo.co.uk wrote: ok. i'll re run it without that nt (which i supposed was NOT optional). Well, it's not optional if you don't supply a dictionary (which is optional) - one of the two is necessary, or else the system doesn't know how big to
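
The constraint described in this reply (a declared term count must cover every term index that occurs in the vectors) can be checked offline. A minimal Python sketch, assuming the term vectors are available as plain index-to-weight mappings; the function names and sample data are hypothetical and not part of Mahout:

```python
def required_num_terms(vectors):
    """Smallest vocabulary size that can hold every term index in `vectors`."""
    largest = max((max(v) for v in vectors if v), default=-1)
    return largest + 1


def overflowing_docs(vectors, num_terms):
    """Positions of vectors whose largest index does not fit in `num_terms`."""
    return [i for i, v in enumerate(vectors) if v and max(v) >= num_terms]


# Hypothetical sparse documents: {term index: weight}
docs = [{0: 1.0, 3: 2.0}, {8: 1.0}, {12: 0.5, 2: 1.0}]
print(required_num_terms(docs))    # vocabulary size the data actually needs
print(overflowing_docs(docs, 9))   # documents that would break a declared size of 9
```

If the second call returns a non-empty list, the declared size is too small for the data, which is the situation the reply attributes to the -nt 9 setting.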

Re: Latent Dirichlet Allocation (cvb)

2013-07-31 Thread Suneel Marthi
@Marco, look at examples/bin/cluster-reuters.sh for reference on how to run cvb (or any other clustering algo in Mahout) and also on how to invoke the vectordump with the option flags. From: Jake Mannix jake.man...@gmail.com To: user@mahout.apache.org

Re: Latent Dirichlet Allocation (cvb)

2013-07-31 Thread Marco
running: mahout vectordump -i jojoba/to-output -d jojoba/vectors/dictionary.file-0 -dt sequencefile --vectorSize 10 -sort jojoba/to-output it's mahout 0.7 (we're using cloudera CDH4.2) From: Jake Mannix jake.man...@gmail.com To: user@mahout.apache.org

Re: Latent Dirichlet Allocation (cvb)

2013-07-31 Thread Suneel Marthi
Please work off of Mahout 0.8; there are a lot of fixes and improvements that went in for CVB0 in this release. Correct me here, Jake? From: Marco zentrop...@yahoo.co.uk To: user@mahout.apache.org user@mahout.apache.org Sent: Wednesday, July 31, 2013 11:01 AM

Re: Latent Dirichlet Allocation (cvb)

2013-07-31 Thread Marco
already looked there. no cvb example or vectordump :( From: Suneel Marthi suneel_mar...@yahoo.com To: user@mahout.apache.org user@mahout.apache.org; Marco zentrop...@yahoo.co.uk Sent: Wednesday, 31 July 2013 16:55 Subject: Re: Latent Dirichlet Allocatio

Re: Latent Dirichlet Allocation (cvb)

2013-07-31 Thread Jake Mannix
On Wed, Jul 31, 2013 at 8:01 AM, Marco zentrop...@yahoo.co.uk wrote: running: mahout vectordump -i jojoba/to-output -d jojoba/vectors/dictionary.file-0 -dt sequencefile --vectorSize 10 -sort jojoba/to-output Yeah, that looks right. it's mahout 0.7 (we're using cloudera CDH4.2) Ah,

Re: Latent Dirichlet Allocation (cvb)

2013-07-31 Thread Suneel Marthi
CVB was added to cluster-reuters.sh in 0.8; you wouldn't see it in 0.7. I suggest that you work off of 0.8. From: Marco zentrop...@yahoo.co.uk To: user@mahout.apache.org user@mahout.apache.org; Suneel Marthi suneel_mar...@yahoo.com Sent: Wednesday, July 31,

Re: Latent Dirichlet Allocation (cvb)

2013-07-31 Thread Marco
great. at least i know what's wrong :) will check out if cloudera supports mahout 0.8. meanwhile we'll drop LDA and retry our first approach (k-means). thanks everyone! From: Suneel Marthi suneel_mar...@yahoo.com To: user@mahout.apache.org

Re: Latent Dirichlet Allocation (cvb)

2013-07-31 Thread Ted Dunning
On Wed, Jul 31, 2013 at 8:33 AM, Marco zentrop...@yahoo.co.uk wrote: will check out if cloudera supports mahout 0.8. Don't worry about Cloudera support. Mahout support is better. :-)

Re: Latent Dirichlet Allocation (cvb)

2013-07-31 Thread Sean Owen
FWIW I know Mahout 0.8 works fine with CDH4 (the mr1 version of course) and is what CDH5 will include. Should be no problems there. On Wed, Jul 31, 2013 at 4:33 PM, Marco zentrop...@yahoo.co.uk wrote: great. at least i know what's wrong :) will check out if cloudera supports mahout 0.8.

Re: How to SSVD output to generate Clusters

2013-07-31 Thread Dmitriy Lyubimov
Many people also use the PCA workflow option with SSVD and then try to cluster the output U*Sigma, which is a dimensionally reduced representation of the original row-wise dataset. To enable PCA and U*Sigma output, use ssvd -pca true -us true -u false -v false -k=... -q=1 ... -q=1 recommended for
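
What the U*Sigma output represents can be illustrated on a small dense matrix. This is a NumPy sketch under loud assumptions: it uses an exact dense SVD rather than Mahout's distributed stochastic algorithm, and the function name and sample rows are hypothetical.

```python
import numpy as np


def pca_u_sigma(rows, k):
    """Project mean-centered rows onto the top-k principal directions (U * Sigma)."""
    x = np.asarray(rows, dtype=float)
    centered = x - x.mean(axis=0)          # PCA works on column-centered data
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    return u[:, :k] * s[:k]                # scale each left singular vector


# Hypothetical 4 x 3 dataset, reduced to 2 dimensions per row
rows = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.5], [3.0, 6.0, 9.0], [4.0, 8.0, 12.5]]
reduced = pca_u_sigma(rows, k=2)
print(reduced.shape)  # one k-dimensional row per input row
```

Each row of the result is the reduced representation of the corresponding input row, so the result can be handed to a clustering step such as k-means in place of the original vectors.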

Re: Setting up a recommender

2013-07-31 Thread Pat Ferrel
A few architectural questions: http://bit.ly/18vbbaT I created a local instance of the LucidWorks Search on my dev machine. I can quite easily save the similarity vectors from the DRMs into docs at special locations and index them with LucidWorks. But to ingest the docs and put them in

Re: Setting up a recommender

2013-07-31 Thread Andrew Psaltis
Assuming I've got this right, does someone want to help with these? Pat -- I would be interested in helping in any way needed. I believe Ted's tool is a start, but does not handle all the cases envisioned in the design doc, although I could be wrong on this. Anyway I'm pretty open to helping

Re: Setting up a recommender

2013-07-31 Thread Pat Ferrel
OK, looks like there *is* some magic in the Lucid config. I believe all I need to do is write out the docs using Solr XML defining fields for each similarity type and the doc name. The rest can be done by standard Lucid hand configuration. I believe this will minimally handle #3 below. On

Re: Setting up a recommender

2013-07-31 Thread B Lyon
I'm interested in helping as well. Btw I thought that what was stored in the solr fields were the llr-filtered items (ids I guess) for the could-be-recommended things. On Jul 31, 2013 2:31 PM, Andrew Psaltis andrew.psal...@webtrends.com wrote: Assuming I've got this right, does someone want to

Re: Setting up a recommender

2013-07-31 Thread Pat Ferrel
OK and yes. The docs will look like: <add> <doc> <field name='item_id'>ipad</field> <field name='similar_items'>iphone</field> <field name='cross_action_similar_items'>iphone nexus</field> </doc> <doc> <field name='item_id'>iphone</field> <field

Re: Setting up a recommender

2013-07-31 Thread Ted Dunning
On Wed, Jul 31, 2013 at 11:20 AM, Pat Ferrel pat.fer...@gmail.com wrote: A few architectural questions: http://bit.ly/18vbbaT I created a local instance of the LucidWorks Search on my dev machine. I can quite easily save the similarity vectors from the DRMs into docs at special locations and

Re: Setting up a recommender

2013-07-31 Thread Pat Ferrel
The input, which we need synthesized, is a log file (tsv or csv) that looks like this:
u1	purchase	iphone
u1	purchase	ipad
u2	purchase	nexus-tablet
u2	purchase	galaxy
u3	purchase	surface
u4	purchase	iphone
u4	purchase
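
A log in this shape can be grouped into per-action user histories before any similarity computation. A minimal Python sketch, assuming three tab-separated columns in the order user, action, item; the function name and the sample string are hypothetical:

```python
import csv
import io
from collections import defaultdict


def read_actions(stream, delimiter="\t"):
    """Group (user, action, item) rows into {action: {user: [items]}}."""
    histories = defaultdict(lambda: defaultdict(list))
    for row in csv.reader(stream, delimiter=delimiter):
        if len(row) != 3:
            continue  # skip blank or incomplete lines
        user, action, item = (field.strip() for field in row)
        histories[action][user].append(item)
    return histories


# Hypothetical sample in the tab-separated layout
sample = "u1\tpurchase\tiphone\nu1\tpurchase\tipad\nu2\tpurchase\tnexus-tablet\n"
log = read_actions(io.StringIO(sample))
print(dict(log["purchase"]))
```

Passing delimiter="," reads the CSV variant with the same code.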

Re: Setting up a recommender

2013-07-31 Thread Ted Dunning
The fields actually point the other direction. They contain items which, if they appear in a history, indicate that the current document is a good recommendation. This reversal of roles is what makes search work. Going the other way works for a single doc, but that only gives a list of id's
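
The reversal described here can be sketched in a few lines of Python: the user's history acts as the query, and a document matches when its indicator field contains items from that history. This is only an illustration under stated assumptions. A real search engine would rank with its own relevance scoring rather than a raw overlap count, and the function name and the third document are hypothetical.

```python
def recommend(docs, history, field="similar_items"):
    """Rank docs by how many history items appear in the indicator field."""
    seen = set(history)
    scored = []
    for doc in docs:
        hits = len(seen & set(doc[field]))
        if hits:
            scored.append((hits, doc["item_id"]))
    # Highest overlap first; ties broken by item id for a stable order.
    return [item for hits, item in sorted(scored, key=lambda p: (-p[0], p[1]))]


docs = [
    {"item_id": "ipad", "similar_items": ["iphone"]},
    {"item_id": "iphone", "similar_items": ["ipad"]},
    {"item_id": "galaxy", "similar_items": ["nexus", "iphone"]},  # hypothetical
]
print(recommend(docs, ["nexus", "iphone"]))
```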

Re: Setting up a recommender

2013-07-31 Thread Pat Ferrel
I'd vote for csv then. On Jul 31, 2013, at 12:00 PM, Ted Dunning ted.dunn...@gmail.com wrote: On Wed, Jul 31, 2013 at 11:20 AM, Pat Ferrel pat.fer...@gmail.com wrote: A few architectural questions: http://bit.ly/18vbbaT I created a local instance of the LucidWorks Search on my dev machine. I

Re: Setting up a recommender

2013-07-31 Thread Pat Ferrel
Sorry, not sure what you are saying. If the LLR created DRM has a row: Key: 0, Value { 1:1.0,} where 0 - iphone and 1 - ipad then wouldn't the doc look like <doc> <field name='item_id'>ipad</field> <field name='similar_items'>iphone</field> </doc> or rather the csv equivalent? On Jul 31,

Re: Setting up a recommender

2013-07-31 Thread Pat Ferrel
oops, mistyped… If the LLR created DRM has a row: Key: 1, Value { 0:1.0,} where 0 - iphone and 1 - ipad then wouldn't the doc look like <doc> <field name='item_id'>ipad</field> <field name='similar_items'>iphone</field> </doc> On Jul 31, 2013, at 12:14 PM, Pat Ferrel pat.fer...@gmail.com

Setting up a recommender

2013-07-31 Thread B Lyon
Hi Ted, I can't tell who you're responding to (thinking me, as I worded things ambiguously). I was restating my original thoughts on how it was to be set up that you had earlier confirmed (I think), but what I wrote could be read in two ways. I think Pat's last post with corrected example jibes

Re: Setting up a recommender

2013-07-31 Thread Pat Ferrel
So the XML as CSV would be:
item_id,similar_items,cross_action_similar_items
ipad,iphone,iphone nexus
iphone,ipad,ipad galaxy
Note: As I mentioned before, the order of the items in the field will encode rank of the similarity strength. This is for cases where you want to find similar items to a
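
Writing the rank-ordered fields can be sketched with Python's csv module. This assumes the similarity rows are held as item-to-strength mappings; the strength values and function names below are hypothetical, and only the column names and item ids come from the message.

```python
import csv
import io


def rank_ordered(similarities):
    """Item ids sorted by descending strength, joined with spaces."""
    ranked = sorted(similarities.items(), key=lambda kv: (-kv[1], kv[0]))
    return " ".join(item for item, _ in ranked)


def write_docs(rows, stream):
    """rows: {item_id: (similar, cross_action_similar)}, each a {item: strength} dict."""
    writer = csv.writer(stream, lineterminator="\n")
    writer.writerow(["item_id", "similar_items", "cross_action_similar_items"])
    for item_id, (similar, cross) in rows.items():
        writer.writerow([item_id, rank_ordered(similar), rank_ordered(cross)])


# Hypothetical strengths; only the ordering ends up in the output
rows = {
    "ipad": ({"iphone": 1.0}, {"iphone": 0.9, "nexus": 0.4}),
    "iphone": ({"ipad": 1.0}, {"ipad": 0.8, "galaxy": 0.3}),
}
out = io.StringIO()
write_docs(rows, out)
print(out.getvalue())
```

The strengths themselves are dropped on purpose: position within the field is the only rank signal that survives.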

Re: RecommenderJob Recommending an Item Already Preferred by a User

2013-07-31 Thread Rafal Lukawiecki
Dear Sebastian, It looks like setting --maxPrefsPerUser 1 has resolved the issue in our case; it seems that the most preferences a user had was just about 5000, so I doubled it just in case, but when I operationalise this model, I will make sure to calculate the actual max number of

Re: RecommenderJob Recommending an Item Already Preferred by a User

2013-07-31 Thread Sebastian Schelter
Ideally, you would file a bug and see whether it still happens with trunk. I think the problem comes from the fact that we only use a certain number of preferences from the user for the final recommendation phase. Therefore we can hit an item as recommendation whose preference we neglected.
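
The effect described in this reply can be reproduced with a toy model in Python. This is not Mahout's code: which preferences are kept under the cap is an assumption here (simply the first max_prefs), and all names and data are hypothetical.

```python
def recommend_with_cap(user_prefs, candidates, max_prefs):
    """Exclude only the first `max_prefs` known items, mimicking a capped history."""
    considered = set(user_prefs[:max_prefs])
    return [item for item in candidates if item not in considered]


prefs = ["a", "b", "c", "d"]       # items the user already expressed a preference for
candidates = ["c", "d", "e"]       # items the similarity step proposes

low_cap = recommend_with_cap(prefs, candidates, max_prefs=2)
high_cap = recommend_with_cap(prefs, candidates, max_prefs=10)
print(low_cap)    # already-preferred items leak through
print(high_cap)   # cap covers the whole history, nothing leaks
```

With the cap below the size of the user's history, the neglected preferences are invisible to the exclusion step, so their items can come back as recommendations.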

Re: Setting up a recommender

2013-07-31 Thread B Lyon
Slick idea IMO on the ordering in the field. Fyi to answer your question I am new to a lot of these pieces (and without sustained access to nontablet pc next four days) and cannot at the moment be relied on for the demo setup given this apparent pace, but would like to help as possible with

Re: RecommenderJob Recommending an Item Already Preferred by a User

2013-07-31 Thread Ted Dunning
Removing items that were previously recommended, are already in the training data, or are already marked as "Don't show" is better handled in the presentation layer with other business logic. The rationale is that there is no single correct answer for any of these. Recommending razor blades to somebody

Re: RecommenderJob Recommending an Item Already Preferred by a User

2013-07-31 Thread Ted Dunning
On Wed, Jul 31, 2013 at 3:20 PM, Sebastian Schelter s...@apache.org wrote: That's true in general, but for use cases such as generating recommendations in batch for personalized newsletters, it's a nice-to-have feature. I also have the impression that most users expect to not see items with

Re: RecommenderJob Recommending an Item Already Preferred by a User

2013-07-31 Thread Rafal Lukawiecki
Perhaps wrongly, but RecommenderJob has been a gateway to Mahout for my colleagues and I. It is easy to use, and intuitive. We are currently using it for an early stage of buying gap analysis. The fact that it would not recommend items with an expressed prior preference was key to considering

Re: RecommenderJob Recommending an Item Already Preferred by a User

2013-07-31 Thread Ted Dunning
On Wed, Jul 31, 2013 at 4:06 PM, Rafal Lukawiecki ra...@projectbotticelli.com wrote: Many thanks, I'll report the issue, when I figure out where. :) I can help with that! https://issues.apache.org/jira/browse/MAHOUT

Re: RecommenderJob Recommending an Item Already Preferred by a User

2013-07-31 Thread Rafal Lukawiecki
Thank you! In general, should I be putting our efforts into using 0.8 or stick with 0.7 for now, re RecommenderJob? On another note, which might be a different thread, but would you have any ready-made accuracy and reliability validation code to suggest when using RecommenderJob, or do I need

Re: Setting up a recommender

2013-07-31 Thread Ted Dunning
Pat, See inline On Wed, Jul 31, 2013 at 1:29 PM, Pat Ferrel p...@occamsmachete.com wrote: So the XML as CSV would be: item_id,similar_items,cross_action_similar_items ipad,iphone,iphone nexus iphone,ipad,ipad galaxy Right. Doesn't matter what format. Might want quotes around space

Data distribution guidance for recommendation engines

2013-07-31 Thread Chloe Guszo
Hi all, This question stems from my use of the alternating least squares method in mahout, but errs on the theoretical side. If this is the wrong place for such a question, I apologize up front and would gladly direct my question to a more appropriate forum, as per your suggestions. I have been