Hi, I'm new here, so please forgive my limited experience with Mahout.
We're trying to use Mahout (on our Hadoop cluster) to compute topics on
almost 14,000 documents.
I've been following this wiki page (http://goo.gl/DcPVjB) but still getting
errors.
Here's what I'm doing:
1) creating sequence
Hi,
It sounds to me like this could be related to one of the questions I posted
several days ago (is it?):
My Mahout clustering processes seem to be running very slowly (a good several
hours on just ~1M items), and I'm wondering if anything needs to be changed
in the settings/configuration.
RowId job creates a matrix (IntWritable, VectorWritable) and a docIndex
(IntWritable, Text).
So you should see two generated files: jojoba/matrix/matrix and
jojoba/matrix/docIndex.
It seems like you have been feeding docIndex as input to cvb, which would cause
this exception; it's the
Thanks for your response.
I'm still confused as I'm trying to run this on real data rather than the
reuters example:
If I run kmeans on my data: mahout kmeans -k 5 -i inputSeq.dat -o outputPath
--maxIter 2 --clusters outputSeeds
It creates a directory containing clusters-*, including the
Hi All,
I want to group together documents that share the same context but belong to
one single domain. I have tried the KMeans and LDA implementations provided in
Mahout to perform the clustering, but the groups that are generated are not
very good. Hence I thought of using LSA to identify the context-related
Oops! That did the trick.
Nonetheless, I think the fact that you have to run rowid and generate the
matrix should be added to the wiki.
After waiting for more than an hour I got an error on
Writing final document/topic inference from lda/matrix/matrix to
jojoba/do-output
the error is :
If you're supplying a dictionary file (as you are), I'd suggest not
specifying the -nt 9 option - you're apparently specifying a numTerms
less than the actual number of terms in some of your vectors. If you
supply the -dict option, it'll infer the number of terms from reading the
dictionary,
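The point about -dict vs -nt can be sketched in a few lines. This is a hypothetical illustration, not Mahout's actual code: Mahout's dictionary is really a SequenceFile of (term, index) pairs, mimicked here with a plain list of tuples.

```python
# Hypothetical sketch of how numTerms can be inferred from a dictionary
# instead of trusting a user-supplied -nt value: numTerms must exceed every
# column index any vector can use, so take max index + 1.
def infer_num_terms(dictionary_entries):
    """dictionary_entries: iterable of (term, index) pairs."""
    return max(index for _, index in dictionary_entries) + 1

entries = [("apache", 0), ("mahout", 1), ("topic", 2), ("cluster", 3)]
print(infer_num_terms(entries))  # 4: indices 0..3 imply 4 terms
```

Supplying an -nt smaller than this inferred value is exactly the mismatch that causes the exception discussed above.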
OK, I'll re-run it without that -nt option (which I supposed was NOT optional).
Meanwhile I've re-run it on a smaller dataset, and though it ran successfully
(and faster!), whenever I run vectordump I get a heap-space error, even though
we've updated MAHOUT_HEAPSIZE to 1m.
On Wed, Jul 31, 2013 at 7:44 AM, Marco zentrop...@yahoo.co.uk wrote:
OK, I'll re-run it without that -nt option (which I supposed was NOT optional).
Well, it's not optional if you don't supply a dictionary (which is
optional) - one of the two is necessary, or else the system doesn't know
how big to
@Marco, look at examples/bin/cluster-reuters.sh for reference on how to run cvb
(or any other clustering algo in Mahout)
and also on how to invoke the vectordump with the option flags.
From: Jake Mannix jake.man...@gmail.com
To: user@mahout.apache.org
running:
mahout vectordump -i jojoba/to-output -d jojoba/vectors/dictionary.file-0 -dt
sequencefile --vectorSize 10 -sort jojoba/to-output
it's mahout 0.7 (we're using cloudera CDH4.2)
From: Jake Mannix jake.man...@gmail.com
To: user@mahout.apache.org
Please work off of Mahout 0.8; there are a lot of fixes and improvements that
went into CVB0 in this release.
Correct me here Jake?
From: Marco zentrop...@yahoo.co.uk
To: user@mahout.apache.org user@mahout.apache.org
Sent: Wednesday, July 31, 2013 11:01 AM
Already looked there. No cvb example or vectordump :(
From: Suneel Marthi suneel_mar...@yahoo.com
To: user@mahout.apache.org user@mahout.apache.org; Marco
zentrop...@yahoo.co.uk
Sent: Wednesday, July 31, 2013 4:55 PM
Subject: Re: Latent Dirichlet Allocation
On Wed, Jul 31, 2013 at 8:01 AM, Marco zentrop...@yahoo.co.uk wrote:
running:
mahout vectordump -i jojoba/to-output -d jojoba/vectors/dictionary.file-0
-dt sequencefile --vectorSize 10 -sort jojoba/to-output
Yeah, that looks right.
it's mahout 0.7 (we're using cloudera CDH4.2)
Ah,
CVB was added to cluster_reuters.sh in 0.8; you wouldn't see it in 0.7.
I suggest that you work off of 0.8.
From: Marco zentrop...@yahoo.co.uk
To: user@mahout.apache.org user@mahout.apache.org; Suneel Marthi
suneel_mar...@yahoo.com
Sent: Wednesday, July 31,
Great, at least I know what's wrong :)
I'll check whether Cloudera supports Mahout 0.8.
Meanwhile we'll drop LDA and retry our first approach (k-means).
Thanks everyone!
From: Suneel Marthi suneel_mar...@yahoo.com
To: user@mahout.apache.org
On Wed, Jul 31, 2013 at 8:33 AM, Marco zentrop...@yahoo.co.uk wrote:
will check out if cloudera supports mahout 0.8.
Don't worry about Cloudera support. Mahout support is better. :-)
FWIW I know Mahout 0.8 works fine with CDH4 (the mr1 version of
course) and is what CDH5 will include. Should be no problems there.
On Wed, Jul 31, 2013 at 4:33 PM, Marco zentrop...@yahoo.co.uk wrote:
great. at least i know what's wrong :)
will check out if cloudera supports mahout 0.8.
Many people also use the PCA option of the SSVD workflow and then try to
cluster the output U*Sigma, which is a dimensionally reduced representation of
the original row-wise dataset. To enable PCA and U*Sigma output, use
ssvd -pca true -us true -u false -v false -k=... -q=1 ...
-q=1 recommended for
A few architectural questions: http://bit.ly/18vbbaT
I created a local instance of LucidWorks Search on my dev machine. I can
quite easily save the similarity vectors from the DRMs into docs at special
locations and index them with LucidWorks. But to ingest the docs and put them
in
Assuming I've got this right, does someone want to help with these?
Pat -- I would be interested in helping in any way needed. I believe Ted's
tool is a start, but does not handle all the cases envisioned in the design
doc, although I could be wrong on this. Anyway, I'm pretty open to helping
OK, looks like there *is* some magic in the Lucid config. I believe all I need
to do is write out the docs using Solr XML defining fields for each similarity
type and the doc name. The rest can be done by standard Lucid hand
configuration. I believe this will minimally handle #3 below.
On
I'm interested in helping as well.
Btw, I thought that what was stored in the Solr fields was the LLR-filtered
items (IDs, I guess) for the could-be-recommended things.
On Jul 31, 2013 2:31 PM, Andrew Psaltis andrew.psal...@webtrends.com
wrote:
Assuming I've got this right, does someone want to
OK and yes. The docs will look like:
<add>
<doc>
<field name='item_id'>ipad</field>
<field name='similar_items'>iphone</field>
<field name='cross_action_similar_items'>iphone nexus</field>
</doc>
<doc>
<field name='item_id'>iphone</field>
<field
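As a minimal sketch (not Mahout or LucidWorks code), docs like the ones above could be generated from a plain dict of similarities; the field names ('item_id', 'similar_items', 'cross_action_similar_items') follow this thread's examples, and the input shape is an assumption.

```python
import xml.etree.ElementTree as ET

# Sketch: build a Solr <add> update document from item -> {field: value}.
# Values are the space-separated, rank-ordered similar-item lists discussed
# in this thread.
def to_solr_xml(items):
    add = ET.Element("add")
    for item_id, fields in items.items():
        doc = ET.SubElement(add, "doc")
        ET.SubElement(doc, "field", name="item_id").text = item_id
        for fname, value in fields.items():
            ET.SubElement(doc, "field", name=fname).text = value
    return ET.tostring(add, encoding="unicode")

xml = to_solr_xml({"ipad": {"similar_items": "iphone",
                            "cross_action_similar_items": "iphone nexus"}})
print(xml)
```

Note ElementTree emits double-quoted attributes; Solr accepts either quoting style.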
On Wed, Jul 31, 2013 at 11:20 AM, Pat Ferrel pat.fer...@gmail.com wrote:
A few architectural questions: http://bit.ly/18vbbaT
I created a local instance of the LucidWorks Search on my dev machine. I
can quite easily save the similarity vectors from the DRMs into docs at
special locations and
The input, which we need synthesized is a log file tsv or csv that looks like
this:
u1	purchase	iphone
u1	purchase	ipad
u2	purchase	nexus-tablet
u2	purchase	galaxy
u3	purchase	surface
u4	purchase	iphone
u4	purchase
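A log like the one above could be synthesized with a few lines of Python. This is a hedged sketch: the user/item names and the single 'purchase' action are made-up sample values matching the thread's example, not output of any Mahout tool.

```python
import random

# Sketch: synthesize a tab-separated action log of (user, action, item) rows,
# in the format requested above. A fixed seed keeps the output reproducible.
def synthesize_log(n_events, users, items, seed=42):
    rng = random.Random(seed)
    return ["\t".join((rng.choice(users), "purchase", rng.choice(items)))
            for _ in range(n_events)]

for line in synthesize_log(5, ["u1", "u2", "u3"], ["iphone", "ipad", "nexus"]):
    print(line)
```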
The fields actually point the other direction. They contain items which,
if they appear in a history, indicate that the current document is a good
recommendation.
This reversal of roles is what makes search work.
Going the other way works for a single doc, but that only gives a list of
IDs
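The role reversal described above can be shown in a toy sketch. This is pure Python standing in for a Solr query, with made-up item names: each doc's indicator field plays the role of 'similar_items', the user's history is the query, and docs are scored by how many of their indicators appear in that history.

```python
# Toy sketch of search-as-recommendation: a doc is a good recommendation when
# items from its indicator field appear in the querying user's history.
def recommend(docs, history, n=2):
    """docs: item_id -> list of indicator items; history: set of items seen."""
    scored = []
    for item_id, indicators in docs.items():
        score = sum(1 for ind in indicators if ind in history)
        if score > 0 and item_id not in history:
            scored.append((score, item_id))
    return [item for _, item in sorted(scored, reverse=True)[:n]]

docs = {"ipad": ["iphone"], "nexus": ["galaxy", "iphone"], "surface": ["galaxy"]}
print(recommend(docs, {"iphone", "galaxy"}))  # ['nexus', 'surface']
```

A real deployment would let the search engine do this scoring (with better ranking than raw overlap), which is exactly why the fields point "the other direction".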
I'd vote for csv then.
On Jul 31, 2013, at 12:00 PM, Ted Dunning ted.dunn...@gmail.com wrote:
On Wed, Jul 31, 2013 at 11:20 AM, Pat Ferrel pat.fer...@gmail.com wrote:
A few architectural questions: http://bit.ly/18vbbaT
I created a local instance of the LucidWorks Search on my dev machine. I
Sorry, not sure what you are saying.
If the LLR-created DRM has a row:
Key: 0, Value { 1:1.0, }
where 0 -> iphone and 1 -> ipad, then wouldn't the doc look like
<doc>
<field name='item_id'>ipad</field>
<field name='similar_items'>iphone</field>
</doc>
or rather the CSV equivalent?
On Jul 31,
Oops, mistyped…
If the LLR-created DRM has a row:
Key: 1, Value { 0:1.0, }
where 0 -> iphone and 1 -> ipad, then wouldn't the doc look like
<doc>
<field name='item_id'>ipad</field>
<field name='similar_items'>iphone</field>
</doc>
On Jul 31, 2013, at 12:14 PM, Pat Ferrel pat.fer...@gmail.com
Hi Ted
I can't tell who you're responding to (I'm thinking me, as I worded things
ambiguously). I was restating my original thoughts on how it was to be set
up, which you had earlier confirmed (I think), but what I wrote could be read
in two ways.
I think Pat's last post with the corrected example jibes
So the XML as CSV would be:
item_id,similar_items,cross_action_similar_items
ipad,iphone,iphone nexus
iphone,ipad,ipad galaxy
Note: as I mentioned before, the order of the items in the field will encode
the rank of the similarity strength. This is for cases where you want to find
similar items to a
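The rank-in-order idea can be sketched directly: sort one LLR DRM row by strength and emit the space-separated field value. This is an illustration, not Mahout code; the index-to-item mapping is a made-up stand-in for the docIndex produced by the rowid job.

```python
# Sketch: turn one DRM row (column index -> LLR strength) into the
# rank-ordered, space-separated field value discussed above. Position in the
# string encodes similarity rank, strongest first.
def row_to_field(row, index_to_item):
    ordered = sorted(row.items(), key=lambda kv: kv[1], reverse=True)
    return " ".join(index_to_item[idx] for idx, _ in ordered)

index_to_item = {0: "iphone", 1: "ipad", 2: "nexus"}
print(row_to_field({0: 2.3, 2: 4.1}, index_to_item))  # nexus iphone
```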
Dear Sebastian,
It looks like setting --maxPrefsPerUser 1 has resolved the issue in our
case. It seems that the most preferences any user had was just about 5,000, so
I doubled it just in case; when I operationalise this model, I will make sure
to calculate the actual max number of
Ideally, you would file a bug and see whether it still happens with trunk.
I think the problem comes from the fact that we only use a certain number
of preferences from each user for the final recommendation phase. Therefore
we can hit an item as a recommendation whose preference we neglected.
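The failure mode Sebastian describes can be shown with an illustrative sketch (this is not Mahout's actual code): if only the first maxPrefsPerUser preferences are used in the final phase, a preference that was cut off can come back as a "new" recommendation.

```python
# Sketch of the neglected-preference bug: items are filtered against only the
# capped preference list, so a rated item beyond the cap slips back into recs.
def recommend_naive(all_items, user_prefs, max_prefs_per_user):
    used_prefs = set(user_prefs[:max_prefs_per_user])  # prefs past cap neglected
    return [item for item in all_items if item not in used_prefs]

all_items = ["a", "b", "c", "d"]
user_prefs = ["a", "b", "c"]  # user already rated a, b and c
print(recommend_naive(all_items, user_prefs, max_prefs_per_user=2))
# ['c', 'd']: 'c' was rated, but its preference was neglected by the cap
```

Raising --maxPrefsPerUser above the largest real preference count (as done above) makes the cap a no-op, which is why it resolved the issue.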
Slick idea IMO on the ordering in the field.
FYI, to answer your question: I am new to a lot of these pieces (and without
sustained access to a non-tablet PC for the next four days), and cannot at the
moment be relied on for the demo setup given this apparent pace, but would
like to help as much as possible with
Removing previously recommended items, items already in the training data, or
items already marked as "don't show" are all better handled in the
presentation layer with other business logic.
The rationale is that there is no single correct answer for any of these.
Recommending razor blades to somebody
On Wed, Jul 31, 2013 at 3:20 PM, Sebastian Schelter s...@apache.org wrote:
That's true in general, but for use cases such as generating recommendations
in batch for personalized newsletters, it's a nice-to-have feature.
I also have the impression that most users expect not to see items with
Perhaps wrongly, but RecommenderJob has been a gateway to Mahout for my
colleagues and me. It is easy to use and intuitive. We are currently using it
for an early stage of buying-gap analysis. The fact that it would not recommend
items with an expressed prior preference was key to considering
On Wed, Jul 31, 2013 at 4:06 PM, Rafal Lukawiecki
ra...@projectbotticelli.com wrote:
Many thanks, I'll report the issue, when I figure out where. :)
I can help with that!
https://issues.apache.org/jira/browse/MAHOUT
Thank you!
In general, should I be putting our efforts into using 0.8, or sticking with
0.7 for now, re RecommenderJob?
On another note, which might be a different thread, but would you have any
ready-made accuracy and reliability validation code to suggest when using
RecommenderJob, or do I need
Pat,
See inline
On Wed, Jul 31, 2013 at 1:29 PM, Pat Ferrel p...@occamsmachete.com wrote:
So the XML as CSV would be:
item_id,similar_items,cross_action_similar_items
ipad,iphone,iphone nexus
iphone,ipad,ipad galaxy
Right. Doesn't matter what format. Might want quotes around space
Hi all,
This question stems from my use of the alternating least squares method in
Mahout, but leans toward the theoretical side. If this is the wrong place for
such a question, I apologize up front and would gladly redirect my question
to a more appropriate forum, per your suggestions.
I have been