I'm running the command mahout lucene.vectors (via Cygwin) on a Solr (4.4) index (using Mahout 0.8).
I'm getting the following error:
SEVERE: There are too many documents that do not have a term vector for text
Exception in thread "main" java.lang.IllegalStateException: There are too many
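This error typically means the field being exported was indexed without term vectors. A hedged sketch of the schema.xml change that usually addresses it; the field name "text" is taken from the error message, the field type is an assumption, and a full reindex is required afterwards since term vectors cannot be recovered for already-indexed documents:

```xml
<!-- schema.xml: store term vectors for the field that lucene.vectors reads.
     Field name "text" comes from the error message; the type is hypothetical,
     so adjust both to match the actual schema. -->
<field name="text" type="text_general" indexed="true" stored="true"
       termVectors="true" termPositions="true" termOffsets="true"/>
```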
On Thu, Aug 1, 2013 at 3:15 AM, Chloe Guszo chloe.gu...@gmail.com wrote:
If I split my data into train and test sets, I can show good performance of the model on the
Good performance according to what metric? It makes a lot of difference whether you are talking about precision/recall or RMSE.
Hi all,
I have a question when I use RecommenderJob for item-based recommendation.
My input data format is userid,itemid,1, so I set the booleanData option to true.
There are 9,000,000 users but only 200 items.
When I run the RecommenderJob, the result is null. I have tried many times
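For reference, a minimal sketch of the kind of invocation being described; the flag names are from the Mahout 0.7/0.8 CLI, but the paths and parameter values are placeholders, not the poster's actual setup:

```shell
# Hypothetical paths. --booleanData tells the job to ignore the constant
# "1" preference values and use a count-based similarity instead.
mahout recommenditembased \
  --input /user/me/prefs.csv \
  --output /user/me/recs \
  --booleanData true \
  --similarityClassname SIMILARITY_LOGLIKELIHOOD \
  --numRecommendations 10
```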
We are building PredictionIO, which helps to handle a number of business logic requirements. Recommending only items that the user has never expressed a preference for before is supported.
It is a layer on top of Mahout. Hope it is helpful.
Simon
On Wed, Jul 31, 2013 at 4:57 PM, Ted Dunning ted.dunn...@gmail.com
Simon, is there any documentation available, or more info on PredictionIO?
--
Rafal Lukawiecki
Pardon brevity, mobile device.
On 1 Aug 2013, at 09:13, Simon Chan simonc...@gmail.com wrote:
We are building PredictionIO, which helps to handle a number of business logic requirements. Recommending only items
Hi everyone,
Sorry if I'm duplicating the question, but I've been looking for an answer and I haven't found an explanation other than that it's not being used (together with some other algorithms). If it's been discussed in depth before, maybe you can point me to a link to the discussion.
I have
Simon, my apologies for my dumb question. I found the web site for PredictionIO. I did not realise it was a separate project, and I was looking for info in the existing Mahout documentation. I will research it now for our use case.
--
Rafal Lukawiecki
Strategic Consultant and Director
Project
Thanks Ted, Dmitriy.
I'll check the Spectral Clustering as well as the PCA option, but first I want to execute it once with the normal approach.
Here is what I am doing with Mahout 0.7:
1. seqdirectory :
~/mahout-distribution-0.7/bin/mahout seqdirectory -i
/stuti/SSVD/ClusteringInput -o
Maybe someone can clarify this issue but the spectral clustering
implementation assumes an affinity graph, am I correct? Are there direct
ways of going from a list of feature vectors to an affinity matrix in order
to then implement spectral clustering?
On Thu, Aug 1, 2013 at 8:49 AM, Stuti
CALL FOR PARTICIPATION: CHEMDNER task: Chemical compound and drug name
recognition task (see
http://www.biocreative.org/tasks/biocreative-iv/chemdner)
(1) The CHEMDNER task (part of The BioCreative IV competition) is a
community challenge on named entity recognition of chemical compounds. The
So I've got 13,000 text files representing topics in certain newspaper articles.
Each file is just a tab-separated list of topics (so something like "china japan senkaku dispute" or "italy lampedusa immigration").
I want to run k-means clustering on them.
Here's what I do (I'm
Which version of Mahout are you using? Did you check the output? Are you sure that no errors occurred?
Best,
Sebastian
On 01.08.2013 09:59, hahn jiang wrote:
Hi all,
I have a question when I use RecommenderJob for item-based recommendation.
My input data format is userid,itemid,1, so I set
Check examples/bin/cluster-reuters.sh for kmeans (it exists in Mahout 0.7 too :))
You need to specify the clustering option -cl in your kmeans command.
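A hedged sketch of a kmeans invocation with -cl; the paths and parameter values are placeholders modeled on cluster-reuters.sh, not taken from the thread:

```shell
# -cl makes the job run a final classification pass that writes the
# clusteredPoints directory, which clusterdump later needs via -p.
mahout kmeans \
  -i mahout/vectors/tfidf-vectors \
  -c mahout/kmeans-initial \
  -o mahout/kmeans-clusters \
  -dm org.apache.mahout.common.distance.CosineDistanceMeasure \
  -x 10 -k 20 -ow -cl
```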
From: Marco zentrop...@yahoo.co.uk
To: user@mahout.apache.org user@mahout.apache.org
Sent: Thursday,
IIRC the main reasons for deprecating Lanczos was that in contrast to
SSVD, it does not use a constant number of MapReduce jobs and that our
implementation has the constraint that all the resulting vectors have to
fit into the memory of the driver machine.
Best,
Sebastian
On 01.08.2013 12:15,
OK, I did put -cl and got clusteredPoints, but then I run clusterdump and always get "Wrote 0 clusters".
- Original Message -
From: Suneel Marthi suneel_mar...@yahoo.com
To: user@mahout.apache.org user@mahout.apache.org; Marco
zentrop...@yahoo.co.uk
Cc:
Sent: Thursday, 1 August 2013
One trick to getting more mappers on a job when running from the command
line is to pass a '-Dmapred.max.split.size=<size>' argument, where <size> is a
size in bytes. So if you have some hypothetical 10MB input set, but you
want to force ~100 mappers, use '-Dmapred.max.split.size=100'
On Wed, Jul
Oops, I'm sorry. I had one too many zeros there; it should be '-Dmapred.max.split.size=10'.
Just use (input size)/(desired number of mappers).
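The arithmetic can be sketched in shell; the numbers below are the hypothetical 10 MB / 100 mappers example from above, not values from the thread:

```shell
# Compute a max split size that yields roughly the desired mapper count:
# split size = input size in bytes / desired number of mappers.
INPUT_BYTES=10485760   # hypothetical 10 MB input
MAPPERS=100            # desired number of mappers
SPLIT=$((INPUT_BYTES / MAPPERS))
echo "$SPLIT"
# The value would then be passed as -Dmapred.max.split.size=$SPLIT
```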
On Thu, Aug 1, 2013 at 5:49 AM, Stuti Awasthi stutiawas...@hcl.com wrote:
I think there is a problem because of NamedVector; after some searching I found this Jira: https://issues.apache.org/jira/browse/MAHOUT-1067
Note also that this bug is fixed in 0.8.
The original motivation of spectral clustering talks about graphs.
But the idea of clustering the reduced-dimension form of a matrix simply depends on the fact [1] that the metric is approximately preserved by the reduced form, and it is thus applicable to any matrix.
[1] Johnson-Lindenstrauss yet
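For reference, the Johnson-Lindenstrauss guarantee being invoked can be stated as follows (standard form, not specific to Mahout): for any 0 < ε < 1 and any n points in R^d, there is a map f to R^k with k = O(ε⁻² log n) such that for all pairs u, v:

```latex
(1-\varepsilon)\,\lVert u-v\rVert^2
\;\le\; \lVert f(u)-f(v)\rVert^2
\;\le\; (1+\varepsilon)\,\lVert u-v\rVert^2
```

This is why clustering distances computed on the reduced form track those of the original matrix.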
Could you post the command line you are using for clusterdump?
From: Marco zentrop...@yahoo.co.uk
To: user@mahout.apache.org user@mahout.apache.org; Suneel Marthi
suneel_mar...@yahoo.com
Sent: Thursday, August 1, 2013 10:29 AM
Subject: Re: k-means issues
ok i
Galit, yes, this does sound like it is related, and as Matt said, you can test this by setting the max split size on the CLI. I didn't personally find that to be a reliable and efficient method, so I added the -m parameter to my job to set it right every time. It seems that this would be
Not following so…
Here is what I've done, in probably too much detail:
1) ingest raw log files and split them up by action
2) turn these into Mahout preference files using Mahout type IDs, keeping a map
of IDs
3) run the Mahout Item-based recommender using LLR for similarity
4) created a
Hi Sebastian,
I've rechecked the results and, I'm afraid, the issue has not gone away, contrary to yesterday's enthusiastic response. Using 0.8, I have retested with and without the --maxPrefsPerUser 9000 parameter (no user has more than 5000 prefs). I have also supplied the prefs file,
Ok, please file a bug report detailing what you've tested and what results
you got.
Just to clarify, setting maxPrefsPerUser to a high number still does not
help? That surprises me.
2013/8/1 Rafal Lukawiecki ra...@projectbotticelli.com
Hi Sebastian,
I've rechecked the results, and, I'm
Should I have set that parameter to a value much, much larger than the maximum number of preferences actually expressed by a user?
I'm working on an anonymised data set. If it works as an error test case, I'd be happy to share it for your re-test. I am still hoping it is my error, not Mahout's.
Setting it to the maximum number should be enough. Would be great if you
can share your dataset and tests.
2013/8/1 Rafal Lukawiecki ra...@projectbotticelli.com
Should I have set that parameter to a value much much larger than the
maximum number of actually expressed preferences by a user?
On Thu, Aug 1, 2013 at 8:46 AM, Pat Ferrel p...@occamsmachete.com wrote:
For item similarities there is no need to do more than fetch one doc that
contains the similarities, right? I've successfully used this method with
the Mahout recommender but please correct me if something above is
On Thu, Aug 1, 2013 at 7:08 AM, Sebastian Schelter s...@apache.org wrote:
IIRC the main reasons for deprecating Lanczos was that in contrast to
SSVD, it does not use a constant number of MapReduce jobs and that our
implementation has the constraint that all the resulting vectors have to
fit
Say that I am trying to determine which customers buy particular candy bars. So I want to classify training data consisting of candy-bar attributes (an N-dimensional vector of variables) into customer attributes (an M-dimensional vector of customer attributes).
Is there a preferred method when N
mahout clusterdump -d mahout/vectors/dictionary.file-0 -dt sequencefile -i
mahout/kmeans-clusters/clusters-1-final/part-r-0 -n 20 -b 100 -o cdump.txt
-p mahout/kmeans-clusters/clusteredPoints
- Original Message -
From: Suneel Marthi suneel_mar...@yahoo.com
To:
Sorry to be dense but I think there is some miscommunication. The most
important question is: am I writing the item-item similarity matrix DRM out to
Solr, one row = one Solr doc? For the mapreduce Mahout Item-based recommender
this is in tmp/similarityMatrix. If not then please stop me. If I'm
You also need to specify the distance measure '-dm' to clusterdump. This is the distance measure that was used for clustering.
(Again, look at the example in examples/bin/cluster-reuters.sh; it has all the steps you are trying to accomplish.)
From: Marco
The clustering arguments are usually directories, not files. Try:
mahout clusterdump -d mahout/vectors/dictionary.file-0 -dt sequencefile -i
mahout/kmeans-clusters/clusters-1-final -n 20 -b 100 -o cdump.txt -p
mahout/kmeans-clusters/clusteredPoints
On 8/1/13 2:51 PM, Marco wrote:
mahout
Thanks a lot. Will try your suggestions ASAP.
I was sort of following this: http://goo.gl/u8VFZN
- Original Message -
From: Jeff Eastman j...@windwardsolutions.com
To: user@mahout.apache.org
Cc:
Sent: Thursday, 1 August 2013 21:02
Subject: Re: k-means issues
The clustering arguments
Thanks for pointing that out. I corrected the Wiki page.
From: Marco zentrop...@yahoo.co.uk
To: user@mahout.apache.org user@mahout.apache.org
Sent: Thursday, August 1, 2013 3:08 PM
Subject: Re: k-means issues
thanks a lot. will try your suggestions asap.
i
I have talked to one user who had ~60,000 classes, and they were able to use OLR with success.
The way they did this was to arrange the output classes into a multi-level tree. They then trained classifiers at each level of the tree. At any level, if there was a dominating result, then only
On Thu, Aug 1, 2013 at 11:58 AM, Pat Ferrel pat.fer...@gmail.com wrote:
Sorry to be dense but I think there is some miscommunication. The most
important question is: am I writing the item-item similarity matrix DRM out
to Solr, one row = one Solr doc?
Each row = one *field* in a Solr doc.
I am wondering about row/column confusion as well - fleshing out the
doc/design with more specifics (which Pat is kind of doing, basically)
should make things obvious eventually, imo.
The way Pat had phrased it got me wondering what rationale you use to rank the results when you are querying
Yes, storing the similar_items in a field and cross_action_similar_items in another field, all on the same doc, ID'd by item ID. Agree that there may be other fields.
Storing the rows of [B'B] is OK because it's symmetric. However, we did talk about the [B'A] case, and I thought we agreed to store
There's a part of Nathan Halko's dissertation, referenced on the algorithm page, that runs a comparison. In particular, he was not able to compute more than 40 eigenvectors with Lanczos on the Wikipedia dataset. You may refer to that study.
On the accuracy side, it was not observed to be a problem,
The version of Mahout I used is 0.7-cdh4.3.1, and I am sure that no errors occurred. I checked the output, but it has null.
I think the problem is my data set.
Is my item set too small, with only 200 elements?
On Thu, Aug 1, 2013 at 9:57 PM, Sebastian Schelter s...@apache.org wrote:
I would also be fine with keeping it if there is demand. I just proposed to deprecate it, and nobody voted against that at that point in time.
--sebastian
On 02.08.2013 03:12, Dmitriy Lyubimov wrote:
There's a part of Nathan Halko's dissertation referenced on algorithm page
running comparison.
The size should not matter; you should get output. What exactly do you mean by "it has null"?
--sebastian
On 02.08.2013 03:44, hahn jiang wrote:
The version of Mahout I used is 0.7-cdh4.3.1, and I am sure that no errors occurred. I checked the output, but it has null.
I think the problem is my