error running lucene.vectors

2013-08-01 Thread Swami Kevala
I'm running the command mahout lucene.vectors (via Cygwin) on a Solr 4.4 index, using Mahout 0.8, and I'm getting the following error: SEVERE: There are too many documents that do not have a term vector for text Exception in thread "main" java.lang.IllegalStateException: There are too many

Re: Data distribution guidance for recommendation engines

2013-08-01 Thread Sean Owen
On Thu, Aug 1, 2013 at 3:15 AM, Chloe Guszo chloe.gu...@gmail.com wrote: If I split my data into train and test sets, I can show good performance of Good performance according to what metric? It makes a lot of difference whether you are talking about precision/recall or RMSE. the model on the
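
To illustrate why the choice of metric matters, here is a minimal sketch of the two families mentioned above: RMSE scores predicted rating values, while precision/recall scores a ranked top-k list against held-out items. The function names and the toy data in the usage note are assumptions made for this example; this is not Mahout's evaluator API.

```python
import math


def rmse(predicted, actual):
    """Root-mean-square error between predicted and actual ratings."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("need two equal-length, non-empty rating lists")
    squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def precision_recall_at_k(recommended, relevant, k):
    """Precision and recall of the top-k recommendations against held-out relevant items."""
    top_k = recommended[:k]
    hits = sum(1 for item in top_k if item in relevant)
    precision = hits / len(top_k) if top_k else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall
```

For example, rmse([3, 4], [3, 2]) measures how far the predicted ratings are from the true ones, whereas precision_recall_at_k(["a", "b", "c", "d"], {"a", "c", "x"}, 2) only asks whether the right items appear near the top of the list. A model can look good on one and poor on the other.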

Question for RecommenderJob

2013-08-01 Thread hahn jiang
Hi all, I have a question about using RecommenderJob for item-based recommendation. My input data format is userid,itemid,1, so I set the booleanData option to true. There are 9,000,000 users but only 200 items. When I run the RecommenderJob, the result is null. I have tried many times

Re: RecommenderJob Recommending an Item Already Preferred by a User

2013-08-01 Thread Simon Chan
We are building PredictionIO, which helps to handle a number of business-logic rules. Recommending only items for which the user has never expressed a preference is supported. It is a layer on top of Mahout. Hope it is helpful. Simon On Wed, Jul 31, 2013 at 4:57 PM, Ted Dunning ted.dunn...@gmail.com
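
The business rule discussed in this thread (never recommend an item the user has already expressed a preference for) can be sketched as a post-filter over scored candidates. This is only an illustration under assumed names (filter_known_items, scored_items, known_items); it is neither Mahout's nor PredictionIO's code.

```python
def filter_known_items(scored_items, known_items, how_many):
    """Return the top `how_many` (item, score) pairs for items the user has not already preferred.

    scored_items: iterable of (item_id, score) candidate recommendations.
    known_items:  set of item IDs the user has already expressed a preference for.
    """
    unseen = [(item, score) for item, score in scored_items if item not in known_items]
    # Highest score first; sort() is stable, so ties keep their input order.
    unseen.sort(key=lambda pair: pair[1], reverse=True)
    return unseen[:how_many]
```

Applying such a filter after scoring makes it easy to check, on a small test case, whether an already-preferred item ever reaches the final list.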

Re: RecommenderJob Recommending an Item Already Preferred by a User

2013-08-01 Thread Rafal Lukawiecki
Simon, is there any documentation available, or more info on PredictionIO? -- Rafal Lukawiecki Pardon brevity, mobile device. On 1 Aug 2013, at 09:13, Simon Chan simonc...@gmail.com wrote: We are building PredictionIO that helps to handle a number of business logics. Recommending only items

Why is Lanczos deprecated?

2013-08-01 Thread Fernando Fernández
Hi everyone, Sorry if I duplicate the question but I've been looking for an answer and I haven't found an explanation other than it's not being used (together with some other algorithms). If it's been discussed in depth before maybe you can point me to some link with the discussion. I have

Re: RecommenderJob Recommending an Item Already Preferred by a User

2013-08-01 Thread Rafal Lukawiecki
Simon, my apologies for my dumb question. I found the web site for PredictionIO; I did not realise it was a separate project, and I was looking for info in the existing Mahout documentation. I will research it now for our use case. -- Rafal Lukawiecki Strategic Consultant and Director Project

RE: How to SSVD output to generate Clusters

2013-08-01 Thread Stuti Awasthi
Thanks Ted, Dmitriy. I'll check the Spectral Clustering as well as the PCA option, but first I want to execute it once with the normal approach. Here is what I am doing with Mahout 0.7: 1. seqdirectory : ~/mahout-distribution-0.7/bin/mahout seqdirectory -i /stuti/SSVD/ClusteringInput -o

Re: How to SSVD output to generate Clusters

2013-08-01 Thread Chirag Lakhani
Maybe someone can clarify this issue but the spectral clustering implementation assumes an affinity graph, am I correct? Are there direct ways of going from a list of feature vectors to an affinity matrix in order to then implement spectral clustering? On Thu, Aug 1, 2013 at 8:49 AM, Stuti
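
One common way to go from a list of feature vectors to an affinity matrix is a Gaussian (RBF) kernel on pairwise Euclidean distances. The sketch below is a generic illustration, not the input format or API of Mahout's spectral clustering job; the function name, the sigma parameter, and the choice of a zero diagonal (no self-loops) are assumptions.

```python
import math


def gaussian_affinity(vectors, sigma=1.0):
    """Build a symmetric affinity matrix A with A[i][j] = exp(-||xi - xj||^2 / (2 * sigma^2)).

    The diagonal is left at 0.0, i.e. a point has no affinity edge to itself.
    """
    n = len(vectors)
    affinity = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            squared_distance = sum((a - b) ** 2 for a, b in zip(vectors[i], vectors[j]))
            weight = math.exp(-squared_distance / (2.0 * sigma ** 2))
            affinity[i][j] = weight
            affinity[j][i] = weight
    return affinity
```

Nearby points get weights close to 1 and distant points get weights close to 0. The dense matrix costs O(n^2) memory, so for large inputs a sparse variant (for example, keeping only each point's nearest neighbours) would be needed.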

CHEMDNER CFP and training data

2013-08-01 Thread Martin Krallinger
CALL FOR PARTICIPATION: CHEMDNER task: Chemical compound and drug name recognition task (see http://www.biocreative.org/tasks/biocreative-iv/chemdner) (1) The CHEMDNER task (part of The BioCreative IV competition) is a community challenge on named entity recognition of chemical compounds. The

k-means issues

2013-08-01 Thread Marco
So I've got 13000 text files representing topics in certain newspaper articles. Each file is just a tab-separated list of topics (so something like china    japan    senkaku    dispute or italy   lampedusa   immigration). I want to run k-means clustering on them. Here's what I do (I'm

Re: Question for RecommenderJob

2013-08-01 Thread Sebastian Schelter
Which version of Mahout are you using? Did you check the output, are you sure that no errors occur? Best, Sebastian On 01.08.2013 09:59, hahn jiang wrote: Hi all, I have a question when I use RecommenderJob for item-based recommendation. My input data format is userid,itemid,1, so I set

Re: k-means issues

2013-08-01 Thread Suneel Marthi
Check examples/bin/cluster-reuters.sh for kmeans (it exists in Mahout 0.7 too :)) You need to specify the clustering option -cl in your kmeans command. From: Marco zentrop...@yahoo.co.uk To: user@mahout.apache.org user@mahout.apache.org Sent: Thursday,

Re: Why is Lanczos deprecated?

2013-08-01 Thread Sebastian Schelter
IIRC the main reasons for deprecating Lanczos were that, in contrast to SSVD, it does not use a constant number of MapReduce jobs and that our implementation has the constraint that all the resulting vectors have to fit into the memory of the driver machine. Best, Sebastian On 01.08.2013 12:15,

Re: k-means issues

2013-08-01 Thread Marco
OK, I added -cl and got clusteredPoints, but when I then run clusterdump I always get "Wrote 0 clusters". - Original message - From: Suneel Marthi suneel_mar...@yahoo.com To: user@mahout.apache.org user@mahout.apache.org; Marco zentrop...@yahoo.co.uk Cc: Sent: Thursday, 1 August 2013

Re: Modify number of mappers for a mahout process?

2013-08-01 Thread Matt Molek
One trick to getting more mappers on a job when running from the command line is to pass a '-Dmapred.max.split.size=<value>' argument. The <value> is a size in bytes. So if you have some hypothetical 10MB input set, but you want to force ~100 mappers, use '-Dmapred.max.split.size=1000000' On Wed, Jul

Re: Modify number of mappers for a mahout process?

2013-08-01 Thread Matt Molek
Oops, I'm sorry. I had one too many zeros there, should be '-Dmapred.max.split.size=100000' Just (input size)/(desired number of mappers)
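
The rule of thumb (input size)/(desired number of mappers) can be worked through with a small helper. This is only a sketch of the arithmetic: the function names are made up for this example, and 10 MB is taken to mean 10,000,000 bytes.

```python
def max_split_size_bytes(input_size_bytes, desired_mappers):
    """Split size in bytes that yields roughly `desired_mappers` input splits."""
    if input_size_bytes <= 0 or desired_mappers <= 0:
        raise ValueError("input size and mapper count must be positive")
    # Integer division; never return 0, which would not be a usable split size.
    return max(1, input_size_bytes // desired_mappers)


def split_size_flag(input_size_bytes, desired_mappers):
    """Format the result as the -D argument mentioned in the thread."""
    return "-Dmapred.max.split.size=%d" % max_split_size_bytes(input_size_bytes, desired_mappers)
```

With a 10,000,000-byte input and 100 desired mappers, split_size_flag(10_000_000, 100) gives a split size of 100,000 bytes. The actual number of mappers also depends on the input format and block layout, so treat the result as an approximation.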

Re: How to SSVD output to generate Clusters

2013-08-01 Thread Ted Dunning
On Thu, Aug 1, 2013 at 5:49 AM, Stuti Awasthi stutiawas...@hcl.com wrote: I think there is a problem because of NamedVector as after some search I get this Jira. https://issues.apache.org/jira/browse/MAHOUT-1067 Note also that this bug is fixed in 0.8

Re: How to SSVD output to generate Clusters

2013-08-01 Thread Ted Dunning
The original motivation of spectral clustering talks about graphs. But the idea of clustering the reduced dimension form of a matrix simply depends on the fact[1] that the metric is approximately preserved by the reduced form and is thus applicable to any matrix. [1] Johnson-Lindenstrauss yet
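
The distance-preservation property referred to here can be checked numerically with a plain Gaussian random projection. The sketch below is a generic demonstration of the Johnson-Lindenstrauss idea, not Mahout's SSVD; the function names, dimensions, and seeds are assumptions chosen for illustration.

```python
import math
import random


def random_projection(vectors, target_dim, seed=0):
    """Project vectors to `target_dim` dimensions with a random Gaussian matrix.

    Entries are N(0, 1) scaled by 1/sqrt(target_dim), so squared Euclidean
    distances are preserved in expectation.
    """
    rng = random.Random(seed)
    source_dim = len(vectors[0])
    scale = 1.0 / math.sqrt(target_dim)
    matrix = [[rng.gauss(0.0, 1.0) * scale for _ in range(source_dim)]
              for _ in range(target_dim)]
    return [[sum(r * x for r, x in zip(row, vector)) for row in matrix]
            for vector in vectors]


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))
```

Comparing euclidean() on a pair of points before and after random_projection() should give a ratio close to 1 once target_dim is in the low hundreds, which is why clustering the reduced form is a reasonable stand-in for clustering the original matrix.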

Re: k-means issues

2013-08-01 Thread Suneel Marthi
Could you post the command line you are using for clusterdump? From: Marco zentrop...@yahoo.co.uk To: user@mahout.apache.org user@mahout.apache.org; Suneel Marthi suneel_mar...@yahoo.com Sent: Thursday, August 1, 2013 10:29 AM Subject: Re: k-means issues ok i

Re: Modify number of mappers for a mahout process?

2013-08-01 Thread Ryan Josal
Galit, yes this does sound like this is related, and as Matt said, you can test this by setting the max split size on the CLI. I didn't personally find this to be a reliable and efficient method, so I wrote the -m parameter to my job to set it right every time. It seems that this would be

Re: Setting up a recommender

2013-08-01 Thread Pat Ferrel
Not following so… Here is what I've done, in probably too much detail: 1) ingest raw log files and split them up by action 2) turn these into Mahout preference files using Mahout type IDs, keeping a map of IDs 3) run the Mahout Item-based recommender using LLR for similarity 4) created a

Re: RecommenderJob Recommending an Item Already Preferred by a User

2013-08-01 Thread Rafal Lukawiecki
Hi Sebastian, I've rechecked the results, and, I'm afraid that the issue has not gone away, contrary to my yesterday's enthusiastic response. Using 0.8 I have retested with and without --maxPrefsPerUser 9000 parameter (no user has more than 5000 prefs). I have also supplied the prefs file,

Re: RecommenderJob Recommending an Item Already Preferred by a User

2013-08-01 Thread Sebastian Schelter
Ok, please file a bug report detailing what you've tested and what results you got. Just to clarify, setting maxPrefsPerUser to a high number still does not help? That surprises me. 2013/8/1 Rafal Lukawiecki ra...@projectbotticelli.com Hi Sebastian, I've rechecked the results, and, I'm

Re: RecommenderJob Recommending an Item Already Preferred by a User

2013-08-01 Thread Rafal Lukawiecki
Should I have set that parameter to a value much much larger than the maximum number of actually expressed preferences by a user? I'm working on an anonymised data set. If it works as an error test case, I'd be happy to share it for your re-test. I am still hoping it is my error, not Mahout's.

Re: RecommenderJob Recommending an Item Already Preferred by a User

2013-08-01 Thread Sebastian Schelter
Setting it to the maximum number should be enough. Would be great if you can share your dataset and tests. 2013/8/1 Rafal Lukawiecki ra...@projectbotticelli.com Should I have set that parameter to a value much much larger than the maximum number of actually expressed preferences by a user?

Re: Setting up a recommender

2013-08-01 Thread Ted Dunning
On Thu, Aug 1, 2013 at 8:46 AM, Pat Ferrel p...@occamsmachete.com wrote: For item similarities there is no need to do more than fetch one doc that contains the similarities, right? I've successfully used this method with the Mahout recommender but please correct me if something above is

Re: Why is Lanczos deprecated?

2013-08-01 Thread Jake Mannix
On Thu, Aug 1, 2013 at 7:08 AM, Sebastian Schelter s...@apache.org wrote: IIRC the main reasons for deprecating Lanczos was that in contrast to SSVD, it does not use a constant number of MapReduce jobs and that our implementation has the constraint that all the resulting vectors have to fit

multi-class classification question

2013-08-01 Thread yikes aroni
Say that I am trying to determine which customers buy particular candy bars. So I want to classify training data consisting of candy bar attributes (an N dimensional vector of variables) into customer attributes (an M dimensional vector of customer attributes). Is there a preferred method when N

Re: k-means issues

2013-08-01 Thread Marco
 mahout clusterdump -d mahout/vectors/dictionary.file-0 -dt sequencefile -i mahout/kmeans-clusters/clusters-1-final/part-r-00000 -n 20 -b 100 -o cdump.txt -p mahout/kmeans-clusters/clusteredPoints - Original message - From: Suneel Marthi suneel_mar...@yahoo.com To:

Re: Setting up a recommender

2013-08-01 Thread Pat Ferrel
Sorry to be dense but I think there is some miscommunication. The most important question is: am I writing the item-item similarity matrix DRM out to Solr, one row = one Solr doc? For the mapreduce Mahout Item-based recommender this is in tmp/similarityMatrix. If not then please stop me. If I'm

Re: k-means issues

2013-08-01 Thread Suneel Marthi
You also need to specify the distance measure '-dm' to clusterdump. This is the distance measure that was used for clustering. (Again, look at the example in /examples/bin/cluster-reuters.sh; it has all the steps you are trying to accomplish) From: Marco

Re: k-means issues

2013-08-01 Thread Jeff Eastman
The clustering arguments are usually directories, not files. Try: mahout clusterdump -d mahout/vectors/dictionary.file-0 -dt sequencefile -i mahout/kmeans-clusters/clusters-1-final -n 20 -b 100 -o cdump.txt -p mahout/kmeans-clusters/clusteredPoints On 8/1/13 2:51 PM, Marco wrote: mahout

Re: k-means issues

2013-08-01 Thread Marco
Thanks a lot. Will try your suggestions ASAP. I was sort of following this http://goo.gl/u8VFZN - Original message - From: Jeff Eastman j...@windwardsolutions.com To: user@mahout.apache.org Cc: Sent: Thursday, 1 August 2013 21:02 Subject: Re: k-means issues The clustering arguments

Re: k-means issues

2013-08-01 Thread Suneel Marthi
Thanks for pointing that out. I corrected the Wiki page. From: Marco zentrop...@yahoo.co.uk To: user@mahout.apache.org user@mahout.apache.org Sent: Thursday, August 1, 2013 3:08 PM Subject: Re: k-means issues thanks a lot. will try your suggestions asap. i

Re: multi-class classification question

2013-08-01 Thread Ted Dunning
I have talked to one user who had ~60,000 classes and they were able to use OLR with success. The way that they did this was to arrange the output classes into a multi-level tree. Then they trained classifiers at each level of the tree. At any level, if there was a dominating result, then only

Re: Setting up a recommender

2013-08-01 Thread Ted Dunning
On Thu, Aug 1, 2013 at 11:58 AM, Pat Ferrel pat.fer...@gmail.com wrote: Sorry to be dense but I think there is some miscommunication. The most important question is: am I writing the item-item similarity matrix DRM out to Solr, one row = one Solr doc? Each row = one *field* in a Solr doc.

Re: Setting up a recommender

2013-08-01 Thread B Lyon
I am wondering about row/column confusion as well - fleshing out the doc/design with more specifics (which Pat is kind of doing, basically) should make things obvious eventually, imo. The way Pat had phrased it got me to wondering what rationale you use to rank the results when you are querying

Re: Setting up a recommender

2013-08-01 Thread Pat Ferrel
Yes, storing the similar_items in a field, cross_action_similar_items in another field, all on the same doc identified by item ID. Agree that there may be other fields. Storing the rows of [B'B] is ok because it's symmetric. However we did talk about the [B'A] case and I thought we agreed to store
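
The layout under discussion (one doc per item ID, with each matrix row stored as its own field) can be sketched as plain dictionaries before anything is sent to Solr. The field names similar_items and cross_action_similar_items come from the message above; the "id" key, the space-separated encoding, and the function name are assumptions, and no Solr client API is used here.

```python
def build_item_docs(similar_rows, cross_action_rows):
    """Merge one row from each matrix into a single doc per item ID.

    similar_rows / cross_action_rows: dict mapping item ID -> list of similar item IDs.
    Each row becomes one space-separated field on the doc for that item.
    """
    docs = {}
    sources = (("similar_items", similar_rows),
               ("cross_action_similar_items", cross_action_rows))
    for field_name, rows in sources:
        for item_id, neighbours in rows.items():
            doc = docs.setdefault(item_id, {"id": item_id})
            doc[field_name] = " ".join(neighbours)
    # Sort by item ID so the output order is deterministic.
    return [docs[item_id] for item_id in sorted(docs)]
```

An item that appears in only one of the two matrices simply gets a doc with a single similarity field, which leaves room for other fields to be added later.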

Re: Why is Lanczos deprecated?

2013-08-01 Thread Dmitriy Lyubimov
There's a part of Nathan Halko's dissertation, referenced on the algorithm page, that runs a comparison. In particular, he was not able to compute more than 40 eigenvectors with Lanczos on the Wikipedia dataset. You may refer to that study. On the accuracy part, it was not observed that it was a problem,

Re: Question for RecommenderJob

2013-08-01 Thread hahn jiang
The version of Mahout I used is 0.7-cdh4.3.1, and I am sure that no errors occur. I checked the output, but it is null. I think the problem is my data set. Is my item set, with only 200 elements, too small? On Thu, Aug 1, 2013 at 9:57 PM, Sebastian Schelter s...@apache.org wrote:

Re: Why is Lanczos deprecated?

2013-08-01 Thread Sebastian Schelter
I would also be fine with keeping if there is demand. I just proposed to deprecate it and nobody voted against that at that point in time. --sebastian On 02.08.2013 03:12, Dmitriy Lyubimov wrote: There's a part of Nathan Halko's dissertation referenced on algorithm page running comparison.

Re: Question for RecommenderJob

2013-08-01 Thread Sebastian Schelter
The size should not matter; you should get output. What exactly do you mean by "it has null"? --sebastian On 02.08.2013 03:44, hahn jiang wrote: The version of Mahout which I used is 0.7-cdh4.3.1 and I am sure that no errors occur. I check the output but it has null. I think the problem is my