Re: Number of Reducers in PFP Growth is always 1 !!!

2012-08-30 Thread Sean Owen
Block size and input size should not matter for the Reducer. You do have to explicitly say the number of workers. It defaults to 1. You do set it with just these methods. Make sure you are setting on the right object and before you run. Look for other things that may be overriding it. I don't

Re: java.lang.NoClassDefFoundError: org/apache/commons/cli2/Option

2012-08-26 Thread Sean Owen
The JAR you ship to Hadoop needs to have all the required class files including third-party dependencies. Right now you're just sending it Mahout classes. Use the .job file that is built by the Maven targets. mvn package should make them. That has all the dependencies packaged up. On Sun, Aug 26,

Re: Can someone suggest an approach for calculating precision and recall for distributed recommendations?

2012-08-26 Thread Sean Owen
It's the same idea, but yes you'd have to re-implement it for Hadoop. Randomly select a subset of users. Identify a small number of most-preferred items for that user -- perhaps the video(s) watched most often. Hold these data points out as a test set. Run your process on all the rest. Make

Re: Can someone suggest an approach for calculating precision and recall for distributed recommendations?

2012-08-26 Thread Sean Owen
Most watched by that particular user. The issue is that the recommender is trying to answer, of all items the user has not interacted with, which is the user most likely to interact with? So the 'right answers' to the quiz it gets ought to be answers to this question. That is why the test data

Re: Can someone suggest an approach for calculating precision and recall for distributed recommendations?

2012-08-26 Thread Sean Owen
interesting. This would seem to work well with our Boolean dataset. We will give this a try. Thanks again for the help. -Jonathan On Sun, Aug 26, 2012 at 3:55 PM, Sean Owen sro...@gmail.com wrote: Most watched by that particular user. The issue is that the recommender is trying to answer

Re: Question regarding correlation value produced by UncenteredCosineSimilarity

2012-08-22 Thread Sean Owen
a mean of zero by nature. Thanks for your time on this question and all of your efforts on Mahout -- it's a great project. best, Francis On Wed, Aug 22, 2012 at 5:11 PM, Sean Owen sro...@gmail.com wrote: The similarity is only defined over the dimensions where both series have a value, yes
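
A minimal sketch of the point quoted above, that the similarity is defined only over the dimensions where both series have a value. This is plain Java, not Mahout's UncenteredCosineSimilarity code; the class name `CoRatedCosine` and the map-based representation of a rating series are assumptions made for illustration.

```java
import java.util.Map;

// Hypothetical helper, not part of Mahout.
final class CoRatedCosine {

    // Uncentered cosine between two sparse rating series, computed only over
    // the item IDs that appear in both maps. No mean is subtracted.
    static double similarity(Map<Long, Double> a, Map<Long, Double> b) {
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (Map.Entry<Long, Double> entry : a.entrySet()) {
            Double other = b.get(entry.getKey());
            if (other == null) {
                continue; // dimension present in only one series: ignored
            }
            double x = entry.getValue();
            double y = other;
            dot += x * y;
            normA += x * x;
            normB += y * y;
        }
        if (normA == 0.0 || normB == 0.0) {
            return Double.NaN; // no overlap, or all-zero overlap: undefined
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}
```

Note that the norms are also restricted to the shared dimensions, so two series that are proportional on their overlap score close to 1 regardless of what else each one contains.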

Re: Deploying Mahout

2012-08-17 Thread Sean Owen
MapReduce programs are never installed directly on a Hadoop cluster. Hadoop deploys the program JAR to workers as needed. This is not specific to Mahout. Mahout compiles against 0.20.205 and so needs to be used with 0.20.205. It will work with 1.0.3 as far as I know, with a recompile, as they are

Re: How good recommendations and precision works

2012-08-09 Thread Sean Owen
Hi Ziad, I did answer your last question on this list -- don't see this one previously though. The relevant items are chosen as those whose pref value exceeds some given threshold. The default threshold is the mean of all 100 pref values plus one standard deviation. Assuming the prefs are about
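
The default threshold described above (mean of the user's pref values plus one standard deviation) can be sketched as follows. This is an illustration only: the class name `RelevanceThreshold` is hypothetical, and the use of the population standard deviation (dividing by n rather than n - 1) is an assumption, since the message does not say which one Mahout uses.

```java
// Hypothetical helper, not part of Mahout.
final class RelevanceThreshold {

    // Returns mean + one standard deviation of the given pref values.
    // Assumption: population standard deviation (divide by n).
    static double meanPlusOneStdDev(double[] prefs) {
        if (prefs.length == 0) {
            throw new IllegalArgumentException("no preference values");
        }
        double sum = 0.0;
        for (double p : prefs) {
            sum += p;
        }
        double mean = sum / prefs.length;

        double squaredDiffs = 0.0;
        for (double p : prefs) {
            squaredDiffs += (p - mean) * (p - mean);
        }
        double stdDev = Math.sqrt(squaredDiffs / prefs.length);
        return mean + stdDev;
    }
}
```

Items whose pref value exceeds the returned number would then be treated as the relevant set for that user.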

Re: How good recommendations and precision works

2012-08-09 Thread Sean Owen
The relevant items, the top 16, are a set. You find how many of the recommendations fall in that set. For precision, ordering does not matter. You are right that the metric kind of falls apart for users with very few data points. You want to use precision at a small number, and perhaps ignore the
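
A small sketch of precision computed the way the message describes it: the relevant items form a set, and the metric counts how many of the recommendations fall in that set, regardless of order. The class name `PrecisionAtN` is hypothetical, and dividing by the number of recommendations actually considered (rather than always by n) is an assumption.

```java
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of Mahout.
final class PrecisionAtN {

    // Fraction of the first n recommendations that are in the relevant set.
    // The order of the recommendations within the first n does not matter.
    static double precision(List<Long> recommended, Set<Long> relevant, int n) {
        int considered = Math.min(n, recommended.size());
        if (considered == 0) {
            return Double.NaN; // nothing recommended: precision is undefined
        }
        int hits = 0;
        for (int i = 0; i < considered; i++) {
            if (relevant.contains(recommended.get(i))) {
                hits++;
            }
        }
        return (double) hits / considered;
    }
}
```

The NaN case is the situation the message alludes to: with very few data points there may be nothing meaningful to measure for a user.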

Re: How good recommendations and precision works

2012-08-09 Thread Sean Owen
Yes, or else those items would not be eligible for recommendation. And it would be like giving students the answers to a test before the test. On Thu, Aug 9, 2012 at 5:41 PM, ziad kamel ziad.kame...@gmail.com wrote: A related question please. Does Mahout remove the 16% good items before

Re: How good recommendations and precision works

2012-08-09 Thread Sean Owen
are the recommended approaches to evaluate the results? I assume the IR approach is one of them. I highly appreciate your help, Sean. On Thu, Aug 9, 2012 at 11:45 AM, Sean Owen sro...@gmail.com wrote: Yes, or else those items would not be eligible for recommendation. And it would be like giving

Re: How good recommendations and precision works

2012-08-09 Thread Sean Owen
in a classifier? Does that mean a recommender becomes a classifier in this case? On Thu, Aug 9, 2012 at 12:18 PM, Sean Owen sro...@gmail.com wrote: Yes, this is a definite weakness of the precision test as applied to recommenders. It is somewhat flawed; it is easy to apply and has some use

Re: how to deal with mutiple preference values for same (user, item)-pair

2012-08-07 Thread Sean Owen
It depends on what the values really mean. If they are something like ratings, using the most recent version makes most sense. (This is what the implementations do now.) If they are some kind of sampled reading it might make sense to take an average. If the input is based on observed activity, it

Re: question about distributed recommendations

2012-08-04 Thread Sean Owen
Yes, or anywhere else you want to publish static results to, if you don't want to expose HDFS. HDFS isn't good at small random reads, so it would be a question of bulk-loading shards of results. The MapReduce workers are not relevant to serving. They would have produced the results, offline, at

Re: MIA graphs

2012-08-03 Thread Sean Owen
(You can ask in the book forum if it is specific to the book rather than the project. Maybe I can follow up with you directly off list.) Which graph are you referring to? I made them in PowerPoint if I recall correctly, nothing too exotic. On Thu, Aug 2, 2012 at 8:52 PM, Matt Mitchell

Re: Seq2sparse example produces bad TFIDF vectors while TF vectors are Ok.

2012-08-03 Thread Sean Owen
This sounds a lot like a bug that was fixed by a patch some time ago. Grant, I think it was something I had wanted you to double-check, not sure if you had a look. But I think it was fixed if it's the same issue. On Thu, Aug 2, 2012 at 8:44 AM, Abramov Pavel p.abra...@rambler-co.ru wrote: Thanks

Re: duplicate preferences

2012-08-03 Thread Sean Owen
It overrides older values. Here it would have no effect.

Re: mahout and hadoop configuration question

2012-08-03 Thread Sean Owen
I don't see an error here...? The warning is an ignorable message from Hadoop. On Fri, Aug 3, 2012 at 4:56 PM, Sears Merritt sears.merr...@gmail.com wrote: Hi All, I'm trying to run a kmeans job using mahout 0.8 on my hadoop cluster (Cloudera's 0.20.2-cdh3u3) and am running into an odd

Re: mahout and hadoop configuration question

2012-08-03 Thread Sean Owen
:597) at org.apache.hadoop.util.RunJar.main(RunJar.java:197) On Aug 3, 2012, at 3:00 PM, Sean Owen sro...@gmail.com wrote: I don't see an error here...? the warning is an ignorable message from hadoop. On Fri, Aug 3, 2012 at 4:56 PM, Sears Merritt sears.merr...@gmail.com wrote

Re: question about distributed recommendations

2012-08-03 Thread Sean Owen
Is it reasonable to use 1.5GB of heap for recs? Sure -- assuming you can allow the JVM to use, say, 2GB or more of heap total. There are more choices in Mahout for non-distributed recs. The primary distributed version is an item-similarity-based approach but you can choose from several similarity

Re: question about distributed recommendations

2012-08-03 Thread Sean Owen
Good good question. One straightforward way to approach things is to compute all recommendations offline, in batch, and publish them to some location, and then simply read them as needed. Yes your front-end would need to access HDFS if the data were on HDFS. The downside is that you can't update

Re: Problem running grouplens example

2012-08-03 Thread Sean Owen
to validate POM for project org.apache.mahout:mahout-integration at /home/sweiss/mahout-distribution-0.7/integration/pom.xml ... -- Steve On Mon, Jul 30, 2012 at 5:17 PM, Sean Owen sro...@gmail.com wrote: Hmm. what happens if you add this to dependencies in integration? dependency

Re: Question about recommender database drivers

2012-08-02 Thread Sean Owen
The backing store doesn't matter much, in the sense that using it for real-time computation needs it to all end up in memory anyway. It can live wherever you want before that, like Solr. It's not going to be feasible to run anything in real-time off Solr or any other store. Yes the trick is to use

Re: Clustering or Classification?

2012-08-01 Thread Sean Owen
Classifiers are supervised learning algorithms, so you need to provide a bunch of examples of positive and negative classes. In your example, it would be fine to label a bunch of articles as about Apple or not, then use feature vectors derived from TF-IDF as input, with these labels, to train a

Re: Clustering or Classification?

2012-08-01 Thread Sean Owen
for the clarification. So you are saying that Mahout is not suitable in this case or did you say clustering is not the right way to go and If its worth it, I should go for classification? Secondly are you the same Sean Owen who wrote Mahout in Action? :)

Re: MongoDBDataModel doesn't work?

2012-08-01 Thread Sean Owen
If the data is 'really' there in the DataModel you seem to have ruled out all the differences. ;) I imagine there is something slightly amiss. Can you step through with a debugger to see what the UserSimilarity calculates? Look at what data it gets and see if it makes sense. If it seems to,

Re: Unable to find KMeans Cluster class

2012-08-01 Thread Sean Owen
That may be a typo in the book. I don't know if it was non-abstract in the past. But try against version 0.5 to be sure. I don't know what the replacement code is if so but someone else here likely does. On Wed, Aug 1, 2012 at 9:20 PM, Abhinav M Kulkarni abhinavkulka...@gmail.com wrote: Hi,

Re: UUID based user IDs

2012-08-01 Thread Sean Owen
Yep, just hash to a long, from UUID or String or whatever. The occasional collision does not cause a real problem. If you mix the tastes of two users or items once in a billion times, the overall results will hardly be different. You have to maintain the reverse mapping of course. Look at the

Re: UUID based user IDs

2012-08-01 Thread Sean Owen
No, but I'd recommend XORing the top 64 bits with the bottom 64 bits, something simple like that. On Wed, Aug 1, 2012 at 9:40 PM, Matt Mitchell goodie...@gmail.com wrote: Thanks Sean! That all makes sense. Would you mind recommending a hashing function for this? Is there something in Mahout I
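
The suggestion above (XOR the top 64 bits of the UUID with the bottom 64 bits, and keep a reverse mapping) might look like this using only `java.util.UUID`. The class name `UuidIds` and the in-memory `HashMap` used for the reverse mapping are assumptions for illustration; a real deployment would likely persist the mapping somewhere.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

// Hypothetical helper, not part of Mahout.
final class UuidIds {

    // Reverse mapping from the hashed long back to the original UUID.
    private final Map<Long, UUID> reverse = new HashMap<>();

    // Folds a 128-bit UUID into a 64-bit long by XORing its two halves.
    static long toLong(UUID id) {
        return id.getMostSignificantBits() ^ id.getLeastSignificantBits();
    }

    // Hashes the UUID and remembers it so the ID can be translated back.
    // On the rare collision the first UUID registered keeps the slot.
    long register(UUID id) {
        long key = toLong(id);
        reverse.putIfAbsent(key, id);
        return key;
    }

    // Returns the UUID registered for this long ID, or null if unknown.
    UUID lookup(long key) {
        return reverse.get(key);
    }
}
```

The long returned by `register` is what would be used as the user or item ID; `lookup` translates recommendation results back to the original identifiers.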

Re: Re: mahout lib : permissions

2012-07-31 Thread Sean Owen
Mahout currently works against 0.20.205. I do not know if it still works with 0.20.2. You should not have to downgrade. The very first thing is to use Cygwin, not the Windows shell. On Tue, Jul 31, 2012 at 9:23 AM, Julian Ortega julian.ort...@fredhopper.com wrote: Not exactly, you will still

Re: non-text NB classifiers?

2012-07-31 Thread Sean Owen
I don't know this code too much, but, there is simply a step in front I believe that vectorizes text with TF-IDF. The result are simple vectors. You could just inject your vectors (i.e. real-value attributes) at that stage and skip the TF-IDF. It may need a little hacking. On Tue, Jul 31, 2012 at

Re: Extracting data from websites

2012-07-30 Thread Sean Owen
Extract as in web crawl? No it's nothing to do with that. Extract as in entity extraction? I don't think there are relevant implementations here either, though that begins to border on machine learning. This is more about clustering and classification of documents than anything else. On Mon, Jul

Re: Problem running grouplens example

2012-07-30 Thread Sean Owen
Hmm. What happens if you add this to dependencies in integration? <dependency> <groupId>${project.groupId}</groupId> <artifactId>mahout-examples</artifactId> </dependency> On Mon, Jul 30, 2012 at 9:59 PM, Stephen Weiss swe...@stylesight.com wrote: Hi, I am just getting started with

Re: performance study

2012-07-27 Thread Sean Owen
Are you basically asking how much faster a parallel algorithm is than non-parallel? If you're measuring wall-clock time, the answer depends on how many workers/threads you use to parallelize. The point is the time generally goes down as more workers are added, so there's not one answer. If

Re: mongoDB and mahout

2012-07-26 Thread Sean Owen
If you are doing something using Hadoop -- then the question is really, can you use MongoDB as a data source for Hadoop? I'm sure someone has made an InputFormat for it, yes. Mahout itself doesn't connect to MongoDB, it uses Hadoop, which may get data from many sources. If you're not using

Re: Mahout Performance Issues with Item Based Recommender

2012-07-25 Thread Sean Owen
Look at SamplingCandidateItemsStrategy and its arguments. These are the knobs you can turn to reduce the amount of data considered. You might start with something low like 10 for each of the first 3 args. You can set this on an ItemBasedRecommender once configured. On Tue, Jul 24, 2012 at 11:05

Re: recommender for text

2012-07-24 Thread Sean Owen
This sounds more like a clustering problem to me -- find a centroid, find which cluster a new article belongs to. On Tue, Jul 24, 2012 at 11:04 AM, Alexander Aristov alexander.aris...@gmail.com wrote: yes, good point. What I want to reach is to calculate some average of a group of articles

Re: Mahout Performance Issues with Item Based Recommender

2012-07-24 Thread Sean Owen
Unless your data set is tiny (thousands of users / items), you can't really run straight off a database. It is far too data intensive. Real-time always means in memory to me. Look at the ReloadFromJDBCDataModel wrapper, which will cache the DB data in memory. This should be orders of magnitude

Re: Mahout Performance Issues with Item Based Recommender

2012-07-24 Thread Sean Owen
Hmm, that doesn't sound right. This isn't all that big for data. Any chance you've run a profiler to see the hotspot? My guess is that you need to set a CandidateItemsStrategy to cut down the number of items considered. On Tue, Jul 24, 2012 at 10:36 PM, Jonathan Nassau jonathan.nas...@gmail.com

Re: : Visualize clusters

2012-07-23 Thread Sean Owen
(Assuming that's 'Mahout in Action' but filtered through iPhone auto-correct...) On Mon, Jul 23, 2012 at 7:04 PM, Alexander Aristov alexander.aris...@gmail.com wrote: Read Nagpur in action: ) Alexander On 23.07.2012 21:53, user Wei Shung Chung weish...@gmail.com wrote: Hi my

Re: no item recommended by using MongoDBDataModel

2012-07-23 Thread Sean Owen
From this, I don't have any good ideas. I think you would need to dig in with a debugger. First, determine whether the DataModel actually has the data. I am guessing it does not.

Re: MySQL JDBC performance optimization

2012-07-19 Thread Sean Owen
Hmm, call refresh() on reloadModel after it's set up? On Thu, Jul 19, 2012 at 11:54 AM, Nick Katsipoulakis popa...@gmail.com wrote: On 07/18/2012 11:56 PM, Sean Owen wrote: Unless your data set is tiny, like 100K records or less, it is not going to be feasible to run recommendations off

Re: MySQL JDBC performance optimization

2012-07-19 Thread Sean Owen
Oh, that means it's still initializing then. It does take a while to read all that info from the DB potentially. On Thu, Jul 19, 2012 at 2:57 PM, Nick Katsipoulakis popa...@gmail.com wrote: On 07/19/2012 02:50 PM, Sean Owen wrote: Hmm, call refresh() on reloadModel after it's set up?

Re: Adding new users at runtime?

2012-07-18 Thread Sean Owen
Sure, override refresh()? Yes, call refresh() to make it run when you want. On Wed, Jul 18, 2012 at 1:31 AM, Matt Mitchell goodie...@gmail.com wrote: Thanks Sean. This makes sense. I'll see how far I can get with the anonymous user. I wonder, is there any way to hook into when the refresh

Re: A question on mahout

2012-07-18 Thread Sean Owen
Without the denominator, the prediction is not a weighted average -- it's some kind of weighted sum. The values will not be in nearly the same range as the input ratings -- might be in the thousands. It's not a prediction anymore. You can rank on it, but it will just favor items that co-occur with
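
To make the difference concrete, here is a small sketch comparing a weighted sum with a weighted average of ratings. The class name `WeightedPrediction` is hypothetical, and normalizing by the sum of the absolute similarity weights is an assumption about what "the denominator" refers to.

```java
// Hypothetical helper, not part of Mahout.
final class WeightedPrediction {

    // Sum of similarity * rating. Grows with the number of terms, so it is
    // not on the same scale as the input ratings.
    static double weightedSum(double[] similarities, double[] ratings) {
        double total = 0.0;
        for (int i = 0; i < ratings.length; i++) {
            total += similarities[i] * ratings[i];
        }
        return total;
    }

    // Weighted sum divided by the total absolute weight, which brings the
    // result back into the range of the input ratings.
    static double weightedAverage(double[] similarities, double[] ratings) {
        double totalWeight = 0.0;
        for (double s : similarities) {
            totalWeight += Math.abs(s);
        }
        if (totalWeight == 0.0) {
            return Double.NaN; // no usable weights: no prediction
        }
        return weightedSum(similarities, ratings) / totalWeight;
    }
}
```

With similarities {0.5, 1.0, 0.5} and ratings {4, 5, 3}, the weighted sum is 8.5, outside a 1-to-5 rating scale, while the weighted average is 4.25.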

Re: Need help on Mahout Recommendation - using Preferences

2012-07-18 Thread Sean Owen
If you mean, the user says I like Drama and you return to them Dramas, sure you can do that -- it's not a recommender then. It's not personalized. It's very easy, and may be useful. If you mean, can you prioritize Dramas in recommendations, then, as I've said several times: use the Rescorer! It

Re: question about input format for bayes classification

2012-07-18 Thread Sean Owen
Cardinality is the logical size of the vector, its number of dimensions. You can only add vectors with the same cardinality -- it's not defined what the result is to add, say, a 2D and 3D vector. So yes this vector needs to have a cardinality equal to number of features, it seems. On Wed, Jul 18,

Re: Javadoc for PlusAnonymousConcurrentUserDataModel

2012-07-18 Thread Sean Owen
Yes that's right, I'll change the docs. On Wed, Jul 18, 2012 at 4:52 PM, Eyal Allweil eyal_allw...@yahoo.com wrote: Hello everyone, I think there's a mistake in the javadoc for PlusAnonymousConcurrentUserDataModel. Under the code sample for real time recommendation, it says

Re: MySQL JDBC performance optimization

2012-07-18 Thread Sean Owen
: Hi Owen, is it possible to connect mahout with heterogeneous (parallel database)? Is there some connector which could facilitate these? thanks in advance. -Rizki- On Thu, Jul 19, 2012 at 5:56 AM, Sean Owen sro...@gmail.com wrote: Unless your data set is tiny, like 100K records or less

Re: Adding new users at runtime?

2012-07-17 Thread Sean Owen
There's not a very clean answer to this. The original design from way back when was definitely about reloading a fixed model periodically. So that's always an option -- put the users in your database, or update files, or whatever backs the model and they'll turn up at the next reload. The

Re: Need help on Mahout Recommendation - using Preferences

2012-07-16 Thread Sean Owen
On Mon, Jul 16, 2012 at 9:33 AM, Cleophus Pereira cleophus.pere...@mphasis.com wrote: You mentioned to use IDRescorer to get data based on user preferences. But in mahout schema we have just itemid (number) and scores(double). How can we determine purely based on this what is user

Re: error when evaluating recommender w/boolean prefs

2012-07-15 Thread Sean Owen
This sounds like a target leak, like your test data is actually getting copied into the training data. On Sun, Jul 15, 2012 at 1:08 AM, Matt Mitchell goodie...@gmail.com wrote: One strange thing, and I'm going to dig through the MIA book tonight, is that my user based recommendation evaluator

Re: error when evaluating recommender w/boolean prefs

2012-07-15 Thread Sean Owen
this could happen from duplicate user/pref/score values in my data? How does Mahout handle duplicate entries in data, whether in a load-once file or coming from a refresh? On Sun, Jul 15, 2012 at 4:01 AM, Sean Owen sro...@gmail.com wrote: This sounds like a target leak, like your test data is actually

Re: error when evaluating recommender w/boolean prefs

2012-07-14 Thread Sean Owen
absolutely right. Things are working nicely now. - Matt On Sat, Jul 7, 2012 at 3:48 AM, Sean Owen sro...@gmail.com wrote: What it really means is that there is not enough data to make a meaningful test here. On Sat, Jul 7, 2012 at 1:28 AM, Matt Mitchell goodie...@gmail.com wrote: Hi, I have

Re: error when evaluating recommender w/boolean prefs

2012-07-14 Thread Sean Owen
Ah yes I see that now. Try increasing evaluation percentage to 1.0. At the moment you're only using 10% of the data. That's a quick way to make a bigger test! Also, what happens if you set the threshold to 0.5? On Sat, Jul 14, 2012 at 4:56 PM, Matt Mitchell goodie...@gmail.com wrote: Hey Sean,

Re: Need help on Mahout Recommendation - using Preferences

2012-07-13 Thread Sean Owen
SlopeOneRecommender does not use an ItemSimilarity; what are you referring to? User and item ID must be an integer (long). You use the IDRescorer to do exactly the query-time filtering you describe. The recommender will give you as many recs as you ask for, unless it is not possible to

Re: item-based recommendation with custom similarity

2012-07-13 Thread Sean Owen
This is too much code to ask people to debug in detail, but I get the gist of it. I am guessing that this is happening: the 2 War movies were rated 5.0, and were only tagged War. This means that any other movie tagged only War is estimated to be 5.0, given this similarity definition. And then

Re: Need help on Mahout Recommendation - using Preferences

2012-07-13 Thread Sean Owen
I don't understand this -- you make a recommender and then throw it away and make another one. Why do you have two? Giving recommendations based on user preferences is what all algorithms do. You use a Rescorer to filter results at query time, yes, based on anything you like. On Fri, Jul 13,

Re: Re: item-based recommendation with custom similarity

2012-07-13 Thread Sean Owen
Look at doEstimatePreference(). On Fri, Jul 13, 2012 at 5:16 PM, a a uzayiz...@yahoo.com wrote: Sean Thanks for your quick reply. Switching to a Jaccard coefficient based ItemSimilarity already improved things tremendously. You can change the estimation to account for certainty in some way.

Re: SeqDirectory

2012-07-13 Thread Sean Owen
user-unsubcr...@mahout.apache.org If it doesn't work it's a question for Apache, not the project. We don't run this stuff. On Fri, Jul 13, 2012 at 7:57 PM, Lingxiang Cheng lingxiangch...@yahoo.com wrote: I have unsubscribed from Mahout at least 3 times in the past year. Why do I keep getting

Re: set Number of reducers (spectral kmeans)

2012-07-13 Thread Sean Owen
I was going to say set MAHOUT_OPTS... but I just looked at the script and why does it set the number of mappers/reducers to 1 by default? It sort of looks like it intends to override the user's setting. On Fri, Jul 13, 2012 at 11:33 PM, Aniruddha Basak t-aba...@expedia.com wrote: Hi Sean,

Re: Adding users to dataModel which aren't using any items

2012-07-11 Thread Sean Owen
There would not be any point in this. A user or item with no data has no effect and can't get any recommendations under any algorithm. What are you trying to achieve or solve? Sean On Jul 11, 2012 1:10 AM, Jaspreet Singh jaspr...@usc.edu wrote: Hi, Is it possible to add users and items to

Re: Getting Illegal nDCG: NaN when running RecommenderIRStatsEvaluator

2012-07-11 Thread Sean Owen
It means you don't have enough data to run a meaningful test. On Wed, Jul 11, 2012 at 9:54 AM, Mugoma Joseph Okomba mug...@yengas.com wrote: Hello, While running evaluate () on RecommenderIRStatsEvaluator I am getting the error: java.lang.IllegalArgumentException: Illegal nDCG: NaN Could

Re: Adding users to dataModel which aren't using any items

2012-07-11 Thread Sean Owen
I see. If you're not using collaborative filtering then you're not using Recommender / DataModel. So I don't think your solution includes adding these users / items to the model. Yes, you can start by recommending a simple global top-N most popular items, or, do something reasonable based on

Re: mahout on GPU

2012-07-10 Thread Sean Owen
I don't think this result holds in general -- they chose a very CPU intensive problem, without much data movement. This won't work for, say, Mahout jobs. I don't really see the point in porting Hadoop to a GPU. If you're in a GPU you don't need most of what Hadoop does! That is I imagine this is

Re: mahout on GPU

2012-07-09 Thread Sean Owen
(I agree, it's quite a useful approach -- was answering the question about whether there was any such thing in Mahout. This all assumes you can fit the data in memory in the GPU but that is true for moderately large data sets.) On Mon, Jul 9, 2012 at 9:04 AM, Manuel Blechschmidt

Re: Candidate items for different cases

2012-07-09 Thread Sean Owen
You can derive many metrics based on just co-occurrence, if your data is 1 and 0. Pearson, cosine similarity, Tanimoto/Jaccard, Euclidean distance, log-likelihood all just reduce to counting. Why not at least give the choice? You can keep half the diff matrix since it's symmetric of course.
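
As an illustration of "just reduce to counting" for 1/0 data, the sketch below computes Tanimoto/Jaccard and cosine similarity between two items from three counts only: how many users have item A, how many have item B, and how many have both. The class name `BooleanSimilarity` and the set-of-user-IDs representation are assumptions for illustration.

```java
import java.util.Set;

// Hypothetical helper, not part of Mahout.
final class BooleanSimilarity {

    // Number of users that appear in both sets (the co-occurrence count).
    static int coOccurrence(Set<Long> usersA, Set<Long> usersB) {
        int both = 0;
        for (Long user : usersA) {
            if (usersB.contains(user)) {
                both++;
            }
        }
        return both;
    }

    // Tanimoto/Jaccard: size of the intersection over size of the union.
    static double tanimoto(int countA, int countB, int both) {
        int union = countA + countB - both;
        return union == 0 ? Double.NaN : (double) both / union;
    }

    // Cosine for 0/1 vectors: the dot product is the co-occurrence count and
    // each vector's length is the square root of its count.
    static double cosine(int countA, int countB, int both) {
        if (countA == 0 || countB == 0) {
            return Double.NaN;
        }
        return both / Math.sqrt((double) countA * countB);
    }
}
```

Both functions are symmetric in A and B, which is why only half of the item-item matrix needs to be stored.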

Re: mahout on GPU

2012-07-09 Thread Sean Owen
The factorization is the heavy number crunching. The client of a recommender needs to do very little computation in comparison, like a vector-matrix product. While a GPU might make this happen faster, it's already on the order of microseconds. Compare with the cost of downloading the whole

Re: Re: Approaches for combining multiple types of item data for user-user similarity

2012-07-09 Thread Sean Owen
than add. On Mon, Jul 9, 2012 at 7:55 AM, bangbig lizhongliangg...@163.com wrote: I have thought about this problem before, and I read several posts talking about this. Sean Owen is right that the math doesn't care about what the things are. But in practice I think a better way is that you

Re: mahout on GPU

2012-07-09 Thread Sean Owen
Hadoop and CUDA are quite at odds -- Hadoop is all about splitting up a problem across quite remote machines while CUDA/GPU approaches rely on putting all computation together not only on one machine but within one graphics card. It doesn't make sense to combine them. Either you want to

Re: mahout on GPU

2012-07-08 Thread Sean Owen
More than that, Mahout is mostly Hadoop-based, which is well up the stack from Java. No, there is nothing CUDA-related in the project. The closest things are the pure Java non-Hadoop-based recommender pieces. But it is still far from CUDA. I think CUDA is intriguing since a lot of ML is a bunch of

Re: error when evaluating recommender w/boolean prefs

2012-07-07 Thread Sean Owen
What it really means is that there is not enough data to make a meaningful test here. On Sat, Jul 7, 2012 at 1:28 AM, Matt Mitchell goodie...@gmail.com wrote: Hi, I have a recommender, with a boolean prefs model. I am following the instructions in the MIA book, but only get this exception:

Shortcut to finding the best recs from factored matrices?

2012-07-06 Thread Sean Owen
Here's one I've been puzzling over for a bit. In a factorization based on the SVD or what have you, you reconstruct the approximate original matrix (well, one row) by multiplying the matrices back together and looking for the largest elements. This is essentially multiplying a user feature vector
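
For reference, the brute-force computation described above (multiply a user feature vector against every item feature vector and keep the largest results) can be sketched as below. This is the baseline the post wants a shortcut for, not the shortcut itself. The class name `FactorTopN` and the layout of `itemFeatures` as one row per item are assumptions.

```java
import java.util.Arrays;

// Hypothetical helper, not part of Mahout.
final class FactorTopN {

    // Scores every item as the dot product of the user's feature vector with
    // the item's feature vector, then returns the indices of the n highest
    // scoring items, best first. Cost is O(items * features) per user.
    static int[] topN(double[] userFeatures, double[][] itemFeatures, int n) {
        final double[] scores = new double[itemFeatures.length];
        Integer[] order = new Integer[itemFeatures.length];
        for (int i = 0; i < itemFeatures.length; i++) {
            double dot = 0.0;
            for (int f = 0; f < userFeatures.length; f++) {
                dot += userFeatures[f] * itemFeatures[i][f];
            }
            scores[i] = dot;
            order[i] = i;
        }
        // Sort item indices by descending score.
        Arrays.sort(order, (a, b) -> Double.compare(scores[b], scores[a]));

        int count = Math.min(n, order.length);
        int[] top = new int[count];
        for (int i = 0; i < count; i++) {
            top[i] = order[i];
        }
        return top;
    }
}
```

Every item is scored for every request here, which is the cost that makes a way of narrowing the candidates attractive.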

Measuring the quality of the model

2012-07-06 Thread Sean Owen
(Changed subject from unrelated thread) You measure precision / recall, or the related F1 measure, or normalized discounted cumulative gain, or ROC. They are different, standard metrics that are less complicated than they sound. On Fri, Jul 6, 2012 at 6:13 PM, Razon, Oren oren.ra...@intel.com

Re: must numeric item and user IDs be sequential, for bin/mahout itemsimilarity?

2012-07-06 Thread Sean Owen
I don't recall that it has ever caused a problem, no. The values are just keys in a hashtable, so don't need to be sequential. On Fri, Jul 6, 2012 at 8:26 PM, Dan Brickley dan...@danbri.org wrote: I recall having problems with this before, using the non-Mahout Taste code. I have meaningful

Re: Shortcut to finding the best recs from factored matrices?

2012-07-06 Thread Sean Owen
6, 2012 at 3:18 AM, Jens Grivolla j+...@grivolla.net wrote: Maybe locality-sensitive hashing can help to get candidates before calculating the exact distance? Bye, Jens On 07/06/2012 11:35 AM, Sean Owen wrote: Here's one I've been puzzling over for a bit. In a factorization based

Re: A bunch of SVD questions...

2012-07-06 Thread Sean Owen
That's right, in the formulation you are referring to you are not predicting the original input values, so you can't compare them with RMSE or something. To test precision / recall you hold out some of the top-rated items (these are the relevant results), and see how many come back in the

Re: What is the best factorizer for low-quality LSA?

2012-07-05 Thread Sean Owen
If you want Java, the implementation in Commons Math is just fine. There are others. Limiting the number of features is just a matter of tossing all but the first k rows, or columns. On Thu, Jul 5, 2012 at 9:46 AM, Lance Norskog goks...@gmail.com wrote: What is a good factorizer for doing

Re: A bunch of SVD questions...

2012-07-05 Thread Sean Owen
: Sean Owen [mailto:sro...@gmail.com] Sent: Wednesday, July 04, 2012 18:39 To: user@mahout.apache.org Subject: Re: A bunch of SVD questions... SVD is not the same thing as ALS, though both are factoring matrices. There is not a distributed SVD-based recommender, though there is a distributed SVD

Re: A bunch of SVD questions...

2012-07-05 Thread Sean Owen
Unless you are recommending users to items too, you don't have a cold start problem for items. If you are, you can apply the same technique. Using fold-in, you can create a reasonable user or item vector from the time you have the very first interaction for the user or item, which solves most of

Re: ItemSimilarity algorithm

2012-07-05 Thread Sean Owen
Well, the other metrics are mostly undefined in this case! So yes. On Thu, Jul 5, 2012 at 6:36 PM, Saikat Kanjilal sxk1...@hotmail.com wrote: Thanks for the input Sean, one other question, in the scenario where most of the recommendations are boolean style recommendations (i.e. a csv file

Re: Approaches for combining multiple types of item data for user-user similarity

2012-07-04 Thread Sean Owen
The best default answer is to put them all in one model. The math doesn't care what the things are. Unless you have a strong reason to weight one data set I wouldn't. If you do, then two models is best. It is hard to weight a subset of the data within most similarity functions. I don't think it

Re: Generating similarity file(s) for item recommender?

2012-07-04 Thread Sean Owen
If your input is 10MB then the good news is you are not near the scale where you need Hadoop. A simple non-distributed Mahout recommender works well, and includes the Rescorer capability you need. That's a fine place to start. The book ought to give a pretty good tour of how that works in chapter

Re: recommendations for new users

2012-07-04 Thread Sean Owen
Have a look at the PlusAnonymousUserDataModel, which is a bit of a hack but a decent sort of solution for this case. It lets you temporarily add a user to the system and then everything else works as normal, so you can make recommendations to these new / temp users. There isn't a way to inject

Re: custom file data model?

2012-07-04 Thread Sean Owen
Sure. It will ignore columns beyond the fourth, which is an optional timestamp. If you just want it to read some common input file but ignore the unused columns, that's easy. You can copy and modify FileDataModel to do whatever you like, if you want it to use this data. You'd have to change other

Re: custom file data model?

2012-07-04 Thread Sean Owen
Look at the example DataModels in integration. The pattern is the same: load it all into memory! It's too slow for real-time otherwise. So there is no point in, say, moving your data from a DB to Dynamo for scalability if you're using non-distributed code. If you're using Hadoop, DataModel is not

Re: Generating similarity file(s) for item recommender?

2012-07-03 Thread Sean Owen
I'm not sure if Mridul's suggestion does what you want. Do you want to recommend items to users? then no, you do not start with item IDs and recommend to them. It sounds like your question is how to compute similarity data. The first answer is that you do not use Hadoop unless you must use

Re: Does mahout classification depends on amount of data in each category?

2012-07-03 Thread Sean Owen
(Please don't ping your questions on the list -- bad form and makes people less likely to answer.) You do not have to have equal numbers of positive/negative examples. I think you need to go back and read up on the basics of how Bayesian classification works before you dig in to Mahout. This is

Re: ItemSimilarity algorithm

2012-07-03 Thread Sean Owen
Item-item similarity is a property of the information you have on two items and just those items. Whether there are just those 2 items over 500K users, or 2M items over 500K users, makes no difference. So no, I don't think that this skew, by itself, implies you should use any particular algorithm.

Re: Continued : simple OnlineLogisticRegression classication example using mahout

2012-07-02 Thread Sean Owen
No, just set the bias term to 1 in all cases. On Mon, Jul 2, 2012 at 10:13 AM, damodar shetyo akshay.she...@gmail.com wrote: Is it required that I set the bias (intercept) equal to one only? Or can I set it to any constant value x? Also, how can I choose the value of bias for different types of data

Re: LSI using Mahout ssvd - folding a new doc into the space

2012-06-29 Thread Sean Owen
Well the inverse of a diagonal matrix like that is just going to be a diagonal matrix holding the reciprocals (1/x) of the values. That much is easy. But you need to invert more than that to fold in. I admit even I don't know the details of the Mahout implementation you're using, but I imagine
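
The first point in the reply above, that inverting a diagonal matrix only means taking reciprocals of its diagonal entries, can be sketched as follows. This is a minimal illustration, not Mahout code: the class name DiagonalInverse and the method invertDiagonal are hypothetical, and the matrix is represented by just its diagonal values.

```java
/** Hypothetical helper; not part of Mahout. */
final class DiagonalInverse {

    /**
     * Takes the diagonal entries of a diagonal matrix and returns the
     * diagonal entries of its inverse, i.e. the reciprocals 1/x.
     */
    static double[] invertDiagonal(double[] diagonal) {
        double[] inverse = new double[diagonal.length];
        for (int i = 0; i < diagonal.length; i++) {
            if (diagonal[i] == 0.0) {
                // A zero on the diagonal means the matrix is singular.
                throw new ArithmeticException("zero diagonal entry at index " + i);
            }
            inverse[i] = 1.0 / diagonal[i];
        }
        return inverse;
    }
}
```

As the reply notes, folding a new document in needs more than this step; the sketch covers only the diagonal part.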

Re: LSI using Mahout ssvd - folding a new doc into the space

2012-06-29 Thread Sean Owen
]. Thanks again for the help, Chris [1] https://cwiki.apache.org/MAHOUT/stochastic-singular-value-decomposition.data/SSVD-CLI.pdf On Fri, Jun 29, 2012 at 4:31 PM, Sean Owen sro...@gmail.com wrote: Well the inverse of a diagonal matrix like that is just going to be a diagonal matrix holding

Re: simple OnlineLogisticRegression classication example using mahout

2012-06-28 Thread Sean Owen
Because equals() is implemented. Two Points that are equals() will not have the same hashCode(), which is wrong. It only matters, I suppose, if Point is used in some context where it matters, like a HashMap key. But it is used as a HashMap key here! It happens to succeed because get() is only ever
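
The equals()/hashCode() problem described above can be reproduced in a few lines. This is a sketch under assumptions: BrokenPoint and FixedPoint are hypothetical stand-ins, not the Point class from the thread. BrokenPoint overrides equals() only, so two equal instances will almost certainly hash differently and a HashMap lookup with an equal-but-distinct key will usually miss; a lookup with the very same instance still works.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical demo classes; not the Point class discussed in the thread. */
final class HashCodeContractDemo {

    /** Overrides equals() but inherits the identity-based hashCode() from Object. */
    static final class BrokenPoint {
        final int x;
        final int y;
        BrokenPoint(int x, int y) { this.x = x; this.y = y; }
        @Override public boolean equals(Object o) {
            if (!(o instanceof BrokenPoint)) return false;
            BrokenPoint p = (BrokenPoint) o;
            return x == p.x && y == p.y;
        }
        // hashCode() is deliberately NOT overridden here.
    }

    /** Overrides both, so equal points always share a hash code. */
    static final class FixedPoint {
        final int x;
        final int y;
        FixedPoint(int x, int y) { this.x = x; this.y = y; }
        @Override public boolean equals(Object o) {
            if (!(o instanceof FixedPoint)) return false;
            FixedPoint p = (FixedPoint) o;
            return x == p.x && y == p.y;
        }
        @Override public int hashCode() { return 31 * x + y; }
    }

    /** Looking up with the same instance that was inserted works even for BrokenPoint. */
    static boolean sameInstanceLookupWorks() {
        BrokenPoint key = new BrokenPoint(1, 2);
        Map<BrokenPoint, String> map = new HashMap<>();
        map.put(key, "found");
        return "found".equals(map.get(key));
    }

    /** Looking up with an equal but distinct instance works once hashCode() is consistent. */
    static boolean equalKeyLookupWorksWhenFixed() {
        Map<FixedPoint, String> map = new HashMap<>();
        map.put(new FixedPoint(1, 2), "found");
        return "found".equals(map.get(new FixedPoint(1, 2)));
    }
}
```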

Re: Pseudodistributed recommender hangs on AWS EMR

2012-06-28 Thread Sean Owen
I don't think this is something to do with Mahout. Looks like an error from EMR. I have not seen anything like this. On Jun 28, 2012 1:40 PM, Oliver B. Fischer mails...@swe-blog.net wrote: Hi, I try to run some test with the pseudodistributed recommender job at AWS using one of the late 0.7

Re: Continued : simple OnlineLogisticRegression classication example using mahout

2012-06-28 Thread Sean Owen
(The third dimension, 1, is the bias / intercept term. You will probably see this in the literature -- go have a look at a basic intro to logistic regression. I found Andrew Ng's videos on Coursera a good intro-level survey of exactly this.) On Thu, Jun 28, 2012 at 3:57 PM, Ted Dunning
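
To make the role of that constant third dimension concrete, here is a plain-Java sketch of a two-feature logistic model with the bias input fixed at 1. It does not use Mahout's OnlineLogisticRegression; the class BiasTermDemo and its methods are hypothetical names chosen for illustration.

```java
/** Hypothetical illustration; not Mahout's OnlineLogisticRegression. */
final class BiasTermDemo {

    /** Builds the input vector [x1, x2, 1]; the trailing 1 is the bias input. */
    static double[] withBias(double x1, double x2) {
        return new double[] {x1, x2, 1.0};
    }

    /** Logistic prediction: sigmoid of the dot product of weights and features. */
    static double predict(double[] weights, double[] features) {
        double z = 0.0;
        for (int i = 0; i < weights.length; i++) {
            z += weights[i] * features[i];
        }
        return 1.0 / (1.0 + Math.exp(-z));
    }
}
```

Because the last input is always 1, the last weight is added to the sum unchanged and so acts as the intercept. Using some other constant c instead would only rescale the learned weight by 1/c, which is why fixing it at 1 is enough.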

Re: Issue with Mahout in Action Links-Simple-Sorted.txt example

2012-06-28 Thread Sean Owen
It would be best to keep discussions about the book itself to the Manning forum. This has been covered several times there, on this list, and in the book. As the error suggests, your input is not in the right format. You need to convert it or change the mapper to read its format. On Thu, Jun 28,

Re: simple OnlineLogisticRegression classication example using mahout

2012-06-27 Thread Sean Owen
Those are both true; they may not be the issue here. The test point definitely belongs in the first of the two groups you created. Why is the result surprising? On Wed, Jun 27, 2012 at 9:15 AM, Lance Norskog goks...@gmail.com wrote: Not enough samples. Machine learning algorithms in general do

Re: Question about Item Based Collaborative Filtering

2012-06-25 Thread Sean Owen
The error doesn't seem to relate to memory anyway: java.lang.IllegalArgumentException: unresolved address On Mon, Jun 25, 2012 at 7:06 AM, Something Something mailinglist...@gmail.com wrote: Please ignore the latest email. When I increased the memory size to 8g, all steps worked. Now

Re: Question about Item Based Collaborative Filtering

2012-06-24 Thread Sean Owen
:08 PM, Sean Owen sro...@gmail.com wrote: Using 1 is just fine for the reasons you give. You would be surprised how OK it is to use this even for dislikes. In fact just omit the third field in your CSV. However you need to set the boolean data flag and choose a similarity metric that is defined

Re: Question about Item Based Collaborative Filtering

2012-06-23 Thread Sean Owen
Using 1 is just fine for the reasons you give. You would be surprised how OK it is to use this even for dislikes. In fact just omit the third field in your CSV. However you need to set the boolean data flag and choose a similarity metric that is defined over such data. Pearson / cosine is not for
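
For boolean data with no preference values, one similarity that is defined is the Tanimoto (Jaccard) coefficient, which compares only which users interacted with each item. The snippet above is cut off before it names a metric, so choosing Tanimoto here is an assumption; the sketch below is plain Java with hypothetical names, not Mahout's own implementation.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical illustration of a similarity defined over boolean data. */
final class TanimotoDemo {

    /**
     * Tanimoto coefficient between two items, each represented by the set of
     * user IDs that interacted with it: |A intersect B| / |A union B|.
     */
    static double tanimoto(Set<Long> usersOfItemA, Set<Long> usersOfItemB) {
        if (usersOfItemA.isEmpty() && usersOfItemB.isEmpty()) {
            return 0.0; // convention chosen for this sketch; avoids 0/0
        }
        Set<Long> intersection = new HashSet<>(usersOfItemA);
        intersection.retainAll(usersOfItemB);
        int unionSize = usersOfItemA.size() + usersOfItemB.size() - intersection.size();
        return (double) intersection.size() / unionSize;
    }
}
```

Note that the computation needs only the two items' user sets, with no third CSV field involved.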
