Re: How to map UUID to userId in Preference class to use mahout recommender?

2013-04-07 Thread Sean Owen
You can use the low-order bits, or have a look at what the UUID class does to hash itself to 32 bits in hashCode() and emulate that for 64 bits. Collisions in a 64-bit space are very very very rare, enough to not care about here by a wide margin. A collision only means you confuse prefs from two
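A minimal sketch of the 64-bit analogue being suggested; folding the two halves with XOR mirrors what UUID.hashCode() does for 32 bits, but the helper name is made up:

    import java.util.UUID;

    public class UuidToLongId {
      // Fold the 128-bit UUID into 64 bits, analogous to UUID.hashCode()'s fold to 32 bits.
      static long toLongId(UUID uuid) {
        return uuid.getMostSignificantBits() ^ uuid.getLeastSignificantBits();
      }

      public static void main(String[] args) {
        UUID uuid = UUID.randomUUID();
        System.out.println(uuid + " -> " + toLongId(uuid));
      }
    }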

Re: Detecting rank-deficiency, or worse, via QR decomposition

2013-04-06 Thread Sean Owen
For example, here's Y: Y = -0.278098 -0.256438 0.127559 -0.045869 -0.769172 -0.255599 0.150450 -0.436548 0.209881 -0.526238 0.613175 -0.600739 -0.291662 -1.142282 0.277204 -0.296846 -0.175122 0.031656 -0.202138 -0.254480 -0.187816 -0.889571 0.052191 -0.304053

Re: Detecting rank-deficiency, or worse, via QR decomposition

2013-04-05 Thread Sean Owen
(On this aside -- the Commons Math version uses Householder reflections but operates on a transposed representation for just this reason.) On Thu, Apr 4, 2013 at 11:11 PM, Ted Dunning ted.dunn...@gmail.com wrote: But then I started trying to build a HH version using vector ops and realized

Re: Detecting rank-deficiency, or worse, via QR decomposition

2013-04-05 Thread Sean Owen
OK yes you're on to something here. I should clarify. Koobas you are right that the ALS algorithm itself is fine here as far as my knowledge takes me. The thing it inverts to solve for a row of X is something like (Y' * Cu * Y + lambda * I). No problem there, and indeed I see why the
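The per-user solve being described, written out (this is the weighted form from the implicit-feedback ALS paper; the notation is my transcription, not a quote):

    x_u = (Y^\top C^u Y + \lambda I)^{-1} Y^\top C^u p(u)

where C^u is the diagonal matrix of confidence weights for user u and p(u) is the 0/1 preference vector.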

Detecting rank-deficiency, or worse, via QR decomposition

2013-04-04 Thread Sean Owen
This is more of a linear algebra question, but I thought it worth posing to the group -- As part of a process like ALS, you solve a system like A = X * Y' for X or for Y, given the other two. A is sparse (m x n); X and Y are tall and skinny (m x k and n x k, where k << m, n). For example to solve for
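For reference, the least-squares step being posed, in LaTeX (my notation): with Y fixed, each row x_u of X solves the normal equations

    (Y^\top Y)\, x_u = Y^\top a_u

where a_u is the corresponding row of A (regularization would add a lambda*I term); the question is how to detect, ideally during the QR step, that Y or Y^\top Y is rank-deficient or nearly so.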

Re: Detecting rank-deficiency, or worse, via QR decomposition

2013-04-04 Thread Sean Owen
I think that's what I'm saying, yes. Small rows of X shouldn't become large rows of A -- and similarly small changes in X shouldn't mean large changes in A. Not quite the same thing but both are relevant. I see that this is just the ratio of largest and smallest singular values. Is there established
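The ratio mentioned, written out:

    \kappa(A) = \frac{\sigma_{\max}(A)}{\sigma_{\min}(A)}

which grows without bound as A approaches rank deficiency.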

Re: Detecting rank-deficiency, or worse, via QR decomposition

2013-04-04 Thread Sean Owen
the condition number but from what I learned this is probably the thing you want to be looking at. Good luck! [1] http://www.math.ufl.edu/~kees/ConditionNumber.pdf [2] http://www.rejonesconsulting.com/CS210_lect07.pdf On Thu, Apr 4, 2013 at 5:26 PM, Sean Owen sro...@gmail.com wrote: I

Re: Detecting rank-deficiency, or worse, via QR decomposition

2013-04-04 Thread Sean Owen
It might make a difference that you're just running 1 iteration. Normally it's run to 'convergence' -- or here let's say, 10+ iterations to be safe. This is the QR factorization of Y' * Y at the finish? This seems like it can't be right... Y has only 5 vectors in 10 dimensions and Y' * Y is

Re: Parallel GenericRecommenderIRStatsEvaluator?

2013-04-01 Thread Sean Owen
No, just was never written I suppose back in the day. The way it is structured now it creates a test split for each user, which is also slow, and may run into memory limitations, as that's N data models in memory. You could take a crack at a patch. When I rewrote this aspect in a separate

Re: Reproducibility, and Recommender Algorithms in Mahout

2013-03-30 Thread Sean Owen
You should be able to get reproducible random seed values by calling RandomUtils.useTestSeed() at the very start of your program. But if your goal is to get an unbiased view of the quality of results, you want to run several times and take the average yes. On Sat, Mar 30, 2013 at 3:57 PM,
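A minimal sketch of the call order being suggested (the data file and the rest of the program are placeholders):

    import java.io.File;
    import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
    import org.apache.mahout.cf.taste.model.DataModel;
    import org.apache.mahout.common.RandomUtils;

    public class ReproducibleEval {
      public static void main(String[] args) throws Exception {
        RandomUtils.useTestSeed();  // must run before anything touches the RNG
        DataModel model = new FileDataModel(new File("ratings.csv"));
        // ... build the recommender and run the evaluation as usual ...
      }
    }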

Re: Setting preferences in GenericDataModel.

2013-03-29 Thread Sean Owen
Yes it's OK. You'll need to take care of thread safety though, which will be hard. The other problem is that changing the underlying data doesn't necessarily invalidate caches above it. You'll have to consider that part as well. I suppose this is part of why it was conceived as a model where the data is

Re: Number of Clustering MR-Jobs

2013-03-28 Thread Sean Owen
This is really a Hadoop-level thing. I am not sure I have ever successfully induced M/R to run multiple mappers on less than one block of data, even with a low max split size. Reducers you can control. On Thu, Mar 28, 2013 at 9:04 AM, Sebastian Briesemeister

Re: sql data model w/where clause

2013-03-25 Thread Sean Owen
Modify the existing code to change the SQL -- it's just a matter of copying a class that only specifies SQL and making new SQL statements. I think there's a version that even reads from a Properties object. On Mon, Mar 25, 2013 at 12:11 AM, Matt Mitchell goodie...@gmail.com wrote: Hi, I have a

Re: Mathematical background of ALS recommenders

2013-03-25 Thread Sean Owen
Points from across several e-mails -- The initial item-feature matrix can be just random unit vectors too. I have slightly better results with that. You are finding the least-squares solution of A = U M' for U given A and M. Yes you can derive that analytically as the zero of the derivative of

Re: Mathematical background of ALS recommenders

2013-03-25 Thread Sean Owen
OK, the 'k iterations' happen inline in one job? I thought the Lanczos algorithm found the k eigenvalues/vectors one after the other. Yeah I suppose that doesn't literally mean k map/reduce jobs. Yes the broader idea was whether or not you might get something useful out of ALS earlier. On Mon,

Re: Mathematical background of ALS recommenders

2013-03-25 Thread Sean Owen
On Mon, Mar 25, 2013 at 11:25 AM, Sebastian Schelter s...@apache.org wrote: Well in LSI it is ok to do that, as a missing entry means that the document contains zero occurrences of a given term which is totally fine. In Collaborative Filtering with explicit feedback, a missing rating is not

Re: postgres recommendation adapter

2013-03-25 Thread Sean Owen
a ClassNotFoundException I'm using version 0.7 of mahout-core and mahout-math, and version 0.5 of mahout-utils. - Matt On Mon, Mar 25, 2013 at 6:21 AM, Sean Owen sro...@gmail.com wrote: I think you'd have to define not working first On Mon, Mar 25, 2013 at 1:32 AM, Matt Mitchell goodie...@gmail.com

Re: Mathematical background of ALS recommenders

2013-03-25 Thread Sean Owen
(The unobserved entries are still in the loss function, just with low weight. They are also in the system of equations you are solving for.) On Mon, Mar 25, 2013 at 1:38 PM, Dmitriy Lyubimov dlie...@gmail.com wrote: Classic als wr is bypassing underlearning problem by cutting out unrated
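The weighted loss being described, in the form used by the implicit-feedback paper (my transcription):

    \min_{X,Y} \sum_{u,i} c_{ui}\,(p_{ui} - x_u^\top y_i)^2 + \lambda\Big(\sum_u \|x_u\|^2 + \sum_i \|y_i\|^2\Big), \qquad c_{ui} = 1 + \alpha\, r_{ui}

so unobserved entries (r_{ui} = 0) stay in the sum with the low baseline weight 1 instead of being dropped.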

Re: Mathematical background of ALS recommenders

2013-03-25 Thread Sean Owen
On Mon, Mar 25, 2013 at 1:41 PM, Koobas koo...@gmail.com wrote: But the assumption works nicely for click-like data. Better still when you can weakly prefer to reconstruct the 0 for missing observations and much more strongly prefer to reconstruct the 1 for observed data. This does seem

Re: Mathematical background of ALS recommenders

2013-03-25 Thread Sean Owen
On Mon, Mar 25, 2013 at 9:52 AM, Sean Owen sro...@gmail.com wrote: On Mon, Mar 25, 2013 at 1:41 PM, Koobas koo...@gmail.com wrote: But the assumption works nicely for click-like data. Better still when you can weakly prefer to reconstruct the 0 for missing observations and much more

Re: Boosting User-Based with the user's attributes

2013-03-18 Thread Sean Owen
be normalized in a way? Thank you and sorry for the basic questions. Regards, Agata Filiana On 16 March 2013 13:41, Sean Owen sro...@gmail.com wrote: There are many ways to think about combining these two types of data. If you can make some similarity metric based on age, gender

Re: Boosting User-Based with the user's attributes

2013-03-18 Thread Sean Owen
sense? Or am I confusing myself? Agata On 18 March 2013 14:23, Sean Owen sro...@gmail.com wrote: You would have to make up the similarity metric separately since it depends entirely on how you want to define it. The part of the book you are talking about concerns rescoring, which

Re: ALS-WR on Million Song dataset

2013-03-18 Thread Sean Owen
One word of caution is that there are at least two papers on ALS and they define lambda differently. I think you are talking about Collaborative Filtering for Implicit Feedback Datasets. I've been working with some folks who point out that alpha=40 seems to be too high for most data sets. After

Re: ALS-WR on Million Song dataset

2013-03-18 Thread Sean Owen
://labrosa.ee.columbia.edu/millionsong/tasteprofile On 18.03.2013 17:47, Sean Owen wrote: One word of caution is that there are at least two papers on ALS and they define lambda differently. I think you are talking about Collaborative Filtering for Implicit Feedback Datasets. I've been working

Re: Boosting User-Based with the user's attributes

2013-03-18 Thread Sean Owen
somehow loop through the item data and the hobby data and then combine the score for a pair of users? I am having trouble combining both similarities into one metric; could you possibly point me to a clue? Thank you On 18 March 2013 14:54, Sean Owen sro...@gmail.com wrote

Re: reproducibility

2013-03-17 Thread Sean Owen
What's your question? ALS has a random starting point which changes the results a bit. Not sure about KNN though. On Sun, Mar 17, 2013 at 3:03 AM, Koobas koo...@gmail.com wrote: Can anybody shed any light on the issue of reproducibility in Mahout, with and without Hadoop, specifically in the

Re: reproducibility

2013-03-17 Thread Sean Owen
of, a big deal. Maybe it's not much of a concern in machine learning. I am just curious. On Sun, Mar 17, 2013 at 8:46 AM, Sean Owen sro...@gmail.com wrote: What's your question? ALS has a random starting point which changes the results a bit. Not sure about KNN though. On Sun, Mar 17

Re: Boosting User-Based with the user's attributes

2013-03-16 Thread Sean Owen
There are many ways to think about combining these two types of data. If you can make some similarity metric based on age, gender and interests, then you can use it as the similarity metric in GenericBooleanPrefUserBasedRecommender. You would be using both data sets in some way. Of course this
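A sketch of what that wiring could look like; the attribute-based similarity here is entirely hypothetical (a made-up age lookup), only the Mahout classes are real:

    import java.io.File;
    import java.util.Collection;
    import org.apache.mahout.cf.taste.common.Refreshable;
    import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
    import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
    import org.apache.mahout.cf.taste.impl.recommender.GenericBooleanPrefUserBasedRecommender;
    import org.apache.mahout.cf.taste.model.DataModel;
    import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
    import org.apache.mahout.cf.taste.similarity.PreferenceInferrer;
    import org.apache.mahout.cf.taste.similarity.UserSimilarity;

    public class AttributeSimilarityExample {

      // Hypothetical similarity based only on user attributes (age as a stand-in).
      static class AgeSimilarity implements UserSimilarity {
        @Override
        public double userSimilarity(long userID1, long userID2) {
          double age1 = lookupAge(userID1);  // placeholder for your attribute store
          double age2 = lookupAge(userID2);
          return 1.0 / (1.0 + Math.abs(age1 - age2));  // closer ages => more similar
        }
        @Override
        public void setPreferenceInferrer(PreferenceInferrer inferrer) { }
        @Override
        public void refresh(Collection<Refreshable> alreadyRefreshed) { }
        private double lookupAge(long userID) { return 30.0; }  // stub
      }

      public static void main(String[] args) throws Exception {
        DataModel model = new FileDataModel(new File("boolean-prefs.csv"));
        UserSimilarity similarity = new AgeSimilarity();
        UserNeighborhood neighborhood = new NearestNUserNeighborhood(25, similarity, model);
        GenericBooleanPrefUserBasedRecommender recommender =
            new GenericBooleanPrefUserBasedRecommender(model, neighborhood, similarity);
        System.out.println(recommender.recommend(123L, 5));
      }
    }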

Re: QR decomposition in ALS-WR code

2013-03-15 Thread Sean Owen
I think you are referring to the same step? QR decomposition is how you solve for u_i, which is, I imagine, the same step you have in mind.

Re: Mahout and Hadoop 2

2013-03-13 Thread Sean Owen
I think someone submitted a different build profile that changes the dependencies for you. I believe the issue is using hadoop-common and not hadoop-core as well as changing versions. I think the rest is compile compatible and probably runtime compatible. But I've not tried. On Wed, Mar 13, 2013

Re: Top-N recommendations from SVD

2013-03-06 Thread Sean Owen
it is a likely performance bug. The computation is AB'. Perhaps you refer to rows of B which are the columns of B'. Sent from my sleepy thumbs set to typing on my iPhone. On Mar 6, 2013, at 4:16 AM, Sean Owen sro...@gmail.com wrote: If there are 100 features, it's more like 2.6M * 2.8M * 100

Re: Top-N recommendations from SVD

2013-03-06 Thread Sean Owen
OK and he mentioned that 10 mappers were running, when it ought to be able to use several per machine. The # of mappers is a function of the input size really, so you probably need to turn down the max file split size to induce more mappers? On Wed, Mar 6, 2013 at 11:16 AM, Sebastian Schelter

Re: Top-N recommendations from SVD

2013-03-06 Thread Sean Owen
the allocation down to negligible levels. On Wed, Mar 6, 2013 at 6:11 AM, Sean Owen sro...@gmail.com wrote: OK, that's reasonable on 35 machines. (You can turn up to 70 reducers, probably, as most machines can handle 2 reducers at once). I think the recommendation step loads one whole matrix

Re: Top-N recommendations from SVD

2013-03-05 Thread Sean Owen
Without any tricks, yes you have to do this much work to really know which are the largest values in UM' for every row. There's not an obvious twist that speeds it up. (Do you really want to compute all user recommendations? How many of the 2.6M are likely to be active soon, or ever?) First,

Re: Top-N recommendations from SVD

2013-03-05 Thread Sean Owen
if this was sane! I'll have a look into this as well if needed. Thanks for the advice! Josh On 5 March 2013 22:23, Sean Owen sro...@gmail.com wrote: Without any tricks, yes you have to do this much work to really know which are the largest values in UM' for every row. There's not an obvious twist

Re: FileDataModel

2013-03-03 Thread Sean Owen
methods throw an UnsupportedOperationException. I read in an old thread that you had updated these methods to work. I'm not sure what I'm missing here. Can you point me in the right direction? On Mar 2, 2013, at 6:42 AM, Sean Owen wrote: Yes to integrate any new data everything must

Re: FileDataModel

2013-03-02 Thread Sean Owen
Yes to integrate any new data everything must be reloaded. On Mar 2, 2013 6:34 AM, Nadia Najjar ned...@gmail.com wrote: I am using a FileDataModel and remove and insert preferences before estimating preferences. Do I need to rebuild the recommender after these methods are called for it to be

Re: Hadoop version compatibility

2013-03-02 Thread Sean Owen
Although I don't know of any specific incompatibility, I would not be surprised. 0.18 is pretty old. As you can see in pom.xml it currently works against the latest stable version, 1.1.1. On Sat, Mar 2, 2013 at 6:16 PM, MARCOS UBIRAJARA marcosubiraj...@ig.com.br wrote: Dear Gentleman, First

Re: How to remove popular items?

2013-02-27 Thread Sean Owen
It's true, although many of the algorithms will by nature not emphasize popular items. There is an old and semi-deprecated class in the project called InverseUserFrequency, which you can use to manually de-emphasize popular items internally. I wouldn't really recommend it. You can always use
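One way to down-weight items at recommendation time is an IDRescorer; a sketch (the popularity test and penalty factor are made up for illustration):

    import org.apache.mahout.cf.taste.recommender.IDRescorer;

    // Push down the score of items considered "too popular"; the lookup is hypothetical.
    public class PopularityPenaltyRescorer implements IDRescorer {
      @Override
      public double rescore(long itemID, double originalScore) {
        return isVeryPopular(itemID) ? originalScore * 0.5 : originalScore;
      }
      @Override
      public boolean isFiltered(long itemID) {
        return false;  // don't exclude anything outright, just down-weight
      }
      private boolean isVeryPopular(long itemID) {
        return false;  // placeholder: consult your own item-popularity counts here
      }
    }
    // Usage: recommender.recommend(userID, howMany, new PopularityPenaltyRescorer());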

Re: Vector distance within a cluster

2013-02-27 Thread Sean Owen
A common measure of cluster coherence is the mean distance or mean squared difference between the members and the cluster centroid. It sounds like this is the kind of thing you're measuring with this all-pairs distances. That could be a measure too; I've usually seen that done by taking the
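Written out, the centroid-based measure mentioned here (my notation):

    \mathrm{coherence}(C) = \frac{1}{|C|} \sum_{x \in C} \| x - \mu_C \|^2, \qquad \mu_C = \frac{1}{|C|} \sum_{x \in C} x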

Re: Cross recommendation

2013-02-24 Thread Sean Owen
I may not be 100% following the thread, but: Similarity metrics won't care whether some items are really actions and some items are items. The math is the same. The problem which you may be alluding to is the one I mentioned earlier -- there is no connection between item and item-action in the

Re: GenericUserBasedRecommender vs GenericItemBasedRecommender

2013-02-21 Thread Sean Owen
It's also valid, yes. The difference is partly due to asymmetry, but also just historical (i.e. no great reason). The item-item system uses a different strategy for picking candidates based on CandidateItemStrategy. On Thu, Feb 21, 2013 at 2:37 PM, Koobas koo...@gmail.com wrote: In the

Re: Precision used by mahout

2013-02-20 Thread Sean Owen
I think all of the code uses double-precision floats. I imagine much of it could work as well with single-precision floats. MapReduce and a GPU are very different things though, and I'm not sure how you would use both together effectively. On Wed, Feb 20, 2013 at 7:10 AM, shruti ranade

Re: Precision used by mahout

2013-02-20 Thread Sean Owen
over this in addition to what Ted Dunning presented the other day on Solr involvement in building/loading a cooccurrence matrix for Mahout recommendation, it should be a big leap in innovating Mahout recommendation. Am I missing something or just dreaming? Regards, Y.Mandai 2013/2/20 Sean Owen sro

Re: Problems with Mahout's RecommenderIRStatsEvaluator

2013-02-17 Thread Sean Owen
Although bigger N values mostly overcome this problem, it still does not seem totally supervised. On Sun, Feb 17, 2013 at 1:49 AM, Sean Owen sro...@gmail.com wrote: The very question at hand is how to label the data as relevant and not relevant results. The question exists because

Re: Problems with Mahout's RecommenderIRStatsEvaluator

2013-02-16 Thread Sean Owen
No, this is not a problem. Yes it builds a model for each user, which takes a long time. It's accurate, but time-consuming. It's meant for small data. You could rewrite your own test to hold out data for all test users at once. That's what I did when I rewrote a lot of this just because it was

Re: Problems with Mahout's RecommenderIRStatsEvaluator

2013-02-16 Thread Sean Owen
similar to B than C, which is not true. From: Sean Owen sro...@gmail.com To: Mahout User List user@mahout.apache.org; Ahmet Yılmaz ahmetyilmazefe...@yahoo.com Sent: Saturday, February 16, 2013 8:41 PM Subject: Re: Problems with Mahout's

Re: Problems with Mahout's RecommenderIRStatsEvaluator

2013-02-16 Thread Sean Owen
of the test user in a random fashion. On Sat, Feb 16, 2013 at 9:37 PM, Sean Owen sro...@gmail.com wrote: Yes. But: the test sample is small. Using 40% of your data to test is probably quite too much. My point is that it may be the least-bad thing to do. What test are you proposing

Re: Problems with Mahout's RecommenderIRStatsEvaluator

2013-02-16 Thread Sean Owen
prediction is clearly a supervised ML problem On Sat, Feb 16, 2013 at 10:15 PM, Sean Owen sro...@gmail.com wrote: This is a good answer for evaluation of supervised ML, but, this is unsupervised. Choosing randomly is choosing the 'right answers' randomly, and that's plainly problematic

Re: Problems with Mahout's RecommenderIRStatsEvaluator

2013-02-16 Thread Sean Owen
If you're suggesting that you hold out only high-rated items, and then sample them, then that's what is done already in the code, except without the sampling. The sampling doesn't buy anything that I can see. If you're suggesting holding out a random subset and then throwing away the held-out

Re: Problems with Mahout's RecommenderIRStatsEvaluator

2013-02-16 Thread Sean Owen
at 10:29 PM, Tevfik Aytekin tevfik.ayte...@gmail.com wrote: I'm suggesting the second one. In that way the test user's ratings in the training set will be composed of both low and high rated items, which prevents the problem pointed out by Ahmet. On Sat, Feb 16, 2013 at 11:19 PM, Sean Owen sro

Re: Problems with Mahout's RecommenderIRStatsEvaluator

2013-02-16 Thread Sean Owen
The very question at hand is how to label the data as relevant and not relevant results. The question exists because this is not given, which is why I would not call this a supervised problem. That may just be semantics, but the point I wanted to make is that the reasons choosing a random training

Re: Improving quality of item similarities?

2013-02-14 Thread Sean Owen
Yes, I don't know if removing that data would improve results. It might mean you can compute things faster, at little or no observable loss in quality of the results. I'm not sure, but you probably have repeat purchases of the same item, and items of different value. Working in that data may help

Re: Shopping cart

2013-02-14 Thread Sean Owen
This sounds like a job for frequent item set mining, which is kind of a special case of the ideas you've mentioned here. Given N items in a cart, which next item most frequently occurs in a purchased cart? On Thu, Feb 14, 2013 at 6:30 PM, Pat Ferrel pat.fer...@gmail.com wrote: I thought you

Re: Shopping cart

2013-02-14 Thread Sean Owen
harder to implement but we can also test precision on that and compare the two. The recommender method below should be reasonable AFAICT except for the method(s) of retrieving recs, which seem likely to be slow. On Feb 14, 2013, at 9:45 AM, Sean Owen sro...@gmail.com wrote: This sounds like

Re: Shopping cart

2013-02-14 Thread Sean Owen
comparisons--worst case. Each cart is likely to have only a few items in it and I imagine this speeds the similarity calc. I guess I'll try it as described and optimize for speed if the precision is good compared to the apriori algo. On Feb 14, 2013, at 10:57 AM, Sean Owen sro...@gmail.com wrote

Re: Implicit preferences

2013-02-10 Thread Sean Owen
I think you'd have to hack the code to not exclude previously-seen items, or at least, not of the type you wish to consider. Yes you would also have to hack it to add rather than replace existing values. Or for test purposes, just do the adding yourself before inputting the data. My hunch is that

Re: Implicit preferences

2013-02-10 Thread Sean Owen
of the sparsified versions of these and let the search engine handle the weighting of different components at query time. Having these components separated into different fields in the search index seems to help quite a lot, which makes a fair bit of sense. On Sun, Feb 10, 2013 at 9:55 AM, Sean Owen

Re: Rating scale

2013-02-04 Thread Sean Owen
You don't have to fix a scale. But your data needs to be consistent. It wouldn't work to have users rate on a 1-5 scale one day, and 1-100 tomorrow (unless you go back and normalize the old data to 1-100). On Mon, Feb 4, 2013 at 3:56 PM, Zia mel ziad.kame...@gmail.com wrote: Hi , is there a

Re: Failed to execute goal Surefire plugin -- any ideas?

2013-02-04 Thread Sean Owen
You can -DskipTests to skip tests, since that's what it is complaining about. There aren't any current failures in trunk so could be something specific to your setup. Or a flaky test. It may still be something to fix. On Mon, Feb 4, 2013 at 3:37 PM, jellyman colm_r...@hotmail.com wrote: Hi

Re: Threshold-based neighborhood and getReach

2013-02-04 Thread Sean Owen
You are asking for a smaller and smaller neighborhood around a user. At some point the neighborhood includes no users, for some people -- or, the neighborhood includes no new items. Nothing can be recommended, and so recall drops. Precision and recall tend to go in opposite directions for similar

Re: Server sizing Hadoop + Mahout

2013-02-02 Thread Sean Owen
The problem with this POV is that it assumes it's obvious what the right outcome is. With a transaction test or a disk write test or big sort, it's obvious and you can make a benchmark. With ML, it's not even close. For example, I can make you a recommender that is literally as fast as you like

Re: (near) real time recommender/predictor

2013-01-31 Thread Sean Owen
It's a good question. I think you can achieve a partial solution in Mahout. Real-time suggests that you won't be able to make use of Hadoop-based implementations, since they are by nature big batch processes. All of the implementations accept the same input -- user,item,value. That's OK; you can

Re: Using setPreference() to update recommendations in DataModel in Memory

2013-01-30 Thread Sean Owen
:30 PM, Sean Owen sro...@gmail.com wrote: It doesn't really work this way. The model is predicated on loading the data from backing store periodically. In the short term it is read only. This method is misleading in a sense. On Jan 29, 2013 3:31 PM, Henning Kuich hku...@gmail.com wrote: Dear

Re: Using setPreference() to update recommendations in DataModel in Memory

2013-01-29 Thread Sean Owen
It doesn't really work this way. The model is predicated on loading the data from backing store periodically. In the short term it is read only. This method is misleading in a sense. On Jan 29, 2013 3:31 PM, Henning Kuich hku...@gmail.com wrote: Dear All, I would like to be able to update

Re: Question about server/computer architecture...

2013-01-29 Thread Sean Owen
This is quite small and certainly doesn't require Hadoop. That's the good news. Any reasonable server will do well for you. You won't be memory bound. More cores will let you serve more QPS. Your pain points will be elsewhere like tuning for best quality and real time updates. See my separate

Re: QRDecomposition performance

2013-01-28 Thread Sean Owen
Is it worth simply using the Commons Math implementation? On Mon, Jan 28, 2013 at 8:04 AM, Sebastian Schelter s...@apache.org wrote: This is great news and will automatically boost the performance of all our ALS-based recommenders which are all using QRDecomposition internally. On 28.01.2013
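For reference, a minimal sketch of how the Commons Math (commons-math3) QR decomposition would be used as a solver, if it were swapped in; the small system here is made up:

    import org.apache.commons.math3.linear.Array2DRowRealMatrix;
    import org.apache.commons.math3.linear.ArrayRealVector;
    import org.apache.commons.math3.linear.QRDecomposition;
    import org.apache.commons.math3.linear.RealMatrix;
    import org.apache.commons.math3.linear.RealVector;

    public class CommonsMathQrExample {
      public static void main(String[] args) {
        // Solve a small least-squares system A x = b via QR, as ALS does per row.
        RealMatrix a = new Array2DRowRealMatrix(new double[][] {{2, 1}, {1, 3}, {0, 1}});
        RealVector b = new ArrayRealVector(new double[] {1, 2, 3});
        RealVector x = new QRDecomposition(a).getSolver().solve(b);
        System.out.println(x);
      }
    }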

Re: MatrixMultiplicationJob runs with 1 mapper only ?

2013-01-28 Thread Sean Owen
Is it even possible that MatrixMultiplication can run in a distributed way on multiple mappers, as it internally uses CompositeInputFormat? Please suggest. Thanks Stuti -Original Message- From: Sean Owen [mailto:sro...@gmail.com] Sent: Wednesday, January 23, 2013 6:42 PM To: Mahout User

Re: Precision question

2013-01-28 Thread Sean Owen
= evaluator.evaluate(recommenderBuilder, null, model, null, 10, GenericRecommenderIRStatsEvaluator.CHOOSE_THRESHOLD, 0.05); Many thanks On Fri, Jan 25, 2013 at 12:26 PM, Sean Owen sro...@gmail.com
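A fuller sketch of that call, assuming the left-hand side is an IRStatistics result; the data file and similarity choice are placeholders:

    import java.io.File;
    import org.apache.mahout.cf.taste.common.TasteException;
    import org.apache.mahout.cf.taste.eval.IRStatistics;
    import org.apache.mahout.cf.taste.eval.RecommenderBuilder;
    import org.apache.mahout.cf.taste.impl.eval.GenericRecommenderIRStatsEvaluator;
    import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
    import org.apache.mahout.cf.taste.impl.recommender.GenericItemBasedRecommender;
    import org.apache.mahout.cf.taste.impl.similarity.TanimotoCoefficientSimilarity;
    import org.apache.mahout.cf.taste.model.DataModel;
    import org.apache.mahout.cf.taste.recommender.Recommender;

    public class IrEvalExample {
      public static void main(String[] args) throws Exception {
        DataModel model = new FileDataModel(new File("ratings.csv"));
        RecommenderBuilder builder = new RecommenderBuilder() {
          @Override
          public Recommender buildRecommender(DataModel dataModel) throws TasteException {
            return new GenericItemBasedRecommender(dataModel, new TanimotoCoefficientSimilarity(dataModel));
          }
        };
        GenericRecommenderIRStatsEvaluator evaluator = new GenericRecommenderIRStatsEvaluator();
        // precision/recall at 10, per-user relevance threshold, evaluated on 5% of users
        IRStatistics stats = evaluator.evaluate(builder, null, model, null, 10,
            GenericRecommenderIRStatsEvaluator.CHOOSE_THRESHOLD, 0.05);
        System.out.println(stats.getPrecision() + " / " + stats.getRecall());
      }
    }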

Re: Precision question

2013-01-28 Thread Sean Owen
Yes several independent samples of all the data will, together, give you a better estimate of the real metric value than any individual one. On Mon, Jan 28, 2013 at 5:41 PM, Zia mel ziad.kame...@gmail.com wrote: What about running several tests on small data , can't that give an indicator of

Re: Precision question

2013-01-25 Thread Sean Owen
The way I do it is to set x differently for each user, to the number of items in the user's test set -- you ask for x recommendations. This makes precision == recall, note. It dodges this problem though. Otherwise, if you fix x, the condition you need is stronger, really: each user needs >= x *test
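Why precision equals recall in that setup, spelled out: with hits = relevant items that appear in the top-x recommendations,

    \mathrm{precision@}x = \frac{|\mathrm{hits}|}{x}, \qquad \mathrm{recall} = \frac{|\mathrm{hits}|}{|\mathrm{test\ set}|}

and the two denominators coincide when x is set to the size of the user's test set.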

Re: Precision question

2013-01-25 Thread Sean Owen
? Hmm, something like selecting y sets, each set having a minimum of z users? On Fri, Jan 25, 2013 at 12:09 PM, Sean Owen sro...@gmail.com wrote: The way I do it is to set x differently for each user, to the number of items in the user's test set -- you ask for x recommendations. This makes precision

Re: EMR setup for seq2sparse

2013-01-24 Thread Sean Owen
In my experience, using many small instances hurts since there is more data transferred (less data is local to any given computation) and the instances have lower I/O performance. On the high end, super-big instances become counter-productive because they are not as cheap on the spot market -- and

Re: Boolean preferences and evaluation

2013-01-24 Thread Sean Owen
On Tue, Jan 22, 2013 at 10:42 AM, Sean Owen sro...@gmail.com wrote: Yes any metric that concerns estimated value vs real value can't be used since all values are 1. Yes, when you use the non-boolean version with boolean data you always get 1. When you use the boolean version with boolean

Re: Boolean preferences and evaluation

2013-01-24 Thread Sean Owen
Well, if you are throwing away rating data, you are throwing away rating data. They are no longer 100% different but 100% the same. If that's not a good thing to do, don't do it. It's possible that using ratings gets better precision, and it's possible that it doesn't. It depends on whether the

Re: Boolean preferences and evaluation

2013-01-24 Thread Sean Owen
Yes, but the similarities are no longer weights, because there is nothing to weight. They are used to compute a score directly, which is not a weighted average but a function of the similarities themselves. While it is true that more distant neighbors have less effect in general, when the

Re: MatrixMultiplicationJob runs with 1 mapper only ?

2013-01-23 Thread Sean Owen
not got any success http://lucene.472066.n3.nabble.com/Setting-Number-of-Mappers-and-Reducers-in-DistributedRowMatrix-Jobs-td888980.html Stuti -Original Message- From: Sean Owen [mailto:sro...@gmail.com] Sent: Wednesday, January 16, 2013 4:46 PM

Re: Finding best NearestNUserNeighborhood size

2013-01-23 Thread Sean Owen
The stochastic nature of the evaluation means your results will vary randomly from run to run. This looks to my eyeballs like most of the variation you see. You probably want to average over many runs. You will probably find that accuracy peaks around some neighborhood size: adding more useful

Re: Finding best NearestNUserNeighborhood size

2013-01-23 Thread Sean Owen
That is good for making a test repeatable because you are picking the same random sample repeatedly. For evaluation purposes here that's not a good thing and you do want several actually different samples of the result. On Jan 23, 2013 1:19 PM, Stevo Slavić ssla...@gmail.com wrote: When

Re: Boolean preferences and evaluation

2013-01-23 Thread Sean Owen
they were not using a Boolean recommender , something like code 1 maybe? Thanks On Tue, Jan 22, 2013 at 10:42 AM, Sean Owen sro...@gmail.com wrote: Yes any metric that concerns estimated value vs real value can't be used since all values are 1. Yes, when you use the non-boolean version

Re: ItemBased and data size

2013-01-23 Thread Sean Owen
It's hard to make such generalization, but all else equal, I'd expect more data to improve results and decrease error, yes. On Wed, Jan 23, 2013 at 8:02 PM, Zia mel ziad.kame...@gmail.com wrote: Is there a relation between ItemBased and data size? I found when I increase the data size the MAE

Re: Boolean preferences and evaluation

2013-01-22 Thread Sean Owen
GenericUserBasedRecommender(model, neighborhood, similarity); }}; On Tue, Jan 22, 2013 at 1:58 AM, Sean Owen sro...@gmail.com wrote: No it's really #2, since the first still has data that is not true/false. I am not sure what eval you are running, but an RMSE test wouldn't be useful

Re: Boolean preferences and evaluation

2013-01-22 Thread Sean Owen
Moreover, when I use DataModel model = new FileDataModel(new File(ua.base)); in code 2, the MAE score was higher. When you say RMSE can't be used with boolean data, I assume MAE also can't be used? Thanks! On Tue, Jan 22, 2013 at 10:08 AM, Sean Owen sro...@gmail.com wrote: RMSE can't

Re: Question - Mahout Taste - User-Based Recommendations...

2013-01-22 Thread Sean Owen
Yes that's right. Look at UserBasedRecommender.mostSimilarUserIDs(), and Recommender.estimatePreference(). These do what you are interested in, and yes they are easy since they are just steps in the recommendation process anyway. On Tue, Jan 22, 2013 at 6:38 PM, Henning Kuich hku...@gmail.com
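A minimal usage sketch of the two calls mentioned (the data file, similarity choice, and IDs are placeholders):

    import java.io.File;
    import java.util.Arrays;
    import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
    import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
    import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
    import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
    import org.apache.mahout.cf.taste.model.DataModel;
    import org.apache.mahout.cf.taste.recommender.UserBasedRecommender;
    import org.apache.mahout.cf.taste.similarity.UserSimilarity;

    public class SimilarUsersExample {
      public static void main(String[] args) throws Exception {
        DataModel model = new FileDataModel(new File("ratings.csv"));
        UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
        UserBasedRecommender recommender = new GenericUserBasedRecommender(
            model, new NearestNUserNeighborhood(10, similarity, model), similarity);

        long[] similarUsers = recommender.mostSimilarUserIDs(123L, 5);  // 5 most similar users
        float estimate = recommender.estimatePreference(123L, 456L);    // predicted preference
        System.out.println(Arrays.toString(similarUsers) + " / " + estimate);
      }
    }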

Re: Question - Mahout Taste - User-Based Recommendations...

2013-01-22 Thread Sean Owen
for the quick reply! HK On Tue, Jan 22, 2013 at 7:40 PM, Sean Owen sro...@gmail.com wrote: Yes that's right. Look at UserBasedRecommender.mostSimilarUserIDs(), and Recommender.estimatePreference(). These do what you are interested in, and yes they are easy since they are just steps

Re: Changing in-memory DataModel to a DB dependent only DataModel after building recommender

2013-01-21 Thread Sean Owen
You would have to write this yourself, yes. If you're not keeping the data in memory, you're not updating the results in real-time. So there's no real need to keep any DataModel around at all. Just pre-compute and store recommendations and update them periodically. Nothing has to be on-line then.

Re: Changing in-memory DataModel to a DB dependent only DataModel after building recommender

2013-01-21 Thread Sean Owen
matrix? So it would make memory usage much worse, even if it is possible. Wouldn't it be better to keep the model and compute whenever necessary? Thanks Ceyhun Can Ulker On Mon, Jan 21, 2013 at 9:58 PM, Sean Owen sro...@gmail.com wrote: You would have to write this yourself, yes. If you're

Re: Boolean preferences and evaluation

2013-01-21 Thread Sean Owen
No it's really #2, since the first still has data that is not true/false. I am not sure what eval you are running, but an RMSE test wouldn't be useful in case #2. It would always be 0 since there is only one value in the universe: 1. No value can ever be different from the right value. On Tue,

Re: Any utility to solve the matrix inversion in Map/Reduce Way

2013-01-18 Thread Sean Owen
And, do you really need an inverse, or pseudo-inverse? But, no, there are really no direct utilities for this. But we could probably tell you how to do it efficiently, as long as you don't actually mean a full inverse. On Fri, Jan 18, 2013 at 11:58 AM, Ted Dunning ted.dunn...@gmail.com wrote:

Re: Problem with mahout and AWS

2013-01-18 Thread Sean Owen
You should give more detail about the errors. You are running out of memory on the child workers. This is not surprising since the default memory they allocate is fairly small, and you're running a complete recommender system inside each mapper. It doesn't have much to do with the size of the instance

Re: trying to get grouplens example to run

2013-01-17 Thread Sean Owen
That's the error right there: On Thu, Jan 17, 2013 at 9:57 PM, Kamal Ali k...@grokker.com wrote: Caused by: java.io.IOException: Unexpected input format on line: 1 1 5

RE: MatrixMultiplicationJob runs with 1 mapper only ?

2013-01-16 Thread Sean Owen
Please suggest. -Original Message- From: Sean Owen [mailto:sro...@gmail.com] Sent: Wednesday, January 16, 2013 1:23 PM To: Mahout User List Subject: Re: MatrixMultiplicationJob runs with 1 mapper only ? It's up to Hadoop in the end. Try calling FileInputFormat.setMaxInputSplitSize

Re: Test multiple similarities using the same data

2013-01-16 Thread Sean Owen
You can try resetting all the random seeds with RandomUtils.useTestSeed() On Jan 16, 2013 4:01 PM, Zia mel ziad.kame...@gmail.com wrote: Hi How to evaluate a recommender using different similarities ? Once we call evaluator.evaluate(recommenderBuilder,..) it will decide the training and test

Re: Recommend to a group of users

2013-01-16 Thread Sean Owen
Not really directly, no. You can make N individual recommendations and combine them, and there are many ways to do that. You can blindly rank them on their absolute scores. You can interleave rankings so each gets every Nth slot in the recommendation. A popular metric is to rank by least-aversion

Re: threshold assignment / selection

2013-01-15 Thread Sean Owen
It's fairly arbitrary. Strong positive ratings are probably more than merely above average, but you could define the threshold higher or lower if you wanted. It's a good default. On Tue, Jan 15, 2013 at 3:58 PM, Zia mel ziad.kame...@gmail.com wrote: Hi Why in recommender the threshold is

Re: Choosing precision

2013-01-15 Thread Sean Owen
Precision is not a great metric for recommenders, but it exists. There is no best value here; I would choose something that mirrors how you will use the results. If you show top 3 recs, use 3. On Tue, Jan 15, 2013 at 4:51 PM, Zia mel ziad.kame...@gmail.com wrote: Hello, If I have users that

Re: Choosing precision

2013-01-15 Thread Sean Owen
The best tests are really from real users. A/B test different recommenders and see which has better performance. That's not quite practical though. The problem is that you don't even know what the best recommendations are. Splitting the data by date is reasonable, but recent items aren't

Re: RMSRecommenderEvaluator RMSE

2013-01-15 Thread Sean Owen
You have the definition there already; what are you asking? On Jan 15, 2013 5:58 PM, Zia mel ziad.kame...@gmail.com wrote: Hi again, When evaluating preferences in recommenders and using RMSRecommenderEvaluator, is it RMSE/RMSD http://en.wikipedia.org/wiki/Root_mean_square_deviation If we
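The formula being pointed to, for predicted preferences \hat{r}_{ui} and held-out actual preferences r_{ui} over a test set T:

    \mathrm{RMSE} = \sqrt{ \frac{1}{|T|} \sum_{(u,i) \in T} (\hat{r}_{ui} - r_{ui})^2 }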

Re: Failed to create /META-INF/license file on Mac system

2013-01-15 Thread Sean Owen
http://stackoverflow.com/questions/10522835/hadoop-java-io-ioexception-mkdirs-failed-to-create-some-path On Tue, Jan 15, 2013 at 9:42 PM, Yunming Zhang zhangyunming1...@gmail.com wrote: Hi, I was trying to set up Mahout 0.8 on my Macbook Pro with OSX so I could do some local testing, I am

Re: MatrixMultiplicationJob runs with 1 mapper only ?

2013-01-15 Thread Sean Owen
It's up to Hadoop in the end. Try calling FileInputFormat.setMaxInputSplitSize() with a smallish value, like your 10MB (1000). I don't know if Hadoop params can be set as sys properties like that anyway? On Wed, Jan 16, 2013 at 7:48 AM, Stuti Awasthi stutiawas...@hcl.com wrote: Hi, I am
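A sketch of that call against the new-API FileInputFormat; the job setup details are placeholders, and the ~10MB figure just follows the suggestion above:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

    public class SmallSplitJob {
      public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "matrix-multiplication");
        FileInputFormat.addInputPath(job, new Path(args[0]));
        // Cap split size at ~10MB so Hadoop creates more, smaller map tasks.
        FileInputFormat.setMaxInputSplitSize(job, 10L * 1024 * 1024);
        // ... set mapper/reducer classes, output path, then job.waitForCompletion(true) ...
      }
    }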
