Re: seqdirectory command in MapReduce

2013-02-16 Thread Dan Filimon
Hi Claudio, Could you be more specific? What does 'MapReduce style' mean? seqdirectory should create sequence files from the documents in a folder, where the keys are the document names and the values are the documents' content. What do you need it to do? On Sat, Feb 16, 2013 at 5:55 PM,
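To make the description above concrete, here is a minimal sketch of the key/value layout Dan describes: one record per document, keyed by the document name, with the document's content as the value. It uses plain Python and the local filesystem, not Hadoop's SequenceFile format or Mahout's actual code; the function name `directory_to_pairs` and the leading-slash key format are assumptions made for illustration.

```python
import os
import tempfile


def directory_to_pairs(root):
    """Yield (document name, document text) for every file under root.

    Hypothetical helper: it only imitates the key/value layout, it does
    not write a Hadoop SequenceFile.
    """
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            # Key: path relative to the input folder, with forward slashes.
            key = "/" + os.path.relpath(path, root).replace(os.sep, "/")
            with open(path, encoding="utf-8") as handle:
                # Value: the whole document, read into memory at once.
                yield key, handle.read()


# Small usage example on a throwaway directory.
with tempfile.TemporaryDirectory() as folder:
    for filename, text in [("a.txt", "hello"), ("b.txt", "world")]:
        with open(os.path.join(folder, filename), "w", encoding="utf-8") as out:
            out.write(text)
    pairs = sorted(directory_to_pairs(folder))
```

Note that this sketch reads each document fully into memory before emitting it, so a single very large document would be a problem for it.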

Re: seqdirectory command in MapReduce

2013-02-16 Thread Claudio Reggiani
Let's say the directory holds only one big text file. Logically it's one file, but on HDFS the data is actually distributed across the cluster. Suppose now that the big text can't fit in memory (on any node of the cluster): does seqdirectory work? If so, the only way is to run seqdirectory as a MapReduce job.

Re: seqdirectory command in MapReduce

2013-02-16 Thread Steve Chien
I think he meant that the code reads and converts the files from the input directory as a standalone program, not as a map-reduce program... On Feb 16, 2013, at 11:22, Dan Filimon dangeorge.fili...@gmail.com wrote: Hi Claudio, Could you be more specific? What does 'MapReduce style' mean?

Re: seqdirectory command in MapReduce

2013-02-16 Thread Claudio Reggiani
Yes, thank you Steve. And sorry for my encoded messages Claudio 2013/2/16 Steve Chien stvch...@gmail.com I think he meant that code is reading and converting the files from the Input directory as a standalone program. Not a map-reduce program... On Feb 16, 2013, at 11:22, Dan Filimon

Problems with Mahout's RecommenderIRStatsEvaluator

2013-02-16 Thread Ahmet Ylmaz
Hi, I have looked at the internals of Mahout's RecommenderIRStatsEvaluator code. I think that there are two important problems here. According to my understanding the experimental protocol used in this code is something like this: It takes away a certain percentage of users as test users. For

Re: seqdirectory command in MapReduce

2013-02-16 Thread Dan Filimon
But why would this be a problem? As long as it's using HDFS to access the files, it should be able to fetch the chunks from wherever they might be in the cluster. I don't see why it wouldn't work. Let us know if it works! On Sat, Feb 16, 2013 at 7:38 PM, Claudio Reggiani nop...@gmail.com wrote:

Re: Problems with Mahout's RecommenderIRStatsEvaluator

2013-02-16 Thread Sean Owen
No, this is not a problem. Yes, it builds a model for each user, which takes a long time. It's accurate, but time-consuming. It's meant for small data. You could rewrite your own test to hold out data for all test users at once. That's what I did when I rewrote a lot of this just because it was

Re: Problems with Mahout's RecommenderIRStatsEvaluator

2013-02-16 Thread Ahmet Ylmaz
But modeling a user only by his/her low ratings can be problematic, since people are generally more precise (I believe) in their high ratings. Another problem is that recommender algorithms in general first mean-normalize the ratings for each user. Suppose that we have the following ratings of 3

Re: Problems with Mahout's RecommenderIRStatsEvaluator

2013-02-16 Thread Sean Owen
Yes. But: the test sample is small. Using 40% of your data to test is probably far too much. My point is that it may be the least-bad thing to do. What test are you proposing instead, and why is it coherent with what you're testing? On Sat, Feb 16, 2013 at 8:26 PM, Ahmet Ylmaz

Re: Problems with Mahout's RecommenderIRStatsEvaluator

2013-02-16 Thread Tevfik Aytekin
I think it is better to choose the test user's ratings at random. On Sat, Feb 16, 2013 at 9:37 PM, Sean Owen sro...@gmail.com wrote: Yes. But: the test sample is small. Using 40% of your data to test is probably quite too much. My point is that it may be the least-bad thing to do.

Re: Problems with Mahout's RecommenderIRStatsEvaluator

2013-02-16 Thread Sean Owen
This is a good answer for evaluation of supervised ML, but this is unsupervised. Choosing randomly is choosing the 'right answers' randomly, and that's plainly problematic. On Sat, Feb 16, 2013 at 8:53 PM, Tevfik Aytekin tevfik.ayte...@gmail.com wrote: I think, it is better to choose ratings

Re: seqdirectory command in MapReduce

2013-02-16 Thread Josh Patterson
Look at MAHOUT-833; this patch gives you this functionality. On Sat, Feb 16, 2013 at 10:55 AM, Claudio Reggiani nop...@gmail.com wrote: Hello, I have a text dataset. Running seqdirectory command on it I see it's not written in MapReduce style (looking at the source code of

Re: Problems with Mahout's RecommenderIRStatsEvaluator

2013-02-16 Thread Tevfik Aytekin
No, rating prediction is clearly a supervised ML problem. On Sat, Feb 16, 2013 at 10:15 PM, Sean Owen sro...@gmail.com wrote: This is a good answer for evaluation of supervised ML, but, this is unsupervised. Choosing randomly is choosing the 'right answers' randomly, and that's plainly

Re: Problems with Mahout's RecommenderIRStatsEvaluator

2013-02-16 Thread Sean Owen
Sure, if you were predicting ratings for one movie given a set of ratings for that movie and the ratings for many other movies. That isn't what the recommender problem is. Here, the problem is to list N movies most likely to be top-rated. The precision-recall test is, in turn, a test of top N
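As a rough illustration of a top-N precision/recall test, the sketch below scores one user's recommendation list against a set of held-out relevant items. It is a generic definition of precision and recall at N, not Mahout's RecommenderIRStatsEvaluator code; the function name and the sample item IDs are made up for the example.

```python
def precision_recall_at_n(recommended, relevant, n):
    """Return (precision, recall) of the first n recommended items.

    recommended: ranked list of item IDs, best first.
    relevant: set of item IDs held out as the 'right answers'.
    """
    top_n = recommended[:n]
    hits = sum(1 for item in top_n if item in relevant)
    # Precision: share of the top-N list that is relevant.
    precision = hits / len(top_n) if top_n else 0.0
    # Recall: share of the relevant items that made it into the top N.
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical data: only "m3" from the top 3 is among the relevant items.
recommended = ["m1", "m7", "m3", "m9", "m4"]
relevant = {"m3", "m4", "m8"}
precision, recall = precision_recall_at_n(recommended, relevant, 3)
```

How the `relevant` set is chosen is exactly the point under debate in this thread; the sketch takes it as given.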

Getting java.lang.OutOfMemoryError when running mahout in sequential mode

2013-02-16 Thread Haddad Said
Hi, I am having difficulty linking my two machines into a Hadoop cluster, so I am running Mahout jobs on a single machine, and I am running into java.lang.OutOfMemoryError issues when the input files are big (see outputs below; one is Java heap space and the other is GC overhead limit exceeded).

Re: Problems with Mahout's RecommenderIRStatsEvaluator

2013-02-16 Thread Sean Owen
If you're suggesting that you hold out only high-rated items, and then sample them, then that's what is done already in the code, except without the sampling. The sampling doesn't buy anything that I can see. If you're suggesting holding out a random subset and then throwing away the held-out

Re: Problems with Mahout's RecommenderIRStatsEvaluator

2013-02-16 Thread Tevfik Aytekin
I'm suggesting the second one. That way the test user's ratings in the training set will consist of both low- and high-rated items, which prevents the problem pointed out by Ahmet. On Sat, Feb 16, 2013 at 11:19 PM, Sean Owen sro...@gmail.com wrote: If you're suggesting that you hold out only
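The two holdout strategies being compared can be sketched as follows, as I read the thread. The first holds out a test user's high-rated items and trains on the rest; the second holds out a random subset and keeps only the high-rated part of it as test data. Defining "high-rated" by a fixed `threshold`, and both function names, are assumptions for illustration and not Mahout's API.

```python
import random


def holdout_top_rated(ratings, threshold):
    """Hold out every item rated at or above threshold; train on the rest."""
    test = {item: r for item, r in ratings.items() if r >= threshold}
    train = {item: r for item, r in ratings.items() if r < threshold}
    return train, test


def holdout_random_then_filter(ratings, fraction, threshold, rng):
    """Hold out a random subset; only its high-rated items become test data.

    Held-out items below the threshold end up in neither set.
    """
    items = sorted(ratings)
    held_out = set(rng.sample(items, int(len(items) * fraction)))
    train = {item: r for item, r in ratings.items() if item not in held_out}
    test = {item: ratings[item] for item in held_out if ratings[item] >= threshold}
    return train, test


# Hypothetical ratings for one test user on a 1-5 scale.
ratings = {"a": 5, "b": 4, "c": 2, "d": 1, "e": 3, "f": 5}
top_train, top_test = holdout_top_rated(ratings, threshold=4)
rand_train, rand_test = holdout_random_then_filter(
    ratings, fraction=0.5, threshold=4, rng=random.Random(0)
)
```

With the first strategy the training set contains only the user's lower ratings; with the second it can contain both low and high ratings, at the cost of discarding some held-out data.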

Re: Problems with Mahout's RecommenderIRStatsEvaluator

2013-02-16 Thread Sean Owen
I understand the idea, but this boils down to the current implementation, plus going back and throwing out some additional training data that is lower rated -- it's in neither test nor training. Anything's possible, but I do not imagine this is a helpful practice in general. On Sat, Feb 16, 2013

Re: Problems with Mahout's RecommenderIRStatsEvaluator

2013-02-16 Thread Ted Dunning
There are a variety of common time-based effects which make time splits best in many practical cases. Having the training data all be from the past emulates this better than random splits. For one thing, you can have the same user under different names in training and test. For another
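A time split of the kind described here can be sketched in a few lines: everything before a cutoff goes to training, everything at or after it goes to test, so the training data is all from the past. The `(user, item, rating, timestamp)` tuple layout and the sample values are assumptions for illustration.

```python
def time_split(events, cutoff):
    """Split rating events so that training data is strictly older than test data.

    events: iterable of (user, item, rating, timestamp) tuples.
    cutoff: timestamp; events at or after it are used for testing.
    """
    train = [event for event in events if event[3] < cutoff]
    test = [event for event in events if event[3] >= cutoff]
    return train, test


# Hypothetical events with integer timestamps.
events = [
    ("u1", "m1", 4.0, 100),
    ("u2", "m1", 2.0, 150),
    ("u1", "m2", 5.0, 200),
    ("u2", "m3", 3.0, 250),
]
train, test = time_split(events, cutoff=200)
```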

Re: Problems with Mahout's RecommenderIRStatsEvaluator

2013-02-16 Thread Ted Dunning
Sean, I think it is still a supervised learning problem in that there is a labeled training data set and an unlabeled test data set. Learning a ranking doesn't change the basic dichotomy between supervised and unsupervised. It just changes the desired figure of merit.

Re: Problems with Mahout's RecommenderIRStatsEvaluator

2013-02-16 Thread Sean Owen
The very question at hand is how to label the data as relevant and not relevant results. The question exists because this is not given, which is why I would not call this a supervised problem. That may just be semantics, but the point I wanted to make is that the reasons choosing a random training