Re: [ANNOUNCE] Andrew Musselman, New Mahout PMC Chair
Congrats! 2018-07-19 9:31 GMT+02:00 Peng Zhang : > Congrats Andrew! > > On Thu, Jul 19, 2018 at 04:01 Andrew Musselman > > wrote: > > > Thanks Andy, looking forward to it! Thank you too for your support and > > dedication the past two years; here's to continued progress! > > > > Best > > Andrew > > > > On Wed, Jul 18, 2018 at 1:30 PM, Andrew Palumbo > > wrote: > > > Please join me in congratulating Andrew Musselman as the new Chair of > > > the > > > Apache Mahout Project Management Committee. I would like to thank > > > Andrew > > > for stepping up, all of us who have worked with him over the years > > > know his > > > dedication to the project to be invaluable. I look forward to Andrew > > > taking the project into the future. > > > > > > Thank you, > > > > > > Andy > > >
Re: Does mahout 0.5 fit hadoop-0.20.2?
Please use a recent version of Mahout; 0.4 and 0.5 are totally outdated. -s On 06/25/2014 09:05 AM, seabiscuit08 wrote: Hi everyone, I am new to Mahout. Our Hadoop cluster runs hadoop-0.20.2. I tried out the mahout-distribution-0.4 LDA function, and it works well. But it can't infer topics for new documents once the LDA estimation is over. I heard Mahout 0.5 has such an ability, but when I try it, it can't even create a sequence file on my HDFS. Any help is appreciated!!! seabiscuit08
Re: divide a vector (sum) by a double, error
It's also not a good idea to put the vectors into a HashSet; I don't think we have equals and hashCode correctly implemented for that. On 16.06.2014 18:21, Ted Dunning ted.dunn...@gmail.com wrote: Patrice, This sounds like a classpath problem more than a code error. Are you sure that you can run any program that uses Mahout? Do you perhaps have two versions of Mahout floating around? Regarding the code, this is a more compact idiom for the same thing: Matrix m = ...; Vector centroid = m.aggregateColumns(new VectorFunction() { @Override public double apply(Vector f) { return f.zSum() / f.size(); } }); This uses a matrix as a container for vectors rather than a set of Vectors. If you really want to use a set, then your iteration-based approach should be fine. In your code, you could also be much tighter. For instance, the last three lines could simply be: return sum.divide(vectors.size()); None of the stuff with the Integer or casting is necessary. On Mon, Jun 16, 2014 at 9:01 AM, Patrice Seyed apse...@gmail.com wrote: Hi all, I have attempted to write a method centroid() that 1) sums a HashSet of org.apache.mahout.math.Vector (vectors that are DenseVector), and 2) (org.apache.mahout.math.Vector.divide) divides the summed vector by its size, as a double. I get an error: Exception in thread "main" java.lang.IncompatibleClassChangeError: class org.apache.mahout.math.function.Functions$1 has interface org.apache.mahout.math.function.DoubleFunction as super class I've tried this with a set of DenseVector and SequentialAccessSparseVector with the same result. Any help appreciated, the actual method is below. I noticed a class Centroid in the mahout distribution, but it seems to cover a different sense of centroid than the one I'm implementing here. Thanks, Patrice

public Vector centroid(HashSet<Vector> vectors) {
  Iterator<Vector> it = vectors.iterator();
  Vector sum = it.next();
  while (it.hasNext()) {
    Vector aVector = it.next();
    sum = sum.plus(aVector);
    System.out.println(sum.toString());
  }
  Integer totalVectors = vectors.size();
  double dlTotalVectors = totalVectors.doubleValue();
  return sum.divide(dlTotalVectors);
}
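For archive readers, here is Ted's matrix idiom as a complete, runnable sketch, assuming mahout-math (0.9-era API) on the classpath; the sample values are made up:

import org.apache.mahout.math.DenseMatrix;
import org.apache.mahout.math.Matrix;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.function.VectorFunction;

public class CentroidExample {

  public static void main(String[] args) {
    // Each row is one point; the centroid is the column-wise mean.
    Matrix m = new DenseMatrix(new double[][] {
        {1.0, 2.0},
        {3.0, 4.0},
        {5.0, 6.0}
    });

    Vector centroid = m.aggregateColumns(new VectorFunction() {
      @Override
      public double apply(Vector column) {
        return column.zSum() / column.size();
      }
    });

    System.out.println(centroid); // expected column means: 3.0 and 4.0
  }
}

Because aggregateColumns invokes the function once per column, each entry of the result is the mean of one coordinate across all points, which is exactly the centroid.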
Re: Performance issues in Mahout recommendations
You should not use Hadoop for such a tiny dataset. Use the GenericItemBasedRecommender on a single machine in Java. --sebastian On 06/06/2014 11:10 AM, Warunika Ranaweera wrote: Hi, I am using Mahout's recommenditembased algorithm on a data set with nearly 10,000 (implicit) user ratings. This is the command I used: *mahout recommenditembased --input ratings.csv --output recommendation --usersFile users.dat --tempDir temp --similarityClassname SIMILARITY_LOGLIKELIHOOD --numRecommendations 3 * Although the output is successfully generated, this process takes nearly 7 minutes to produce recommendations for a single user. The Hadoop cluster has 8 nodes and the machine on which Mahout is invoked is an AWS EC2 c3.2xlarge server. When I tracked the mapreduce jobs, I noticed that more than one machine is *not* utilized at a time, and the *recommenditembased* command takes 9 mapreduce jobs altogether with approx. 45 seconds taken per job. Since the performance is too slow for real-time recommendations, it would be really helpful to know whether I'm missing out on any additional commands or configurations that enable faster performance. Thanks, Warunika
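A minimal single-machine version of what Sebastian suggests, sketched against the Mahout 0.9 Taste API (the file name and user id are placeholders):

import java.io.File;
import java.util.List;

import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.recommender.GenericItemBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.LogLikelihoodSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.similarity.ItemSimilarity;

public class SingleMachineRecommender {

  public static void main(String[] args) throws Exception {
    // A few million ratings fit comfortably in memory on one machine.
    DataModel model = new FileDataModel(new File("ratings.csv"));
    ItemSimilarity similarity = new LogLikelihoodSimilarity(model);
    GenericItemBasedRecommender recommender =
        new GenericItemBasedRecommender(model, similarity);

    // Top-3 recommendations for user 42, with no MapReduce job startup cost.
    List<RecommendedItem> items = recommender.recommend(42L, 3);
    for (RecommendedItem item : items) {
      System.out.println(item.getItemID() + " : " + item.getValue());
    }
  }
}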
Re: Performance issues in Mahout recommendations
1M ratings take up something like 20 megabytes. This is a data size where it does not make any sense to use Hadoop. Just try the single-machine implementation. --sebastian On 06/06/2014 12:01 PM, Warunika Ranaweera wrote: Hi Sebastian, Thanks for your prompt response. It's just a sample data set from our database and it may expand up to 6 million ratings. Since the performance was low for a smaller data set, I thought it would be even worse for a larger data set. As per your suggestion, I also applied the same command on 1 million user ratings for approx. 6000 users and got the same performance level. What is the average running time for the Mahout distributed recommendation job on 1 million ratings? Does it usually take more than 1 minute? Thanks in advance, Warunika On Fri, Jun 6, 2014 at 2:42 PM, Sebastian Schelter s...@apache.org wrote: You should not use Hadoop for such a tiny dataset. Use the GenericItemBasedRecommender on a single machine in Java. --sebastian On 06/06/2014 11:10 AM, Warunika Ranaweera wrote: Hi, I am using Mahout's recommenditembased algorithm on a data set with nearly 10,000 (implicit) user ratings. This is the command I used: *mahout recommenditembased --input ratings.csv --output recommendation --usersFile users.dat --tempDir temp --similarityClassname SIMILARITY_LOGLIKELIHOOD --numRecommendations 3 * Although the output is successfully generated, this process takes nearly 7 minutes to produce recommendations for a single user. The Hadoop cluster has 8 nodes and the machine on which Mahout is invoked is an AWS EC2 c3.2xlarge server. When I tracked the mapreduce jobs, I noticed that more than one machine is *not* utilized at a time, and the *recommenditembased* command takes 9 mapreduce jobs altogether with approx. 45 seconds taken per job. Since the performance is too slow for real-time recommendations, it would be really helpful to know whether I'm missing out on any additional commands or configurations that enable faster performance. Thanks, Warunika
Re: Performance issues in Mahout recommendations
Mahout has single-machine and distributed recommenders. On 06/06/2014 02:31 PM, Warunika Ranaweera wrote: I agree with your suggestion though. I have already implemented a Java recommender and it performed better. But, due to scalability problems that are predicted to occur in the future, we thought of moving to Mahout. However, it seems like, for now, it's better to go with the single-machine implementation. Thanks for your suggestions, Warunika On Fri, Jun 6, 2014 at 3:36 PM, Sebastian Schelter s...@apache.org wrote: 1M ratings take up something like 20 megabytes. This is a data size where it does not make any sense to use Hadoop. Just try the single-machine implementation. --sebastian On 06/06/2014 12:01 PM, Warunika Ranaweera wrote: Hi Sebastian, Thanks for your prompt response. It's just a sample data set from our database and it may expand up to 6 million ratings. Since the performance was low for a smaller data set, I thought it would be even worse for a larger data set. As per your suggestion, I also applied the same command on 1 million user ratings for approx. 6000 users and got the same performance level. What is the average running time for the Mahout distributed recommendation job on 1 million ratings? Does it usually take more than 1 minute? Thanks in advance, Warunika On Fri, Jun 6, 2014 at 2:42 PM, Sebastian Schelter s...@apache.org wrote: You should not use Hadoop for such a tiny dataset. Use the GenericItemBasedRecommender on a single machine in Java. --sebastian On 06/06/2014 11:10 AM, Warunika Ranaweera wrote: Hi, I am using Mahout's recommenditembased algorithm on a data set with nearly 10,000 (implicit) user ratings. This is the command I used: *mahout recommenditembased --input ratings.csv --output recommendation --usersFile users.dat --tempDir temp --similarityClassname SIMILARITY_LOGLIKELIHOOD --numRecommendations 3 * Although the output is successfully generated, this process takes nearly 7 minutes to produce recommendations for a single user. The Hadoop cluster has 8 nodes and the machine on which Mahout is invoked is an AWS EC2 c3.2xlarge server. When I tracked the mapreduce jobs, I noticed that more than one machine is *not* utilized at a time, and the *recommenditembased* command takes 9 mapreduce jobs altogether with approx. 45 seconds taken per job. Since the performance is too slow for real-time recommendations, it would be really helpful to know whether I'm missing out on any additional commands or configurations that enable faster performance. Thanks, Warunika
Re: Indicator Matrix and Mahout + Solr recommender
I have added the threshold merely as a way to increase the performance of RowSimilarityJob. If a threshold is given, some item pairs don't need to be looked at. A simple example: if you use cooccurrence count as the similarity measure and set a threshold of n cooccurrences, then any pair containing an item with fewer than n interactions can be ignored (such an item can co-occur at most as many times as it has interactions, so it can never reach the threshold). IIRC similar techniques are implemented for cosine and Jaccard. Best, Sebastian On 05/27/2014 07:08 PM, Pat Ferrel wrote: On May 27, 2014, at 8:15 AM, Ted Dunning ted.dunn...@gmail.com wrote: The threshold should not normally be used in the Mahout+Solr deployment style. Understood, and that's why an alternative way of specifying a cutoff may be a good idea. This need is better supported by specifying the maximum number of indicators. This is mathematically equivalent to specifying a fraction of values, but is more meaningful to users since good values for this number are pretty consistent across different uses (50-100 are reasonable values for most needs; larger values are quite plausible). Assume you mean 50-100 as the average number per item. The total for the entire indicator matrix is what Ken was asking for. But I was thinking about the use with itemsimilarity, where the user may not know the dimensionality, since itemsimilarity assembles the matrix from individual prefs. The user probably knows the number of items in their catalog but the indicator matrix dimensionality is arbitrarily smaller. Currently the help reads: --maxSimilaritiesPerItem (-m) maxSimilaritiesPerItem: try to cap the number of similar items per item to this number (default: 100) If this were actually the average # per item it would do what you describe, but it looks like it's a literal cutoff per vector in the code. A cutoff based on the highest scores in the entire matrix seems to imply a sort when the total is larger than the average would allow, and I don't see an obvious sort being done in the MR. Anyway, it looks like we could do this by 1) total number of values in the matrix (what Ken was asking for). This requires that the user know the dimensionality of the indicator matrix to be very useful. 2) average number per item (what Ted describes). This seems the most intuitive and does not require the dimensionality be known. 3) fraction of the values. This might be useful if you are more interested in downsampling by score; at least it seems more useful than --threshold as it is today, but maybe I'm missing some use cases? Is there really a need for a hard score threshold? On Tue, May 27, 2014 at 8:08 AM, Pat Ferrel pat.fer...@gmail.com wrote: I was talking with Ken Krugler off list about the Mahout + Solr recommender and he had an interesting request. When calculating the indicator/item similarity matrix using ItemSimilarityJob there is a --threshold option. Wouldn't it be better to have an option that specified the fraction of values kept in the entire matrix based on their similarity strength? This is very difficult to do with --threshold. It would be like expressing the threshold as a fraction of the total number of values rather than a strength value. Seems like this would have the effect of tossing the least interesting similarities, where limiting per item (--maxSimilaritiesPerItem) could easily toss some of the most interesting. At the very least it seems like a better way of expressing the threshold, doesn't it?
Re: Theory behind LogisticRegression in Mahout
We should add these links to the LR page on the website. --s On 05/23/2014 03:20 PM, Ted Dunning wrote: Ahh... my error then. Happily, Dmitriy and others have provided the requisite links. On Thu, May 22, 2014 at 11:50 PM, namit maheshwari namitmaheshwa...@gmail.com wrote: No, I didn't find any links in the comments. On Fri, May 23, 2014 at 2:44 AM, Ted Dunning ted.dunn...@gmail.com wrote: I thought that there were links in the comments to documentation. Are there not? Sent from my iPhone On May 22, 2014, at 2:29, namit maheshwari namitmaheshwa...@gmail.com wrote: Hello Everyone, Could anyone please let me know the algorithm used behind LogisticRegression in Mahout? Also, AdaptiveLogisticRegression mentions an *annealing* schedule. I would be grateful if someone could guide me towards the theory behind it. Thanks Namit
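For later readers: Mahout's sequential logistic regression is trained with stochastic gradient descent, and the learning rate decays over time, which is the annealing the question refers to. A minimal training sketch against the 0.9-era SGD API (the hyperparameters and feature vectors are made up for illustration):

import org.apache.mahout.classifier.sgd.L1;
import org.apache.mahout.classifier.sgd.OnlineLogisticRegression;
import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.Vector;

public class LrSketch {

  public static void main(String[] args) {
    // 2 classes, 3 features, L1 prior; the learning rate anneals per step.
    OnlineLogisticRegression lr =
        new OnlineLogisticRegression(2, 3, new L1()).lambda(1e-4).learningRate(1);

    Vector positive = new DenseVector(new double[] {1, 0, 1});
    Vector negative = new DenseVector(new double[] {0, 1, 0});

    for (int i = 0; i < 100; i++) {
      lr.train(1, positive); // true class 1
      lr.train(0, negative); // true class 0
    }

    // Estimated probability of class 1 for an unseen example.
    System.out.println(lr.classifyScalar(new DenseVector(new double[] {1, 0, 0})));
  }
}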
Re: Setting mahout heapsize for rowsimilarity job
I don't think you should use RowSimilarityJob for that case if you only have 6 columns. Can you tell us a little bit about the data and what problem you are trying to solve? --sebastian On 05/23/2014 09:03 PM, Suneel Marthi wrote: I had seen this issue too with RSJ until 0.8. Switch to using Mahout 0.9; downsampling was introduced in RSJ, which should avoid this error. On Fri, May 23, 2014 at 2:59 PM, Mohit Singh mohit1...@gmail.com wrote: Hi, I have a 1M x 6 dimensional matrix stored as a sequence file and I am trying to use rowSimilarity for this job... But when I try to run the job, I see a Java heap space error for the second step (RowSimilarityJob-CooccurrencesMapper-Reducer). My raw sequence file is around 700MB and I have already set MAHOUT_OPTS to (say) 7gb, but I am still seeing that error. My command line args are: hadoop jar /usr/lib/mahout/mahout-examples-0.8-cdh5.0.0-job.jar org.apache.mahout.math.hadoop.similarity.cooccurrence.RowSimilarityJob -i $INPUT -o $OUTPUT *-r 6 *-s SIMILARITY_COSINE -m 15 --tempDir $TEMP -ess Also, is this -r a typo? The help file says that this is the column length. Is it the column or row dimension? Thanks -- Mohit When you want success as badly as you want the air, then you will get it. There is no other secret of success. -Socrates
Re: Mahout recommendation in implicit feedback situation
Alessandro, which version of Mahout are you using? I had a look at the current implementation of GenericBooleanPrefUserBasedRecommender and its doEstimatePreference method returns the sum of similarities of users that have also interacted with the item. So that should be different from either 0 or 1. --sebastian On 05/03/2014 05:00 PM, Alessandro Suglia wrote: Sorry Sebastian, maybe you didn't have the chance to read the post on SO, so I'll report the code here. I've already used the GenericBooleanPrefUserBasedRecommender in order to generate the recommendations and the results are the same.

DataModel trainModel = new FileDataModel(new File(String.valueOf(Main.class.getResource("/binarized/u1.base").getFile())));
DataModel testModel = new FileDataModel(new File(String.valueOf(Main.class.getResource("/binarized/u1.test").getFile())));
UserSimilarity similarity = new TanimotoCoefficientSimilarity(trainModel);
UserNeighborhood neighborhood = new NearestNUserNeighborhood(35, similarity, trainModel);
GenericBooleanPrefUserBasedRecommender userBased = new GenericBooleanPrefUserBasedRecommender(trainModel, neighborhood, similarity);
long firstUser = testModel.getUserIDs().nextLong(); // get the first user
// try to recommend items for the first user
for (LongPrimitiveIterator iterItem = testModel.getItemIDsFromUser(firstUser).iterator(); iterItem.hasNext(); ) {
  long currItem = iterItem.nextLong();
  // estimate the preference for the current item for the first user
  System.out.println("Estimated preference for item " + currItem + " is " + userBased.estimatePreference(firstUser, currItem));
}

Can you explain to me where the error in this code is? Thank you. On 05/03/14 16:42, Sebastian Schelter wrote: You should try the org.apache.mahout.cf.taste.impl.recommender.GenericBooleanPrefUserBasedRecommender which has been built to handle such data. Best, Sebastian On 05/03/2014 04:34 PM, Alessandro Suglia wrote: I have described it in the SO post: When I execute this code, the result is a list of 0.0 or 1.0, which is not useful in the context of top-n recommendation with implicit feedback. Simply because I have to obtain, for each item, an estimated rating in the range [0, 1] in order to rank the list in decreasing order and construct the top-n recommendation appropriately. On 05/03/14 16:25, Sebastian Schelter wrote: Hi Alessandro, what result do you expect and what do you get? Can you give a concrete example? --sebastian On 05/03/2014 12:11 PM, Alessandro Suglia wrote: Good morning, I've tried to create a recommender system using Mahout in an implicit feedback situation. What I'm trying to do is explained exactly in this post on stack overflow: http://stackoverflow.com/questions/23077735/mahout-recommendation-in-implicit-feedback-situation As you can see, I'm having some problems with it, simply because I cannot get the result that I expect (a value between 0 and 1) when I try to predict a score for a specific item. Can someone here help me, please? Thank you in advance. Alessandro Suglia
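A side note for archive readers: for a top-n list you usually don't need estimatePreference at all; recommend() already returns items ranked by the internal score. A sketch continuing Alessandro's snippet above (userBased and firstUser as defined there; the score is a similarity sum, so it ranks fine even though it is not bounded to [0, 1]):

import java.util.List;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;

// Already ranked in decreasing order of the internal score.
List<RecommendedItem> topN = userBased.recommend(firstUser, 10);
for (RecommendedItem item : topN) {
  System.out.println(item.getItemID() + " scored " + item.getValue());
}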
Re: Fwd: Mahout Naive Bayes CSV Classification
Hi Jossef, You have to vectorize and normalize your data. The input for naive bayes is a sequence file containing a Text object as key (your label) and a VectorWritable that holds a vector with the data. Instructions to run NaiveBayes can be found here: https://mahout.apache.org/users/classification/bayesian.html --sebastian On 05/03/2014 07:40 PM, Jossef Harush wrote: I have these 2 CSV files: 1. train-set.csv 2. test-set.csv Both of them have the same structure (with different content), similar to this example (http://i.stack.imgur.com/jsckr.png): each column is a feature and the last column, class, is the name of the class to predict. *Can anyone please provide a sample code for:* 1. Initializing Naive Bayes with a CSV file (model creation, training, required pre-processing, etc...) 2. For a given CSV row - predicting a class Thanks! BTW - I'm using Mahout 0.9 and Hadoop 2.4 and I've already tried to follow these links: http://web.archiveorange.com/archive/v/y0uRZw9Q4iHdjrm4Rfsu http://chimpler.wordpress.com/2013/03/13/using-the-mahout-naive-bayes-classifier-to-automatically-classify-twitter-messages/
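A sketch of the vectorization step Sebastian describes, against the Hadoop 2.x and mahout-math 0.9 APIs (the output path, label, and feature values are placeholders; parsing the actual CSV is omitted):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.VectorWritable;

public class CsvToVectors {

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path out = new Path("train-vectors/part-m-00000"); // placeholder path

    SequenceFile.Writer writer =
        SequenceFile.createWriter(fs, conf, out, Text.class, VectorWritable.class);
    try {
      // One record per CSV row: the class label as the Text key and the
      // feature columns as the vector. trainnb extracts the label from the
      // key, so check which key format your Mahout version expects.
      String label = "someClass";          // the last CSV column
      double[] features = {1.0, 0.0, 2.5}; // the remaining columns
      writer.append(new Text(label), new VectorWritable(new DenseVector(features)));
    } finally {
      writer.close();
    }
  }
}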
Re: Mahout recommendation in implicit feedback situation
Hi Alessandro, what result do you expect and what do you get? Can you give a concrete example? --sebastian On 05/03/2014 12:11 PM, Alessandro Suglia wrote: Good morning, I've tried to create a recommender system using Mahout in an implicit feedback situation. What I'm trying to do is explained exactly in this post on stack overflow: http://stackoverflow.com/questions/23077735/mahout-recommendation-in-implicit-feedback-situation As you can see, I'm having some problems with it, simply because I cannot get the result that I expect (a value between 0 and 1) when I try to predict a score for a specific item. Can someone here help me, please? Thank you in advance. Alessandro Suglia
Re: Mahout recommendation in implicit feedback situation
You should try the org.apache.mahout.cf.taste.impl.recommender.GenericBooleanPrefUserBasedRecommender which has been built to handle such data. Best, Sebastian On 05/03/2014 04:34 PM, Alessandro Suglia wrote: I have described it in the SO post: When I execute this code, the result is a list of 0.0 or 1.0, which is not useful in the context of top-n recommendation with implicit feedback. Simply because I have to obtain, for each item, an estimated rating in the range [0, 1] in order to rank the list in decreasing order and construct the top-n recommendation appropriately. On 05/03/14 16:25, Sebastian Schelter wrote: Hi Alessandro, what result do you expect and what do you get? Can you give a concrete example? --sebastian On 05/03/2014 12:11 PM, Alessandro Suglia wrote: Good morning, I've tried to create a recommender system using Mahout in an implicit feedback situation. What I'm trying to do is explained exactly in this post on stack overflow: http://stackoverflow.com/questions/23077735/mahout-recommendation-in-implicit-feedback-situation As you can see, I'm having some problems with it, simply because I cannot get the result that I expect (a value between 0 and 1) when I try to predict a score for a specific item. Can someone here help me, please? Thank you in advance. Alessandro Suglia
Re: Future of Frequent Pattern Mining
I don't think we have to extract the code; people can pull it out of the 0.9 release sources, which are in svn. We have not heard any opposition from production users of this code here, nor has someone stepped up to maintain this code (and we've asked for the second time), so let's finish what we already aimed for in the 0.9 release and remove it. I'll prepare a patch. --sebastian On 04/28/2014 10:52 AM, Ted Dunning wrote: One thought is to extract the code, publish on github with warnings about no support. Then if there are requests, we can point them to the GH archive and tell them to go for it. On Mon, Apr 28, 2014 at 10:03 AM, Suneel Marthi smar...@apache.org wrote: +100 to purging this from the codebase. This stuff uses the old MR api and would have to be upgraded, not to mention that this was removed from 0.9 and was restored only because one user wanted it, who promised to maintain it and has not been heard from. On Mon, Apr 28, 2014 at 2:19 AM, Sebastian Schelter s...@apache.org wrote: Hi, I'm resending this mail to also include the users list. To wrap up: We currently have a discussion whether our frequent pattern mining package should stay in the codebase. The original author suggested to remove the original implementation and maybe retain the FPGrowth2 implementation. I'd like to ask our users here for their opinion: is anybody opposed to removing the frequent pattern mining code from Mahout? Please shout out. --sebastian
Future of Frequent Pattern Mining
Hi, I'm resending this mail to also include the users list. To wrap up: We currently have a discussion whether our frequent pattern mining package should stay in the codebase. The original author suggested to remove the original implementation and maybe retain the FPGrowth2 implementation. I'd like to ask our users here for their opinion: is anybody opposed to removing the frequent pattern mining code from Mahout? Please shout out. --sebastian
Re: Future of Frequent Pattern Mining
Hi Michael, the problem is that currently nobody is maintaining the fpgrowth code anymore or working on documentation for it; that's why we consider it to be a candidate for removal. I don't see much value in keeping algorithms in the codebase if nobody is maintaining them, answering questions and providing documentation. If someone opposes here who has that code in production, that could be a reason to retain it, however. People wanting to use the code in the future can always download Mahout 0.9, which has the current implementation. --sebastian On 04/28/2014 08:23 AM, Michael Wechner wrote: what is the alternative, and if one would still want to use the frequent pattern mining code in the future, how would this be possible otherwise? Thanks Michael On 28.04.14 08:19, Sebastian Schelter wrote: Hi, I'm resending this mail to also include the users list. To wrap up: We currently have a discussion whether our frequent pattern mining package should stay in the codebase. The original author suggested to remove the original implementation and maybe retain the FPGrowth2 implementation. I'd like to ask our users here for their opinion: is anybody opposed to removing the frequent pattern mining code from Mahout? Please shout out. --sebastian
Re: Reading the wiki
Would someone be willing to open a jira ticket for this issue and fix the problem? --sebastian On 04/28/2014 01:05 AM, Ted Dunning wrote: Mathjax is both static content and a server. There is an FAQ about this https problem. I think that part of the issue is that they don't use the same URL for both http and https connections. http://www.mathjax.org/resources/faqs/#problem-https The URL that they suggest to use for getting mathjax.js is https://c328740.ssl.cf1.rackcdn.com/mathjax/latest/MathJax.js This is required because the rackspace cdn won't allow the http address to be used with https. Perversely, this https address also breaks when used with http. My guess is that if we update our css/headers/templates to use this https address then things will work. On Sun, Apr 27, 2014 at 11:59 PM, Dmitriy Lyubimov dlie...@gmail.com wrote: i think we would have to host mathjax to appease the browsers under an https handshake. I am not sure what would be associated with that; I am not sure if mathjax is solely static content or an actual server doing something. On Sun, Apr 27, 2014 at 12:41 AM, Sebastian Schelter s...@apache.org wrote: What if we store a copy of the js file on our site and also serve it via https? On 04/27/2014 05:34 AM, Pat Ferrel wrote: Often CMSs have a way to configure https access to be used only for password or other secure areas of the site. No idea if the Apache CMS does this but worth asking. If there is no https fix, it seems like Mathjax should be discontinued. On Apr 26, 2014, at 8:03 PM, Dmitriy Lyubimov dlie...@gmail.com wrote: I have no solution for https. It is most likely a security thing. I just asked that whoever writes the blog fix https links to simple unsecure ones. On Apr 26, 2014 6:24 PM, Andrew Musselman andrew.mussel...@gmail.com wrote: There was chat last week about this breaking, something about https vs http in the link to Mathjax as I recall. Dmitriy was dealing with it last I saw. On Apr 26, 2014, at 6:04 PM, Pat Ferrel p...@occamsmachete.com wrote: I probably missed some announcement but why is the math markup coming out raw? Do I need a plugin or something? \[\mathbf{G}=\mathbf{B}\mathbf{B}^{\top}-\mathbf{C}-\mathbf{C}^{\top}+\mathbf{s}_{q}\mathbf{s}_{q}^{\top}\boldsymbol{\xi}^{\top}\boldsymbol{\xi}\]
Re: Future of Frequent Pattern Mining
I'm very much in favor of this idea. On 04/28/2014 10:52 AM, Ted Dunning wrote: One thought is to extract the code, publish on github with warnings about no support. Then if there are requests, we can point them to the GH archive and tell them to go for it. On Mon, Apr 28, 2014 at 10:03 AM, Suneel Marthi smar...@apache.org wrote: +100 to purging this from the codebase. This stuff uses the old MR api and would have to be upgraded, not to mention that this was removed from 0.9 and was restored only because one user wanted it, who promised to maintain it and has not been heard from. On Mon, Apr 28, 2014 at 2:19 AM, Sebastian Schelter s...@apache.org wrote: Hi, I'm resending this mail to also include the users list. To wrap up: We currently have a discussion whether our frequent pattern mining package should stay in the codebase. The original author suggested to remove the original implementation and maybe retain the FPGrowth2 implementation. I'd like to ask our users here for their opinion: is anybody opposed to removing the frequent pattern mining code from Mahout? Please shout out. --sebastian
Re: Reading the wiki
What if we store a copy of the js file on our site and also serve it via https? On 04/27/2014 05:34 AM, Pat Ferrel wrote: Often CMSs have a way to configure https access to be used only for password or other secure areas of the site. No idea if the Apache CMS does this but worth asking. If there is no https fix, it seems like Mathjax should be discontinued. On Apr 26, 2014, at 8:03 PM, Dmitriy Lyubimov dlie...@gmail.com wrote: I have no solution for https. It is most likely a security thing. I just asked that whoever writes the blog fix https links to simple unsecure ones. On Apr 26, 2014 6:24 PM, Andrew Musselman andrew.mussel...@gmail.com wrote: There was chat last week about this breaking, something about https vs http in the link to Mathjax as I recall. Dmitriy was dealing with it last I saw. On Apr 26, 2014, at 6:04 PM, Pat Ferrel p...@occamsmachete.com wrote: I probably missed some announcement but why is the math markup coming out raw? Do I need a plugin or something? \[\mathbf{G}=\mathbf{B}\mathbf{B}^{\top}-\mathbf{C}-\mathbf{C}^{\top}+\mathbf{s}_{q}\mathbf{s}_{q}^{\top}\boldsymbol{\xi}^{\top}\boldsymbol{\xi}\]
Welcome Pat Ferrel as new committer on Mahout
Hi, this is to announce that the Project Management Committee (PMC) for Apache Mahout has asked Pat Ferrel to become a committer, and we are pleased to announce that he has accepted. Being a committer enables easier contribution to the project, since in addition to posting patches on JIRA it also gives write access to the code repository. That also means that we now have yet another person who can commit patches submitted by others to our repo *wink* Pat, we look forward to working with you in the future. Welcome! It would be great if you could introduce yourself with a few words. -s
Re: Spark Mahout with a CLI?
I'll create a jira ticket for this, as I have a little time to work on it. On 04/16/2014 08:15 PM, Pat Ferrel wrote: bug in the pseudocode, should use columnIds: val hashedCrossIndicatorMatrix = new HashedSparseMatrix(indicatorMatrices(1), hashedDrms(0).columnIds(), hashedDrms(1).columnIds()) RecommendationExamplesHelper.saveIndicatorMatrix(hashedCrossIndicatorMatrix, "hdfs://some/path/for/output") On Apr 16, 2014, at 10:00 AM, Pat Ferrel p...@occamsmachete.com wrote: Great, and an excellent example is at hand. In it I will play the user and contributor role, Sebastian and Dmitriy the committer/scientist role. I have a web site that uses a Mahout+Solr recommender (the video recommender demo site). This creates logfiles of the form timestamp, userId, itemId, action: timestamp1, userIdString1, itemIdString1, "view" timestamp2, userIdString2, itemIdString1, "like" These are currently processed using the Solr-recommender example code and Hadoop Mahout. The input is split and accumulated into two matrices, which could then be input to the new Spark cooccurrence analysis code (see the patch here: https://issues.apache.org/jira/browse/MAHOUT-1464): val indicatorMatrices = cooccurrences(drmB, randomSeed = 0xdeadbeef, maxInterestingItemsPerThing = 100, maxNumInteractions = 500, Array(drmA)) What I propose to do is replace my Hadoop Mahout impl by creating a new Scala (or maybe Java) class, call it HashedSparseMatrix for now. There will be a CLI-accessible job that takes the above logfile input and creates a HashedSparseMatrix. Inside the HashedSparseMatrix will be a drm SparseMatrix and two hashed dictionaries for row and column external-Id-to-Mahout-Id lookup. The 'cooccurrences' call would be identical and the data it deals with would also be identical. But the HashedSparseMatrix would be able to deliver two dictionaries, which store the dimension lengths and are used to look up string Ids from internal Mahout ordinal integer Ids. These could be created with a helper function to read from logfiles: val hashedDrms = readHashedSparseMatrices("hdfs://path/to/input/logfiles", "^actions-.*", "\t", 1, 2, "like", "view") Here hashedDrms(0) is a HashedSparseMatrix corresponding to drmA, (1) = drmB. When the output is written to a text file it will be creating a new HashedSparseMatrix from the cooccurrences indicator matrix and the original itemId dictionaries: val hashedCrossIndicatorMatrix = new HashedSparseMatrix(indicatorMatrices(1), hashedDrms(0).rowIds(), hashedDrms(1).rowIds()) RecommendationExamplesHelper.saveIndicatorMatrix(hashedCrossIndicatorMatrix, "hdfs://some/path/for/output") Here the two Id dictionaries are used to create output file(s) with external Ids. Since I already have to do this for the demo site using Hadoop Mahout, I'll have to create a Spark impl of the wrapper for the new cross-cooccurrence indicator matrix. And since my scripting/web app language is not Scala, the format of the output needs to be text. I think this meets all issues raised here. No unnecessary import/export. Dmitriy doesn't need to write a CLI. Sebastian doesn't need to write a HashedSparseMatrix. The internal calculations are done on RDDs and the drms are never written to disk. AND the logfiles can be consumed directly, producing data that any language can consume directly, with external Ids used and preserved. BTW: in the MAHOUT-1464 example the drms are read in serially, single threaded, but written out using Spark (unless I missed something). In the proposed impl the read and write would be Sparkified.
BTW2: Since this is a CLI interface to Spark Mahout, it can be scheduled using cron directly, with no additional processing pipeline, and by people unfamiliar with Scala, the Spark shell, or internal Mahout Ids. Just as is done now on the demo site, but with a lot of non-Mahout code. BTW3: This type of thing IMO must be done for any Mahout job we want to be widely used. Otherwise we leave all of this wrapper code to be duplicated over and over again by users and expect them to know too much about Spark Mahout internals. On Apr 15, 2014, at 6:45 PM, Ted Dunning ted.dunn...@gmail.com wrote: Well... I think it is an issue that has to do with figuring out how to *avoid* import and export as much as possible. On Tue, Apr 15, 2014 at 6:36 PM, Pat Ferrel p...@occamsmachete.com wrote: Which is why it's an import/export issue. On Apr 15, 2014, at 5:48 PM, Ted Dunning ted.dunn...@gmail.com wrote: On Tue, Apr 15, 2014 at 10:58 AM, Pat Ferrel p...@occamsmachete.com wrote: As to the statement "There is not, nor do i think there will be a way to run this stuff with CLI" -- that seems unduly misleading. Really, does anyone second this? There will be Scala scripts to drive this stuff, and yes, even from the CLI. Do you imagine that every Mahout USER will be a Scala + Mahout DSL programmer? That may be fine for committers, but users will be PHP devs, Ruby
Re: org.apache.mahout.math.IndexException
Yes, it should give you the necessary information. The important part is this: Apply the patch with patch -p 0 -i <path to patch> Throw a --dry-run on there if you want to see what happens without screwing up your checkout. On 04/20/2014 09:47 PM, Mario Levitin wrote: Thanks Sebastian, I have not applied a patch before. I found the following page: http://mahout.apache.org/developers/patch-check-list.html Is that description enough for applying a patch? On Sat, Apr 19, 2014 at 2:23 AM, Sebastian Schelter s...@apache.org wrote: Mario, could you check whether the patch from https://issues.apache.org/jira/browse/MAHOUT-1517 fixes your problem? Best, Sebastian On 04/18/2014 11:03 PM, Mario Levitin wrote: In my dataset IDs are strings, so I use MemoryIDMigrator. This migrator produces large longs. I'm not doing any translation. I could not understand why there is a cast to int in the Mahout code. This will produce errors for large long values. On Fri, Apr 18, 2014 at 8:06 PM, Ted Dunning ted.dunn...@gmail.com wrote: Are you translating the IDs down into a range that will fit into ints? On Thu, Apr 17, 2014 at 3:02 PM, Mario Levitin mariolevi...@gmail.com wrote: Hi, I'm trying to run the ALS algorithm. However, I get the following error: Exception in thread "pool-1-thread-3" org.apache.mahout.math.IndexException: Index -691877539 is outside allowable range of [0,2147483647) at org.apache.mahout.math.AbstractVector.set(AbstractVector.java:395) at org.apache.mahout.cf.taste.impl.recommender.svd.ALSWRFactorizer.sparseUserRatingVector(ALSWRFactorizer.java:305) At line 305 in ALSWRFactorizer.java, there is the following code: ratings.set((int) preference.getItemID(), preference.getValue()); My suspicion is that the error results from the cast to int in the above line. Item IDs in Mahout are long, so if you cast a long (which does not fit into an int) then you will get negative numbers and hence the error. However, this explanation also seems implausible to me, since I don't think such an error exists in Mahout code. Any help will be appreciated. Thanks
Re: simple idea for improving mahout docs over the next month?
Hm, I'm not so sure whether introducing another source for documentation besides the webpage would be so helpful (there's still lots of work to do on the website...), how do others see this? --sebastian On 04/17/2014 05:06 PM, Jay Vyas wrote: Hi Sebastian: theoretically, one could extract all the information from a mailing list search, but I think a rolling FAQ would much more (1) be likely to evolve into real documentation and (2) be more easily refined. Is that a little convincing? If not, I guess we can table the idea... just a thought. On Thu, Apr 17, 2014 at 1:38 AM, Sebastian Schelter s...@apache.org wrote: Hi Jay, I'm not sure what the benefit of this approach is; people can already post their questions to the mailing list and get answers here, why would a google doc be helpful? --sebastian On 04/16/2014 09:31 PM, Jay Vyas wrote: hi mahout... i finally thought of a really easy way of ad-hoc improvement of mahout docs, that can feed into the efforts to get formal docs improved. Any interest in creating a shared mahout FAQ file in a google doc? we can easily start adding questions into it that point to obvious missing documentation parts, and mahout committers can add responses below inline. then over time we can take those questions/answers and turn them directly into real docs. I think this will make it easier for a broader range of people to rapidly improve mahout docs in an ad hoc sort of way. i for one will volunteer to help translate the QA stream into real documentation / JIRAs etc.
Re: Performance Issue using item-based approach!
You can, but you shouldn't :) On 04/18/2014 07:23 PM, Ted Dunning wrote: You can always run Hadoop in a local mode. Nothing prevents a single node from being a cluster. :-) On Thu, Apr 17, 2014 at 7:43 AM, Najum Ali naju...@googlemail.com wrote: Ted, Is it also possible to use ItemSimilarityJob in a non-distributed environment? On 17.04.2014 at 16:22, Ted Dunning ted.dunn...@gmail.com wrote: Najum, You should also be able to use the ItemSimilarityJob to compute a limited indicator set. This is stepping off of the path you have been on, but it would allow you to deploy the recommender via a search engine. That makes a lot of code simply vanish. This is also a well-trod production path. On Thu, Apr 17, 2014 at 3:57 AM, Najum Ali naju...@googlemail.com wrote: @Sebastian wow ... you are right. The original csv file is about 21mb and the corresponding precomputed item-item similarity file is about 260mb!! And yes, there are far more than 50 "most similar items" for an item. Trying to restrict this to the 50 (or something like that) most similar items for an item could do the trick, as you said. Ok, I will give it a try and reply later. By the way, what about the SamplingCandidateItemsStrategy or something like this, by using this constructor: GenericItemBasedRecommender(DataModel dataModel, ItemSimilarity similarity, CandidateItemsStrategy candidateItemsStrategy, MostSimilarItemsCandidateItemsStrategy mostSimilarItemsCandidateItemsStrategy) On 17.04.2014 at 12:41, Sebastian Schelter s...@apache.org wrote: Hi Najum, I think I found the problem. Remember: Two items are similar whenever at least one user interacted with both of them (the items co-occur). In the movielens dataset this is true for almost all pairs of items, unfortunately. From 3076 items, more than 11 million similarities are created. A common approach for that (which is not yet implemented in our precomputation, unfortunately) is to only retain the top-k similar items per item. A solution would be to take the csv file that is created by the MultithreadedBatchItemSimilarities and postprocess it so that only the 50 most similar items per item are retained. That should help with your problem. Unfortunately, we don't have code for that yet, maybe you want to try to write that yourself? Best, Sebastian PS: The user-based recommender restricts the number of similar users, I guess that's why it is so fast here.
On 04/17/2014 12:18 PM, Najum Ali wrote: Ok, here you go: I have created a simple class with a main method (no server and other stuff):

public class RecommenderTest {

  public static void main(String[] args) throws IOException, TasteException {
    DataModel dataModel = new FileDataModel(new File("/Users/najum/Documents/recommender-console/src/main/webapp/resources/preference_csv/1mil.csv"));
    ItemSimilarity similarity = new LogLikelihoodSimilarity(dataModel);
    ItemBasedRecommender recommender = new GenericItemBasedRecommender(dataModel, similarity);
    String pathToPreComputedFile = preComputeSimilarities(recommender, dataModel.getNumItems());
    InputStream inputStream = new FileInputStream(new File(pathToPreComputedFile));
    BufferedReader bufferedReader = new BufferedReader(new InputStreamReader(inputStream));
    Collection<GenericItemSimilarity.ItemItemSimilarity> correlations = bufferedReader.lines().map(mapToItemItemSimilarity).collect(Collectors.toList());
    ItemSimilarity precomputedSimilarity = new GenericItemSimilarity(correlations);
    ItemBasedRecommender recommenderWithPrecomputation = new GenericItemBasedRecommender(dataModel, precomputedSimilarity);
    recommend(recommender);
    recommend(recommenderWithPrecomputation);
  }

  private static String preComputeSimilarities(ItemBasedRecommender recommender, int simItemsPerItem) throws TasteException {
    String pathToAbsolutePath = "";
    try {
      File resultFile = new File(System.getProperty("java.io.tmpdir"), "similarities.csv");
      if (resultFile.exists()) {
        resultFile.delete();
      }
      BatchItemSimilarities batchJob = new MultithreadedBatchItemSimilarities(recommender, simItemsPerItem);
      int numSimilarities
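Sebastian's suggested postprocessing step never made it into the thread; a sketch of what it could look like, assuming the similarities csv holds item1,item2,similarity rows (the file names are placeholders; Java 8 matches the setup Najum describes):

import java.io.IOException;
import java.io.PrintWriter;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TopKSimilarities {

  public static void main(String[] args) throws IOException {
    int k = 50;
    // itemId -> its similarity rows, so each list can be capped at k.
    Map<String, List<String[]>> byItem = new HashMap<>();
    for (String line : Files.readAllLines(Paths.get("similarities.csv"))) {
      String[] fields = line.split(","); // item1,item2,similarity
      byItem.computeIfAbsent(fields[0], i -> new ArrayList<>()).add(fields);
    }

    try (PrintWriter out = new PrintWriter("similarities-top50.csv")) {
      for (List<String[]> rows : byItem.values()) {
        // Highest similarity first; keep at most k rows per item.
        rows.sort((a, b) -> Double.compare(
            Double.parseDouble(b[2]), Double.parseDouble(a[2])));
        for (String[] row : rows.subList(0, Math.min(k, rows.size()))) {
          out.println(String.join(",", row));
        }
      }
    }
  }
}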
Re: Installation on Ubuntu
Which version do you use? It shouldn't be a problem with Oracle Java. --sebastian On 04/18/2014 09:39 PM, Christopher Eugene wrote: Hello, I want to install mahout on Ubuntu 14.04. I had previously tried in vain to install on 13.10. Could the version of Java be the problem? I am compiling from source. Any help will be appreciated.
Re: Installation on Ubuntu
That is wrong, but you could use a server such as PredictionIO (which uses Mahout internally) with PHP. --sebastian On 04/18/2014 09:49 PM, Christopher Eugene wrote: @sebastian I have version 1.7. @Andrew I plan on using mahout with php since I heard that there is a new API, or am I wrong? On Fri, Apr 18, 2014 at 10:45 PM, Andrew Musselman andrew.mussel...@gmail.com wrote: I would say if you want to get started, just grab the pre-built version via the download button on the home page of http://mahout.apache.org E.g., following those links you would end up here: http://apache.cs.utah.edu/mahout/0.9 and then get either the -src or non--src version and use the pre-built jars and examples. On Fri, Apr 18, 2014 at 12:39 PM, Christopher Eugene xriseug...@gmail.com wrote: Hello, I want to install mahout on Ubuntu 14.04. I had previously tried in vain to install on 13.10. Could the version of Java be the problem? I am compiling from source. Any help will be appreciated. -- Omar Christopher Eugene http://about.me/mojo706
Re: Installation on Ubuntu
You can, but I'm not sure how much we can help you. Give it a try :) On 04/18/2014 10:11 PM, Christopher Eugene wrote: sorry, I thought I replied to it :). I can ask predictionio-related questions on the list too? On Fri, Apr 18, 2014 at 11:06 PM, Sebastian Schelter s...@apache.org wrote: Please reply to the list, not to me in person :) On 04/18/2014 10:05 PM, Christopher Eugene wrote: Thank you Sebastian, I could've sworn I saw something involving mahout and php not so long ago. Quick question: are all the methods available in mahout available on PredictionIO? On Fri, Apr 18, 2014 at 10:53 PM, Sebastian Schelter s...@apache.org wrote: That is wrong, but you could use a server such as PredictionIO (which uses Mahout internally) with PHP. --sebastian On 04/18/2014 09:49 PM, Christopher Eugene wrote: @sebastian I have version 1.7. @Andrew I plan on using mahout with php since I heard that there is a new API, or am I wrong? On Fri, Apr 18, 2014 at 10:45 PM, Andrew Musselman andrew.mussel...@gmail.com wrote: I would say if you want to get started, just grab the pre-built version via the download button on the home page of http://mahout.apache.org E.g., following those links you would end up here: http://apache.cs.utah.edu/mahout/0.9 and then get either the -src or non--src version and use the pre-built jars and examples. On Fri, Apr 18, 2014 at 12:39 PM, Christopher Eugene xriseug...@gmail.com wrote: Hello, I want to install mahout on Ubuntu 14.04. I had previously tried in vain to install on 13.10. Could the version of Java be the problem? I am compiling from source. Any help will be appreciated. -- Omar Christopher Eugene http://about.me/mojo706
Re: org.apache.mahout.math.IndexException
Hi Mario, this is indeed a bug. The problem is that the CF code (taste) uses long ids, while our math library internally uses int keys. I'll open a jira and post a patch that will hopefully help you. --sebastian On 04/18/2014 11:03 PM, Mario Levitin wrote: In my dataset IDs are strings, so I use MemoryIDMigrator. This migrator produces large longs. I'm not doing any translation. I could not understand why there is a cast to int in the Mahout code. This will produce errors for large long values. On Fri, Apr 18, 2014 at 8:06 PM, Ted Dunning ted.dunn...@gmail.com wrote: Are you translating the IDs down into a range that will fit into ints? On Thu, Apr 17, 2014 at 3:02 PM, Mario Levitin mariolevi...@gmail.com wrote: Hi, I'm trying to run the ALS algorithm. However, I get the following error: Exception in thread "pool-1-thread-3" org.apache.mahout.math.IndexException: Index -691877539 is outside allowable range of [0,2147483647) at org.apache.mahout.math.AbstractVector.set(AbstractVector.java:395) at org.apache.mahout.cf.taste.impl.recommender.svd.ALSWRFactorizer.sparseUserRatingVector(ALSWRFactorizer.java:305) At line 305 in ALSWRFactorizer.java, there is the following code: ratings.set((int) preference.getItemID(), preference.getValue()); My suspicion is that the error results from the cast to int in the above line. Item IDs in Mahout are long, so if you cast a long (which does not fit into an int) then you will get negative numbers and hence the error. However, this explanation also seems implausible to me, since I don't think such an error exists in Mahout code. Any help will be appreciated. Thanks
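For anyone hitting this before the patch lands: what Ted's question hints at is mapping arbitrary external ids into a dense int range yourself and keeping the reverse mapping for output. A plain-Java sketch (the class and method names are made up):

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Maps arbitrary external long ids into the dense [0, n) int range that
// mahout-math vectors can index, and back again for writing results.
public class IdIndex {

  private final Map<Long, Integer> toIndex = new HashMap<Long, Integer>();
  private final List<Long> toId = new ArrayList<Long>();

  public int index(long externalId) {
    Integer idx = toIndex.get(externalId);
    if (idx == null) {
      idx = toId.size();
      toIndex.put(externalId, idx);
      toId.add(externalId);
    }
    return idx;
  }

  public long externalId(int index) {
    return toId.get(index);
  }
}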
Re: org.apache.mahout.math.IndexException
Mario, could you check whether the patch from https://issues.apache.org/jira/browse/MAHOUT-1517 fixes your problem? Best, Sebastian On 04/18/2014 11:03 PM, Mario Levitin wrote: In my dataset ID's are strings so I use MemoryIDMigrator. This migrator produces large longs. I'm not doing any translation. I could not understand why there is a cast to int in the Mahout code. This will produce errors for large long values. On Fri, Apr 18, 2014 at 8:06 PM, Ted Dunning ted.dunn...@gmail.com wrote: Are you translating the ID's down into a range that will fit into int's? On Thu, Apr 17, 2014 at 3:02 PM, Mario Levitin mariolevi...@gmail.com wrote: Hi, I'm trying to run the ALS algorithm. However, I get the following error: Exception in thread pool-1-thread-3 org.apache.mahout.math.IndexException: Index -691877539 is outside allowable range of [0,2147483647) at org.apache.mahout.math.AbstractVector.set(AbstractVector.java:395) at org.apache.mahout.cf.taste.impl.recommender.svd.ALSWRFactorizer.sparseUserRatingVector(ALSWRFactorizer.java:305) At line 305 in ALSWRFactorizer.java, there is the following code ratings.set((int) preference.getItemID(), preference.getValue()); My suspicion is that the error results from the casting to int in the above line. Item IDs in mahout are long, so if you cast a long (which does not fit into an int) then you will get negative numbers and hence the error. However, this explanation also seems to me implausible since I don't think such an error exists in Mahout code. Any help will be appreciated. Thanks
Re: simple idea for improving mahout docs over the next month?
Hi Najum, please write a new mail to ask a question and don't reply to an unrelated thread -- https://people.apache.org/~hossman/#threadhijack If you write a new mail, I'm sure we can help you with your recommender problem. Can you give us a few more details, such as the similarity that you used, how you did the precomputation and how exactly you measure the response time? --sebastian On 04/17/2014 10:49 AM, Najum Ali wrote: Hi guys, I'm pretty much new to mahout and I'm working on this problem here: I have created a precomputed item-item-similarity collection for a GenericItemBasedRecommender. Using the 1M MovieLens data, my item-based recommender is only 40-50% faster than without precomputation (like 589.5ms instead of 1222.9ms). But the user-based recommender instead is really fast, it's like 24.2ms? How can this happen? Why is item-based so slow?
Re: Performance Issue using item-based approach!
Could you take the output of the precomputation, feed it into a standalone recommender and test it there? On 04/17/2014 11:37 AM, Najum Ali wrote: @sebastian "Are you sure that the precomputation is done only once and not in every request?" Yes, a @Bean annotated Object is in Spring per default a singleton instance. I also just tested it out using a System.out.println(). Here is my log: System.out.println("precomputation done!") is called before returning the GenericItemSimilarity. The first two recommendations are item-based - pearson similarity. The third and 4th log are also item-based using the precomputed similarity. The last log is the user-based recommender using pearson. Look at the huge time difference! On 17.04.2014 at 11:23, Sebastian Schelter s...@apache.org wrote: Najum, this is really strange, feeding an item-based recommender with precomputed similarities should give you superfast recommendations. Are you sure that the precomputation is done only once and not in every request? --sebastian On 04/17/2014 11:17 AM, Najum Ali wrote: Hi guys, I have created a precomputed item-item-similarity collection for a GenericItemBasedRecommender. Using the 1M MovieLens data, my item-based recommender is only 40-50% faster than without precomputation (like 589.5ms instead of 1222.9ms). But the user-based recommender instead is really fast, it's like 24.2ms? How can this happen? Here are more details on my implementation: CSV file: 1M prefs, 6040 users, 3706 items. For my implementation I'm using screenshots, because of the good highlighting. My recommender runs inside a webserver (Jetty) using Spring 4 and Java 8. I receive recommendations as a webservice (JSON). For the DataModel, I'm using FileDataModel. The code below creates a precomputed ItemSimilarity when I start the webserver and the property isItemPreComputationEnabled is set to true: For time measuring I'm using AOP. I'm measuring the whole time from entering my Controller to sending the response, based on System.nanoTime() and getting the diff. It's the same time measure for user-based. I have tried to cache the recommender and the similarity with no big difference. I also tried to use CandidateItemsStrategy and MostSimilarItemsCandidateItemsStrategy, but also no performance boost.

public RecommenderBuilder createRecommenderBuilder(ItemSimilarity similarity) throws TasteException {
  final int numberOfUsers = dataModel.getNumUsers();
  final int numberOfItems = dataModel.getNumItems();
  CandidateItemsStrategy candidateItemsStrategy = new SamplingCandidateItemsStrategy(numberOfUsers, numberOfItems);
  MostSimilarItemsCandidateItemsStrategy mostSimilarStrategy = new SamplingCandidateItemsStrategy(numberOfUsers, numberOfItems);
  return model -> new GenericItemBasedRecommender(model, similarity, candidateItemsStrategy, mostSimilarStrategy);
}

I don't know why item-based is taking so much longer than user-based. User-based is like fast as hell. I even tried a dataset using 100k prefs, and 10 million (MovieLens). Every time the user-based is so much faster for any similarity. Hope anyone can help me understand this. Maybe I'm doing something wrong. Thanks!! :))
Re: Performance Issue using item-based approach!
Yes, just to make sure the problem is in the mahout code and not in the surrounding environment. On 04/17/2014 11:43 AM, Najum Ali wrote: @Sebastian What do you mean by a standalone recommender? A simple offline java main program? On 17.04.2014 at 11:41, Sebastian Schelter s...@apache.org wrote: Could you take the output of the precomputation, feed it into a standalone recommender and test it there? On 04/17/2014 11:37 AM, Najum Ali wrote: @sebastian "Are you sure that the precomputation is done only once and not in every request?" Yes, a @Bean annotated Object is in Spring per default a singleton instance. I also just tested it out using a System.out.println(). Here is my log: System.out.println("precomputation done!") is called before returning the GenericItemSimilarity. The first two recommendations are item-based - pearson similarity. The third and 4th log are also item-based using the precomputed similarity. The last log is the user-based recommender using pearson. Look at the huge time difference! On 17.04.2014 at 11:23, Sebastian Schelter s...@apache.org wrote: Najum, this is really strange, feeding an item-based recommender with precomputed similarities should give you superfast recommendations. Are you sure that the precomputation is done only once and not in every request? --sebastian On 04/17/2014 11:17 AM, Najum Ali wrote: Hi guys, I have created a precomputed item-item-similarity collection for a GenericItemBasedRecommender. Using the 1M MovieLens data, my item-based recommender is only 40-50% faster than without precomputation (like 589.5ms instead of 1222.9ms). But the user-based recommender instead is really fast, it's like 24.2ms? How can this happen? Here are more details on my implementation: CSV file: 1M prefs, 6040 users, 3706 items. For my implementation I'm using screenshots, because of the good highlighting. My recommender runs inside a webserver (Jetty) using Spring 4 and Java 8. I receive recommendations as a webservice (JSON). For the DataModel, I'm using FileDataModel. The code below creates a precomputed ItemSimilarity when I start the webserver and the property isItemPreComputationEnabled is set to true: For time measuring I'm using AOP. I'm measuring the whole time from entering my Controller to sending the response, based on System.nanoTime() and getting the diff. It's the same time measure for user-based. I have tried to cache the recommender and the similarity with no big difference. I also tried to use CandidateItemsStrategy and MostSimilarItemsCandidateItemsStrategy, but also no performance boost.

public RecommenderBuilder createRecommenderBuilder(ItemSimilarity similarity) throws TasteException {
  final int numberOfUsers = dataModel.getNumUsers();
  final int numberOfItems = dataModel.getNumItems();
  CandidateItemsStrategy candidateItemsStrategy = new SamplingCandidateItemsStrategy(numberOfUsers, numberOfItems);
  MostSimilarItemsCandidateItemsStrategy mostSimilarStrategy = new SamplingCandidateItemsStrategy(numberOfUsers, numberOfItems);
  return model -> new GenericItemBasedRecommender(model, similarity, candidateItemsStrategy, mostSimilarStrategy);
}

I don't know why item-based is taking so much longer than user-based. User-based is like fast as hell. I even tried a dataset using 100k prefs, and 10 million (MovieLens). Every time the user-based is so much faster for any similarity. Hope anyone can help me understand this. Maybe I'm doing something wrong. Thanks!! :))
Re: Is there any website documentation repository or tool for Apache Mahout?
The templates for the individual pages are in the svn under site/ in markdown format. You can use an online markdown editor to get an approximate idea of how they will look. We don't have a better solution yet, unfortunately. --sebastian Am 17.04.2014 20:09 schrieb Andrew Musselman andrew.mussel...@gmail.com: The content of the main part of each page is written in markdown and parsed by the CMS to render the HTML. I'm not aware of a way to submit pages except as patches. On Apr 17, 2014, at 1:52 PM, Pat Ferrel p...@occamsmachete.com wrote: +1 the project uses Confluence for the wiki. All but committers are blocked from editing pages. This is getting increasingly frustrating. How many tickets and patches are being passed around now? I can't follow them all. I haven't used Confluence for 4-5 years now, but there must be some way to allow edits and new pages from anyone, pending approval to publish? On Apr 17, 2014, at 4:47 AM, tuxdna tux...@gmail.com wrote: I have seen the instructions here [1], but I am not sure if there is any source code for the website documentation. So here are my questions: * Does the Apache Mahout project use any tool to generate the website documentation as it is now at http://mahout.apache.org ? * Suppose I want to add some correction or addition to the current Apache Mahout documentation. Can I get read-only access to the source of the website, so that I can immediately see how the edits will look once they are accepted? I was thinking in terms of the way GitHub Pages [2] work. For example, if I use Jekyll [3], I can view the changes on my machine exactly as they will appear on the final website. Regards, Saleem [1] http://mahout.apache.org/developers/how-to-update-the-website.html [2] https://pages.github.com/ [3] http://jekyllrb.com/
Re: simple idea for improving mahout docs over the next month?
Hi Jay, I'm not sure what the benefit of this approach is, people can already post their questions to the mailing list and get answers here, why would a Google doc be helpful? --sebastian On 04/16/2014 09:31 PM, Jay Vyas wrote: Hi mahout... I finally thought of a really easy way of ad-hoc improvement of mahout docs, that can feed into the efforts to get the formal docs improved. Any interest in creating a shared mahout FAQ file in a Google doc? We can easily start adding questions into it that point to obvious missing documentation parts, and mahout committers can add responses below inline. Then over time we can take those questions/answers and turn them directly into real docs. I think this will make it easier for a broader range of people to rapidly improve the mahout docs in an ad-hoc sort of way. I for one will volunteer to help translate the Q&A stream into real documentation / JIRAs etc.
Documentation, Documentation, Documentation
Hi, this is another reminder that we still have to finish our documentation improvements! The website looks shiny now and there have been lots of discussions about new directions, but we still have some work to do in cleaning up webpages. We should especially make sure that the examples work. Please help with that; anyone who is willing to sacrifice some time to go through a page and try out the steps described is of great help to the project. It would also be awesome to get some help in creating a few new pages, especially for the recommenders. Here's the list of documentation-related jiras for 1.0: https://issues.apache.org/jira/browse/MAHOUT-1441?jql=project%20%3D%20MAHOUT%20AND%20component%20%3D%20Documentation%20AND%20resolution%20%3D%20Unresolved%20ORDER%20BY%20due%20ASC%2C%20priority%20DESC%2C%20created%20ASC Best, Sebastian
Re: PreferenceArray userID uniqueness?
Yes, it's a unique identifier for a user. --sebastian On 04/11/2014 04:41 PM, Mike Summers wrote: Does the userID of a PreferenceArray need to be unique across all entries in a FastByIDMap? I'm comparing two types of objects that contain the same set of traits, however it's possible that the userID (primary key) is not unique, as it comes from two db tables. Thanks.
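If the raw primary keys can collide across the two tables, one simple workaround (a sketch, not the only option) is to shift one table's IDs into a disjoint range before building the FastByIDMap. The offset below is an arbitrary assumption that just has to exceed the largest ID in the first table:

import org.apache.mahout.cf.taste.impl.common.FastByIDMap;
import org.apache.mahout.cf.taste.impl.model.GenericDataModel;
import org.apache.mahout.cf.taste.impl.model.GenericUserPreferenceArray;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.model.PreferenceArray;

public class DisjointIdExample {

  // Shift the second table's primary keys into their own range so the
  // combined key space stays unique; 1,000,000,000 is a made-up offset.
  private static final long TABLE_B_OFFSET = 1_000_000_000L;

  public static DataModel combine() {
    FastByIDMap<PreferenceArray> userData = new FastByIDMap<>();

    PreferenceArray fromTableA = new GenericUserPreferenceArray(1);
    fromTableA.setUserID(0, 42L);                  // ID as stored in table A
    fromTableA.setItemID(0, 7L);
    fromTableA.setValue(0, 3.0f);
    userData.put(42L, fromTableA);

    PreferenceArray fromTableB = new GenericUserPreferenceArray(1);
    fromTableB.setUserID(0, 42L + TABLE_B_OFFSET); // same raw ID, different table
    fromTableB.setItemID(0, 7L);
    fromTableB.setValue(0, 5.0f);
    userData.put(42L + TABLE_B_OFFSET, fromTableB);

    return new GenericDataModel(userData);
  }
}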
Re: Best practice for partial cartesian product
I don't know a good name for that. The problem is that a quadratic amount of pairs needs to be emitted here. In our collaborative filtering code, we solve this through downsampling. --sebastian On 04/08/2014 10:08 AM, Reinis Vicups wrote: Hi, this is not a mahout question directly, but I figured that you guys most likely can answer it. Actually I have two questions: 1. This: {(1,2); (1,3); (2,3)} is not the full cartesian product, right? It is missing (1,1); (2,2); (3,3); (2,1); My question is - what is it called? Partial cartesian? Asymmetric cartesian? 2. If I try to build the product I described above in a reducer, what would be the best practice? My current code looks like this: @Override public void reduce(final VarLongWritable key, final Iterable<VarLongWritable> values, final Context context) { final VarLongWritable[] valueArray = Iterables.toArray(values, VarLongWritable.class); for (int i = 0; i < valueArray.length; i++) { for (int j = i + 1; j < valueArray.length; j++) { context.write(new PairWritable(valueArray[i].get(), valueArray[j].get()), customerPreferenceWritable); } } } I don't feel quite right with this solution since I make a copy of the values in valueArray and believe that it will cost me OutOfMemoryExceptions with larger data sets. thanks and br reinis
Re: Best practice for partial cartesian product
Have a look at the sampleDown method in RowSimilarityJob: https://svn.apache.org/viewvc/mahout/trunk/core/src/main/java/org/apache/mahout/math/hadoop/similarity/cooccurrence/RowSimilarityJob.java?view=markup On 04/08/2014 10:33 AM, Reinis Vicups wrote: Sebastian, thank you very much for your response. Could you or anyone point me to the mahout classes where this is being solved? thank you guys reinis On 08.04.2014 10:27, Sebastian Schelter wrote: I don't know a good name for that. The problem is that a quadratic amount of pairs needs to be emitted here. In our collaborative filtering code, we solve this through downsampling. --sebastian
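In the spirit of that sampleDown method, a sketch of the same pairing reducer with a cap: keys with too many values are reservoir-sampled before the quadratic pair emission, which bounds both heap usage and output size. The cap, the class name, and the Text output (standing in for the original PairWritable) are assumptions:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.mahout.math.VarLongWritable;

public class SampledPairingReducer
    extends Reducer<VarLongWritable, VarLongWritable, Text, NullWritable> {

  // Cap chosen arbitrarily for the sketch; tune it to your data.
  private static final int MAX_VALUES_PER_KEY = 500;

  private final Random random = new Random();

  @Override
  protected void reduce(VarLongWritable key, Iterable<VarLongWritable> values, Context context)
      throws IOException, InterruptedException {
    // Reservoir sampling: keep at most MAX_VALUES_PER_KEY elements in memory,
    // which also bounds the number of emitted pairs per key.
    List<Long> sample = new ArrayList<>(MAX_VALUES_PER_KEY);
    int seen = 0;
    for (VarLongWritable value : values) {
      seen++;
      if (sample.size() < MAX_VALUES_PER_KEY) {
        sample.add(value.get());
      } else {
        int pos = random.nextInt(seen);
        if (pos < MAX_VALUES_PER_KEY) {
          sample.set(pos, value.get());
        }
      }
    }
    // Emit the "asymmetric" pairs (i, j) with i < j, as in the original code.
    for (int i = 0; i < sample.size(); i++) {
      for (int j = i + 1; j < sample.size(); j++) {
        context.write(new Text(sample.get(i) + "," + sample.get(j)), NullWritable.get());
      }
    }
  }
}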
Re: Can anyone help
It seems there is a problem with your hdfs, how did you configure that? --sebastian On 04/08/2014 07:23 PM, Neetha wrote: Hi, I am trying to run Mahout -kmeans clustering on hadoop, but I am getting this error, hduser3@ubuntu:/usr/local/hadoop-1.0.1/mahout3$ bin/mahout seqdirectory \-i mahout-work/reuters-out \-o mahout-work/reuters-out-seqdir \-c UTF-8 -chunk 5 Warning: $HADOOP_HOME is deprecated. hduser3@ubuntu:/usr/local/hadoop-1.0.1/mahout3$ bin/mahout seqdirectory \-i mahout-work/reuters-out \-o mahout-work/reuters-out-seqdir \-c UTF-8 -chunk 5 Warning: $HADOOP_HOME is deprecated. Running on hadoop, using /usr/local/hadoop-1.0.1/bin/hadoop and HADOOP_CONF_DIR= MAHOUT-JOB: /usr/local/hadoop-1.0.1/mahout3/examples/target/ mahout-examples-0.7-job.jar Warning: $HADOOP_HOME is deprecated. 14/04/07 12:10:14 INFO common.AbstractJob: Command line arguments: {--charset=[UTF-8], --chunkSize=[5], --endPhase=[2147483647], --fileFilterClass=[org.apache.mahout.text.PrefixAdditionFilter], --input=[mahout-work/reuters-out], --keyPrefix=[], --output=[mahout-work/reuters-out-seqdir], --startPhase=[0], --tempDir=[temp]} 14/04/07 12:10:15 WARN hdfs.DFSClient: DataStreamer Exception: org.apache.hadoop.ipc.RemoteException: java.io.IOException: File /user/hduser3/mahout-work/reuters-out-seqdir/chunk-0 could only be replicated to 0 nodes, instead of 1 at org.apache.hadoop.hdfs.server.namenode.FSNamesystem. getAdditionalBlock(FSNamesystem.java:1556) at org.apache.hadoop.hdfs.server.namenode.NameNode.addBlock( NameNode.java:696) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke( NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke( DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:616) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:563) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1388) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1384) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:416) at org.apache.hadoop.security.UserGroupInformation.doAs( UserGroupInformation.java:1093) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1382) at org.apache.hadoop.ipc.Client.call(Client.java:1066) at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:225) at $Proxy1.addBlock(Unknown Source) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke( NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke( DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:616) at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod( RetryInvocationHandler.java:82) at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke( RetryInvocationHandler.java:59) at $Proxy1.addBlock(Unknown Source) at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream. locateFollowingBlock(DFSClient.java:3507) at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream. nextBlockOutputStream(DFSClient.java:3370) at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream. access$2700(DFSClient.java:2586) at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ DataStreamer.run(DFSClient.java:2826) 14/04/07 12:10:15 WARN hdfs.DFSClient: Error Recovery for block null bad datanode[0] nodes == null 14/04/07 12:10:15 WARN hdfs.DFSClient: Could not get block locations. Source file /user/hduser3/mahout-work/reuters-out-seqdir/chunk-0 - Aborting... 
Apr 7, 2014 12:10:15 PM com.google.common.io.Closeables close WARNING: IOException thrown while closing Closeable. org.apache.hadoop.ipc.RemoteException: java.io.IOException: File /user/hduser3/mahout-work/reuters-out-seqdir/chunk-0 could only be replicated to 0 nodes, instead of 1 at org.apache.hadoop.hdfs.server.namenode.FSNamesystem. getAdditionalBlock(FSNamesystem.java:1556) at org.apache.hadoop.hdfs.server.namenode.NameNode.addBlock( NameNode.java:696) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke( NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke( DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:616) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:563) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1388) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1384) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:416) at org.apache.hadoop.security.UserGroupInformation.doAs( UserGroupInformation.java:1093) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1382) at
Re: Solr+Mahout Recommender Demo Site
The top 3 recommendations based on videos you liked are very good! Nice job. On 04/06/2014 07:26 PM, Pat Ferrel wrote: After having integrated several versions of the Mahout and Myrrix recommenders at fairly large scale, I was interested in solving three problems that these did not directly provide for: 1) realtime queries for recs using data not yet incorporated into the training set. Myrrix allows this but Mahout using the hadoop mr version does not. 2) cross-recommendations from two or more action types (say purchase and detail-view) 3) blending metadata and user preference data to return recs (for example category user preferences = recs) Using Solr + Mahout provided an amazingly flexible and performant way to do this. Ted wrote about his experience with this basic approach in his recent book. Take user preferences, run them through RowSimilarityJob, and you get an item-by-item similarity matrix. This is the core of an item-based cooccurrence recommender. If you take the similarity matrix and convert it into a list of tokens per row, you have something Solr can index. If you then use a user's history as a query on the indexed data, you get an ordered list of recommendations. When I set out to do #1 and #3, the need for CF data AND metadata was the first problem. So I mined the web for video reviews and video metadata. Then logging any users who visit the site will lead to data for #2 and #1. The demo site is https://guide.finderbots.com and instructions are at the end of this for anyone who would like to test it out. As a crude user test there is a procedure we ask you to follow to help gather quality-of-recommendations data. It's running out of my closet over Comcast, so if it's down I may have tripped over a cord; sorry, try again later. There are a bunch of different methods for making recs illustrated on the site. One method that illustrates blending metadata uses preference data from you, and metadata to bias and filter recs. Imagine that you have trained the system with your preferences by making some video picks. Now imagine you'd like to get recommendations for Comedies from Netflix based on your previous video preferences. This is done with a single Solr query on indexed video fields that hold genre, similar videos (from the similarity matrix), and sources. The query finds similar videos to the ones you have liked, with the genre "Comedy" boosted by some amount, but only those that have at least one source = "Netflix". I'll be doing some blog posts covering the specifics of how each rec type is done, the site and DB architecture, and the Solr setup. The project uses the Solr recommender prep code here: https://github.com/pferrel/solr-recommender BTW I plan to publish obfuscated usage data in the github repo. begin form letter === Please use a very newly updated browser (latest Firefox, Chrome, Safari, and nothing older than IE10); the site doesn't yet check browser compatibility but relies on HTML5 and CSS3 rather heavily. 1) go to https://guide.finderbots.com/users/sign_up to create an account 2) go to https://guide.finderbots.com/trainers to 'train' the recommender: hit thumbs up on videos you like. There are 20 pages of training videos; you can leave at any time, but if you can go through them all it would be appreciated. 3) go to https://guide.finderbots.com/guides/recommend to immediately get personalized recs from your training data. If you completed the trainer, check the top line of recs and count how many are videos you liked or would like to see.
Scroll right or left to see a total of 24 in four batches of 6. If you could report to me the total you thought were good recs it would be greatly appreciated. 4) browse videos by various criteria here: https://guide.finderbots.com/guides These are not recommendations, they are simply a catalog. 5) control how you browse videos by clicking the gears icon. You can set all videos to be from one or more sources here. If you choose Netflix alone (don’t forget to uncheck ‘all’) then recs and browsed videos will all be available on Netflix.
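A sketch of the indexing idea described above, i.e. turning each row of the item-item similarity matrix into a Solr-indexable document whose field is just the space-separated IDs of similar items. The map shape and the printed tab-separated format are assumptions; in practice this would feed SolrJ or a CSV update handler:

import java.util.List;
import java.util.Map;

public class IndicatorDocs {

  // similarItemsPerItem: itemID -> IDs of its most similar items,
  // e.g. taken from the rows of the RowSimilarityJob output.
  public static void print(Map<Long, List<Long>> similarItemsPerItem) {
    for (Map.Entry<Long, List<Long>> row : similarItemsPerItem.entrySet()) {
      StringBuilder indicators = new StringBuilder();
      for (long similarItem : row.getValue()) {
        indicators.append(similarItem).append(' ');
      }
      // One "document" per item: its ID plus the tokens Solr will index.
      // A user's liked-item history then becomes a query against this field.
      System.out.println(row.getKey() + "\t" + indicators.toString().trim());
    }
  }
}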
Re: Number of features for ALS
Use k-fold cross-validation or hold-out tests for estimating the quality of different parameter combinations. --sebastian On 03/30/2014 11:53 AM, Niklas Ekvall wrote: Hi, My name is Niklas Ekvall and I have an implementation of the recommender algorithm Large-scale Parallel Collaborative Filtering for the Netflix Prize, and now I'm wondering how to choose the number of features and lambda. Could any of you guys help me with a stepwise strategy to choose or optimize these two parameters? Best regards, Niklas 2014-03-27 19:07 GMT+01:00 j.barrett Strausser j.barrett.straus...@gmail.com: Thanks Ted, Yes for the time problem. We tend to use aggregations of session data. So instead of asking for user recommendations we do things like user+session recommendations. Of course, deciding when sessions start and stop isn't trivial. Ideally what I would want to do is time-weight views using a kernel or convolution. That's a bit heavy, so we typically have a global model, that is basically all preferences over time. Then these user+session type models. We can then combine these at another level to give recommendations based on what you like throughout time versus what you have been doing recently. -b On Thu, Mar 27, 2014 at 1:59 PM, Ted Dunning ted.dunn...@gmail.com wrote: For the poly-syllable challenged, heteroscedasticity - the degree of variation changes. This is common with counts because you expect the standard deviation of count data to be proportional to sqrt(n). time inhomogeneity - changes in behavior over time. One way to handle this (roughly) is to first remove variation in personal and item means over time (if using ratings) and then to segment user histories into episodes. By including both short and long episodes you get some repair for changes in personal preference. A great example of how this works/breaks is Christmas music. On December 26th, you want to *stop* recommending this music, so it really pays to limit histories at this point. By having an episodic user session that starts around November and runs to Christmas, you can get good recommendations for seasonal songs and not pollute the rest of the universe. On Thu, Mar 27, 2014 at 8:30 AM, j.barrett Strausser j.barrett.straus...@gmail.com wrote: For my team it has usually been heteroscedasticity and time inhomogeneity. On Thu, Mar 27, 2014 at 10:18 AM, Tevfik Aytekin tevfik.ayte...@gmail.com wrote: Interesting topic, Ted, can you give examples of those mathematical assumptions under-pinning ALS which are violated by the real world? On Thu, Mar 27, 2014 at 3:43 PM, Ted Dunning ted.dunn...@gmail.com wrote: How can there be any other practical method? Essentially all of the mathematical assumptions under-pinning ALS are violated by the real world. Why would any mathematical consideration of the number of features be much more than heuristic? That said, you can make an information content argument. You can also make the argument that if you take too many features, it doesn't much hurt, so you should always take as many as you can compute. On Thu, Mar 27, 2014 at 6:33 AM, Sebastian Schelter s...@apache.org wrote: Hi, does anyone know of a principled approach of choosing the number of features for ALS (other than cross-validation)? --sebastian -- https://github.com/bearrito @deepbearrito -- https://github.com/bearrito @deepbearrito
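A hold-out test of that kind can be scripted against the in-memory ALS implementation in Taste. A sketch, assuming a ratings.csv and an arbitrary parameter grid, using RMSRecommenderEvaluator with 90% of each user's preferences for training:

import java.io.File;

import org.apache.mahout.cf.taste.eval.RecommenderEvaluator;
import org.apache.mahout.cf.taste.impl.eval.RMSRecommenderEvaluator;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.recommender.svd.ALSWRFactorizer;
import org.apache.mahout.cf.taste.impl.recommender.svd.SVDRecommender;
import org.apache.mahout.cf.taste.model.DataModel;

public class AlsGridSearch {
  public static void main(String[] args) throws Exception {
    DataModel model = new FileDataModel(new File("ratings.csv")); // placeholder path
    RecommenderEvaluator evaluator = new RMSRecommenderEvaluator();

    // Train on 90% of each user's preferences, compute RMSE on the held-out 10%,
    // and compare across a small (made-up) grid of parameter combinations.
    for (int numFeatures : new int[] {10, 20, 50, 100}) {
      for (double lambda : new double[] {0.01, 0.05, 0.1}) {
        double rmse = evaluator.evaluate(
            trainingModel -> new SVDRecommender(trainingModel,
                new ALSWRFactorizer(trainingModel, numFeatures, lambda, 20)),
            null, model, 0.9, 1.0);
        System.out.printf("numFeatures=%d lambda=%.2f RMSE=%.4f%n",
            numFeatures, lambda, rmse);
      }
    }
  }
}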
Re: (help!) Can someone scan this
Jay, which version of Mahout are you using? Have you tried to explicitly set the temp path? --sebastian On 03/29/2014 01:52 AM, Jay Vyas wrote: Hi again mahout: I'm wrapping a distributed recommender like this: https://raw.githubusercontent.com/jayunit100/bigpetstore/master/src/main/java/org/bigtop/bigpetstore/clustering/BPSRecommnder.java And it's not working. Any thoughts on why? The error message is simply that intermediate data sets don't exist (i.e. numUsers.bin or /tmp/preparePreferencesMatrix...). Basically it's clear that the intermediate jobs are failing, but I can't see any reason why they would fail, and I don't see any meaningful stack traces. I've found a lot of good whitepapers and stuff on how the algorithms work, but it's not clear what is really done for me by mahout, and what I have to do on my own for the distributed recommender APIs.
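One way to set the temp path explicitly is to drive RecommenderJob through ToolRunner with --tempDir. A sketch with placeholder paths, pointing tempDir at a fresh, writable location so stale or missing intermediate files (numUsers.bin, the preference-matrix preparation output) are easier to rule out:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.util.ToolRunner;
import org.apache.mahout.cf.taste.hadoop.item.RecommenderJob;

public class RecommenderJobDriver {
  public static void main(String[] args) throws Exception {
    // All paths below are placeholders; --tempDir is where the intermediate
    // data sets of the distributed item-based recommender are written.
    ToolRunner.run(new Configuration(), new RecommenderJob(), new String[] {
        "--input", "/user/me/prefs.csv",
        "--output", "/user/me/recommendations",
        "--tempDir", "/user/me/recommender-temp",
        "--similarityClassname", "SIMILARITY_COOCCURRENCE",
        "--numRecommendations", "10"
    });
  }
}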
Re: The 3 distributed recommenders
Hi Jay, there's not much documentation unfortunately. We're in the process of creating that however. We removed the pseudo-distributed recommender, mainly because nobody ever used it. There are two research papers that could help you with understanding the other two distributed recommenders: For ALS: Distributed Matrix Factorization with MapReduce using a series of Broadcast-Joins, RecSys'13 http://ssc.io/wp-content/uploads/2011/12/sys024-schelter.pdf For item-based: Scalable Similarity-Based Neighborhood Methods with MapReduce, RecSys'12 http://ssc.io/wp-content/uploads/2012/06/rec11-schelter.pdf On 03/28/2014 02:04 PM, Jay Vyas wrote: Hi mahout: Looking through the source code there are 3 distributed recommenders... the als recommender the item recommender the pseudo recommender Any docs differentiating these?
Number of features for ALS
Hi, does anyone know of a principled approach of choosing the number of features for ALS (other than cross-validation?) --sebastian
Re: Does Recommender System Overview Demo work?
Hi Bhargav, you are right, the content on the page is outdated and contains some errors. I've created a jira ticket to fix this [1]. Thank you for reporting the problem! [1] https://issues.apache.org/jira/browse/MAHOUT-1485 On 03/24/2014 04:41 AM, Bhargav Golla wrote: Hi, I was wondering if the demo existing at https://mahout.apache.org/users/recommender/recommender-documentation.html still works. I don't find a webapp directory in integration/ and hence, even after I add the jetty plugin to the pom.xml in integration/, it is throwing an exception. Bhargav Golla Committer, ASF Github http://www.github.com/bhargavgolla | LinkedIn http://www.linkedin.com/in/bhargavgolla | Website http://www.bhargavgolla.com/
Re: Does Recommender System Overview Demo work?
The webapp in Mahout does not offer much functionality. If you'd like to use Mahout via a web interface, I suggest you either use predictionIO [1] or kornakapi [2]. Best, Sebastian [1] http://prediction.io [2] http://ssc.io/a-recommendation-webservice-in-10-minutes/ On 03/24/2014 02:29 PM, Bhargav Golla wrote: Hi Sebastian, Thanks for letting me know. I was wondering if it was removed only in the 0.9 version. Can I check out the 0.8 branch and use the webapp in that branch? Bhargav Golla Developer. Freelancer. Github http://www.github.com/bhargavgolla | LinkedIn http://www.linkedin.com/in/bhargavgolla | Website http://www.bhargavgolla.com/
Re: Does Recommender System Overview Demo work?
Would be great to have such an overview on the mahout website. On 03/24/2014 03:18 PM, Jay Vyas wrote: I've tried to start disambiguating the difference between mahout distributed vs local tutorials here, because I've found it causes problems for a lot of people (including me): http://jayunit100.blogspot.com/2014/02/a-few-nice-posts-about-distirbuted.html Anyone want to collaborate on a two-table wiki page which links to tutorials about distributed vs single-node implementations of all algorithms? On Mon, Mar 24, 2014 at 10:00 AM, Suneel Marthi suneel_mar...@yahoo.com wrote: It was removed in 0.9 and am not sure if it was there in 0.8. I vaguely remember removing it in 0.9 based on a conversation with Manuel on user@. Manuel, if you could chime in here.
Re: Problem with K-Means clustering on Amazon EMR
Hi Konstantin, Great to see that you located the error. Could you open a jira issue and submit a patch that contains an updated error message? Thank you, Sebastian On 03/23/2014 02:57 PM, Konstantin Slisenko wrote: Hi! I investigated the situation. RandomSeedGenerator ( http://grepcode.com/file/repo1.maven.org/maven2/org.apache.mahout/mahout-core/0.8/org/apache/mahout/clustering/kmeans/RandomSeedGenerator.java?av=f) has the following code: FileSystem fs = FileSystem.get(output.toUri(), conf); ... fs.getFileStatus(input).isDir() The FileSystem object was created from the output path, which was not specified correctly by me (I didn't use the s3:// prefix for this path). Afterwards getFileStatus has the input path as parameter, which was correct. This caused the misunderstanding. To prevent this misunderstanding, I propose to improve the error message by adding the following details: 1. Specify which filesystem type is used (DistributedFileSystem, NativeS3FileSystem, etc., using fs.getClass().getName()) 2. Then specify which path cannot be processed correctly. This can be done by a validation utility which can be applied in many places in Mahout. When we use Mahout we need to specify many paths, and we also can use many types of file systems: local for debugging, distributed on Hadoop, and s3 on Amazon. In this case better error messages can save much time. I think that refactoring is not needed for this case. 2014-03-16 22:19 GMT+03:00 Jay Vyas jayunit...@gmail.com: I agree, best to be explicit when creating filesystem instances by using the two-argument get(...). It's time to update to the FileSystem 2.0 APIs. Can you file a Jira for this? If not I will :) On Mar 16, 2014, at 12:37 PM, Sebastian Schelter s...@apache.org wrote: I've also encountered a similar error once. It's really just the FileSystem.get call that needs to be modified. I think it's a good idea to walk through the codebase and refactor this where necessary. --sebastian On 03/16/2014 05:16 PM, Andrew Musselman wrote: Another wild guess, I've had issues trying to use the 's3' protocol from Hadoop and got things working by using the 's3n' protocol instead. On Mar 16, 2014, at 8:41 AM, Jay Vyas jayunit...@gmail.com wrote: I specifically have fixed mapreduce jobs by doing what the error message suggests. But maybe (hopefully) there is another workaround that is configuration driven. Just a hunch, but maybe mahout needs to be refactored to create fs objects using the get(uri,conf) calls? As hadoop evolves to support different flavors of HCFS, using API calls that are more flexible (i.e. like the fs.get(uri,conf) one) will probably be a good thing to keep in mind. On Mar 16, 2014, at 9:22 AM, Frank Scholten fr...@frankscholten.nl wrote: Hi Konstantin, Good to hear from you. The link you mentioned points to EigenSeedGenerator, not RandomSeedGenerator. The problem seems to be with the call to fs.getFileStatus(input).isDir() It's been a while and I don't remember, but perhaps you have to set additional Hadoop fs properties to use S3. See https://wiki.apache.org/hadoop/AmazonS3. Perhaps you can isolate the cause of this by creating a small Java main app with that line of code and run it in the debugger. Cheers, Frank On Sun, Mar 16, 2014 at 12:07 PM, Konstantin Slisenko kslise...@gmail.com wrote: Hello! I run a text-documents clustering on a Hadoop cluster in Amazon Elastic Map Reduce. As input and output I use the S3 Amazon file system. I specify all paths as s3://bucket-name/folder-name. 
SparseVectorsFromSequenceFile works correctly with S3, but when I start the K-Means clustering job, I get this error: Exception in thread main java.lang.IllegalArgumentException: This file system object (hdfs://172.31.41.65:9000) does not support access to the request path 's3://by.kslisenko.bigdata/stackovweflow-small/out_new/sparse/tfidf-vectors' You possibly called FileSystem.get(conf) when you should have called FileSystem.get(uri, conf) to obtain a file system supporting your path. at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:375) at org.apache.hadoop.hdfs.DistributedFileSystem.checkPath(DistributedFileSystem.java:106) at org.apache.hadoop.hdfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:162) at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:530) at org.apache.mahout.clustering.kmeans.RandomSeedGenerator.buildRandom(RandomSeedGenerator.java:76) at org.apache.mahout.clustering.kmeans.KMeansDriver.run(KMeansDriver.java:93) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65) at bbuzz2011.stackoverflow.runner.RunnerWithInParams.cluster(RunnerWithInParams.java:121) at bbuzz2011.stackoverflow.runner.RunnerWithInParams.run(RunnerWithInParams.java:52) at bbuzz2011.stackoverflow.runner.RunnerWithInParams.main(RunnerWithInParams.java:41
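A sketch of the validation utility proposed above: resolve every path against its own filesystem via Path.getFileSystem(conf) instead of reusing a FileSystem built from a different path, and report both the filesystem class and the offending path in the error. The helper name is made up:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public final class PathValidator {
  private PathValidator() {}

  public static void validate(Configuration conf, Path... paths) throws IOException {
    for (Path path : paths) {
      // Each path resolves its own filesystem (hdfs://, s3://, file://, ...),
      // never FileSystem.get(conf), which only knows the default filesystem.
      FileSystem fs = path.getFileSystem(conf);
      if (!fs.exists(path)) {
        throw new IOException("Path " + path + " does not exist on filesystem "
            + fs.getClass().getName() + " (" + fs.getUri() + ')');
      }
    }
  }
}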
Documentation, documentation, documentation
Hi, It's great to see a lot of work being spent on cleaning up the website. I think we have already done a great job here, but there are still a few more pages that need work. I created a jira issue for every single page that needs some work; it would be awesome if we could find enough volunteers to finish this quickly. If you wanna take a ticket, write a comment that you're starting work on it, go through the website, check it for dead links and formatting errors, and try out the examples that are listed with the current release to see if everything still works. Either attach a text file containing a new version of the page to the issue or add a comment on the issue that details the fix that you want to see (e.g. remove link ... because it is dead). Here's an overview of the tickets: MAHOUT-1471 Clean up website on Canopy Clustering MAHOUT-1472 Clean up website on Fuzzy k-Means MAHOUT-1473 Clean up website on Spectral Clustering MAHOUT-1474 Add Seinfeld clustering example MAHOUT-1475 Clean up website on Naive Bayes MAHOUT-1476 Clean up website on Hidden Markov Models MAHOUT-1477 Clean up website on Logistic Regression MAHOUT-1478 Clean up website on Random Forests MAHOUT-1479 Clean up website on wikipedia example MAHOUT-1480 Clean up website on 20 newsgroups MAHOUT-1481 Clean up website on breiman example MAHOUT-1482 Rework quickstart website I would kindly ask Shannon to take 1473, Frank 1474 and Frank or Ted 1477. Let's quickly finish the work on documenting what we have, so we can move on to new and exciting developments in Mahout! --sebastian
Re: Documentation, documentation, documentation
Sorry, I seem to have overlooked this. Could you move the cleanings of canopy to 1471? Thank you. On 03/22/2014 04:54 PM, Pavan Kumar N wrote: I have already added the canopy clustering cleanup as part of jira 1450. Also created a new issue for adding streaming kmeans.
Re: Problem with K-Means clustering on Amazon EMR
I've also encountered a similar error once. It's really just the FileSystem.get call that needs to be modified. I think it's a good idea to walk through the codebase and refactor this where necessary. --sebastian On 03/16/2014 05:16 PM, Andrew Musselman wrote: Another wild guess, I've had issues trying to use the 's3' protocol from Hadoop and got things working by using the 's3n' protocol instead.
Re: Compiling Mahout with maven in Eclipse
Maven should generate the classes automatically. Have you tried running mvn -DskipTests clean install on the commandline? On 03/13/2014 09:50 AM, Kevin Moulart wrote: How can I generate them to make these errors go away then? Or don't I have to? Kévin Moulart 2014-03-13 9:17 GMT+01:00 Sebastian Schelter ssc.o...@googlemail.com: Those are autogenerated. On 03/13/2014 09:05 AM, Kevin Moulart wrote: Ok, it does compile with maven in eclipse as well, but still, many imports are not recognized in the sources: - import org.apache.mahout.math.function.IntObjectProcedure; - import org.apache.mahout.math.map.OpenIntLongHashMap; - import org.apache.mahout.math.map.OpenIntObjectHashMap; - import org.apache.mahout.math.set.OpenIntHashSet; - import org.apache.mahout.math.list.DoubleArrayList; ... Pretty much all the problems come from the OpenInt... classes that it doesn't seem to find. Is there a jar or a pom entry I need to add here? Or do I have the wrong version of org.apache.mahout.math, because I can't find those maps/sets/lists in the math package? (I have the same problem on my Windows, CentOS and Mac OS machines) Kévin Moulart 2014-03-12 17:00 GMT+01:00 Kevin Moulart kevinmoul...@gmail.com: Never mind, I found where the problem lay: I deleted the full content of .m2 and retried it as a non-root user and it worked. Trying in Eclipse now, with tests; I'll let you know if it doesn't work. Kévin Moulart 2014-03-12 16:45 GMT+01:00 Kevin Moulart kevinmoul...@gmail.com: Hi, I tried to fix all the problems I had configuring Eclipse in order to compile mahout in it, using maven clean package as the goal. First I had to make a change in mahout-core in the class GroupTree.java, line 171: stack = new ArrayDeque<GroupTree>(); Then I tried compiling with eclipse (I already had the plugin and all imported, and I'm working on the trunk version). From eclipse it runs until it tries compiling the examples: [INFO] Building jar: /home/myCompany/Workspace_eclipse/mahout-trunk/examples/target/mahout-examples-1.0-SNAPSHOT-job.jar [INFO] [INFO] Reactor Summary: [INFO] [INFO] Mahout Build Tools SUCCESS [ 1.173 s] [INFO] Apache Mahout . SUCCESS [ 0.307 s] [INFO] Mahout Math ... SUCCESS [ 8.041 s] [INFO] Mahout Core ... SUCCESS [ 8.378 s] [INFO] Mahout Integration SUCCESS [ 1.030 s] [INFO] Mahout Examples ... FAILURE [ 5.325 s] [INFO] Mahout Release Package SKIPPED [INFO] Mahout Math/Scala wrappers SKIPPED [INFO] Mahout Spark bindings . SKIPPED [INFO] [INFO] BUILD FAILURE [INFO] [INFO] Total time: 24.630 s [INFO] Finished at: 2014-03-12T16:38:08+01:00 [INFO] Final Memory: 101M/1430M [INFO] [ERROR] Failed to execute goal org.apache.maven.plugins:maven-assembly-plugin:2.4:single (job) on project mahout-examples: Failed to create assembly: Error creating assembly archive job: IOException when zipping com/ibm/icu/ICUConfig.properties: invalid LOC header (bad signature) - [Help 1] [ERROR] [ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch. [ERROR] Re-run Maven using the -X switch to enable full debug logging. 
[ERROR] [ERROR] For more information about the errors and possible solutions, please read the following articles: [ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoExecutionException [ERROR] [ERROR] After correcting the problems, you can resume the build with the command [ERROR] mvn <goals> -rf :mahout-examples It does the exact same thing when I try typing mvn clean package in a terminal, but when I try it as root, it works, so it might be an issue with the permissions; however, I fail to see where (I did a chown -R on my entire home folder just to be on the safe side and it still fails). Has anyone had the same problem? Any idea about how to fix it? Kévin Moulart
Re: Compiling Mahout with maven in Eclipse
Are you executing Maven in the topmost directory? On 03/13/2014 10:09 AM, Kevin Moulart wrote: I did, but then it fails because of these missing files: https://gist.github.com/kmoulart/9524828 Kévin Moulart 2014-03-13 9:57 GMT+01:00 Sebastian Schelter s...@apache.org: Maven should generate the classes automatically. Have you tried running mvn -DskipTests clean install on the commandline?
Re: verbose output
To my knowledge, there is no such flag for mahout. You can check hadoop's logs for further information, however. On 03/13/2014 10:21 AM, Mahmood Naderan wrote: Hi, Is there any verbosity flag for hadoop and mahout commands? I cannot find such a thing in the command line. Regards, Mahmood
Re: Website, urgent help needed
Hi Scott, Create a jira ticket and attach your scripts and a text version of the page there. Best, Sebastian On 03/12/2014 03:27 PM, Scott C. Cote wrote: I took the tour of the text analysis and pushed through despite the problems on the page. Committers helped me over the hump where others might have just given up (to your point). When I did it, I made shell scripts so that my steps would be repeatable, with an anticipation of updating the page. Unfortunately, I gave up on trying to figure out how to update the page (there were links indicating that I could do it), and I didn't want to appear to be stupid asking how to update the documentation (my bad - not anyone else). Now I know that it was not possible unless I was a committer. Who should I send my scripts to, or how should I proceed with a current form of the page? Scott
Re: Problem with FileSystem in Kmeans
Hi Bikash, Have you tried adding hdfs:// to your input path? Maybe that helps. --sebastian On 03/11/2014 11:22 AM, Bikash Gupta wrote: Hi, I am running KMeans in a cluster where I am setting the configuration of fs.hdfs.impl and fs.file.impl beforehand, as mentioned below: conf.set("fs.hdfs.impl", org.apache.hadoop.hdfs.DistributedFileSystem.class.getName()); conf.set("fs.file.impl", org.apache.hadoop.fs.LocalFileSystem.class.getName()); The problem is that the cluster-0 directory is getting created in the local file system and cluster-1 is getting created in HDFS, and the KMeans map reduce job is unable to find cluster-0. Please see the stacktrace below: 2014-03-11 14:52:15 o.a.m.c.AbstractJob [INFO] Command line arguments: {--clustering=null, --clusters=[/3/clusters-0-final], --convergenceDelta=[0.1], --distanceMeasure=[org.apache.mahout.common.distance.EuclideanDistanceMeasure], --endPhase=[2147483647], --input=[/2/sequence], --maxIter=[100], --method=[mapreduce], --output=[/5], --overwrite=null, --startPhase=[0], --tempDir=[temp]} 2014-03-11 14:52:15 o.a.h.u.NativeCodeLoader [WARN] Unable to load native-hadoop library for your platform... using builtin-java classes where applicable 2014-03-11 14:52:15 o.a.m.c.k.KMeansDriver [INFO] Input: /2/sequence Clusters In: /3/clusters-0-final Out: /5 2014-03-11 14:52:15 o.a.m.c.k.KMeansDriver [INFO] convergence: 0.1 max Iterations: 100 2014-03-11 14:52:16 o.a.h.m.JobClient [WARN] Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same. 2014-03-11 14:52:17 o.a.h.m.l.i.FileInputFormat [INFO] Total input paths to process : 3 2014-03-11 14:52:19 o.a.h.m.JobClient [INFO] Running job: job_201403111332_0011 2014-03-11 14:52:20 o.a.h.m.JobClient [INFO] map 0% reduce 0% 2014-03-11 14:52:28 o.a.h.m.JobClient [INFO] Task Id : attempt_201403111332_0011_m_00_0, Status : FAILED 2014-03-11 14:52:28 STDIO [ERROR] java.lang.IllegalStateException: /5/clusters-0 at org.apache.mahout.common.iterator.sequencefile.SequenceFileDirValueIterable.iterator(SequenceFileDirValueIterable.java:78) at org.apache.mahout.clustering.classify.ClusterClassifier.readFromSeqFiles(ClusterClassifier.java:208) at org.apache.mahout.clustering.iterator.CIMapper.setup(CIMapper.java:44) at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:138) at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:672) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:330) at org.apache.hadoop.mapred.Child$4.run(Child.java:268) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1438) at org.apache.hadoop.mapred.Child.main(Child.java:262) Caused by: java.io.FileNotFoundException: File /5/clusters-0 Please suggest!!!
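Following Sebastian's suggestion, a sketch of qualifying all paths against the intended filesystem up front, so that a path like /5 cannot silently resolve against the local filesystem; the namenode address is a placeholder:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class QualifiedPaths {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("fs.default.name", "hdfs://namenode:9000"); // placeholder address

    // makeQualified pins each path to a concrete scheme and authority,
    // so every iteration of the driver resolves against the same filesystem.
    FileSystem hdfs = FileSystem.get(conf);
    Path input = hdfs.makeQualified(new Path("/2/sequence"));
    Path clustersIn = hdfs.makeQualified(new Path("/3/clusters-0-final"));
    Path output = hdfs.makeQualified(new Path("/5"));

    System.out.println(input);  // e.g. hdfs://namenode:9000/2/sequence
    System.out.println(clustersIn);
    System.out.println(output);
    // These qualified paths can then be handed to KMeansDriver.run(...).
  }
}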
Website, urgent help needed
Hi, As you've probably noticed, I've put in a lot of effort over the last days to kickstart cleaning up our website. I've thrown out a lot of stuff and have been startled by the amount of outdated and incorrect information on our website, as well as links pointing to nowhere. I think our lack of documentation makes it super hard for new people to use Mahout. A crucial next step is to clean up the documentation on classification and clustering. I cannot do this alone, because I don't have the time and I'm not so familiar with the background of the algorithms. I need volunteers to go through all the pages under Classification and Clustering on the website. For the algorithms, the content and claims of the articles need to be checked; for the examples, we need to make sure that everything still works as described. It would also be great to move articles from personal blogs to our website. Imagine that some developer wants to try out Mahout and takes one hour for that in the evening. She will go to our website, download Mahout, read the description of an algorithm and try to run an example. In the current state of the documentation, I'm afraid that most people will walk away frustrated, because the website does not help them as it should. Best, Sebastian PS: I will make my standpoint on whether Mahout should do a 1.0 release depend on whether we manage to clean up and maintain our documentation.
Re: Website, urgent help needed
We don't exactly have that page, but we have pages that touch parts of it, such as https://mahout.apache.org/users/basics/creating-vectors-from-text.html It would be great if you could create a jira ticket which lists the errors. I'll fix them then. Best, Sebastian On 03/12/2014 08:42 AM, Juan José Ramos wrote: Hi Sebastian, I am afraid I am only familiar with the recommendation part. In previous posts, I pointed out a couple of errors in this wiki page: https://cwiki.apache.org/confluence/display/MAHOUT/Quick+tour+of+text+analysis+using+the+Mahout+command+line If you are planning to keep it in the new website, I can help point them out again. Thanks a lot for your effort.
Re: Website, urgent help needed
Hi Pavan, Awesome that you're willing to help. The documentation is the set of pages listed under Clustering in the navigation bar on mahout.apache.org If you start working on one of the pages listed there (e.g. the k-Means doc), please create a jira ticket in our issue tracker with a title along the lines of Cleaning up the documentation for k-Means on the website. Put a list of errors and corrections into the jira and I (or some other committer) will make sure to fix the website. Thanks, Sebastian On 03/12/2014 08:48 AM, Pavan Kumar N wrote: I'll help with the clustering algorithms documentation. Do send me the old documentation and I will check and remove errors, or better, let me know how to proceed. Pavan
Re: Website, urgent help needed
Hi Manoj, Awesome that you're willing to help. I suggest we proceed analogously to the clustering cleanup: the documentation consists of the pages listed under Classification in the navigation bar on mahout.apache.org If you start working on one of the pages listed there (e.g. the Naive Bayes doc), please create a jira ticket in our issue tracker with a title along the lines of Cleaning up the documentation for Naive Bayes on the website. Put a list of errors and corrections into the jira and I (or some other committer) will make sure to fix the website. Best, Sebastian On 03/12/2014 09:05 AM, Manoj Awasthi wrote: Thanks, Sebastian, to you and the others for the effort in cleaning up the website interface. It looks much better (fonts, layout) and much more usable, if I may say. I will be happy to volunteer for the pages under classification in whatever ways I can. I would especially like to contribute by verifying that the examples provided work in the form they exist on the website, and I will be happy to make corrections wherever possible. If there is an initial backlog list which provides tasks at a granular level, that would be great; otherwise I can start looking at the pages myself. Manoj On Wed, Mar 12, 2014 at 12:33 PM, Sebastian Schelter s...@apache.org wrote: [...]
Re: Website, urgent help needed
Hi Kevin, Thank you for offering to help! Feel free to ask questions here about how to set up the sources in Eclipse. If you succeed, you could write up what you did and we could add this to the website, as I'm sure a lot of others will have the same problem. It would be great if you could start improving the javadoc; it's totally fine if your English is not perfect, we can always ask a native speaker to read over it. If you start working on the javadoc, please create a jira issue for that work before you start. Best, Sebastian On 03/12/2014 09:30 AM, Kevin Moulart wrote: I can confirm what Sebastian said. I'm fairly new at this, and I found myself so desperate at some point that I almost gave up on Mahout due to the lack of documentation, but my feeling is that it doesn't only concern the website: the API is too sparsely documented as well. At this point there is no simple way for a beginner to know what kind of format any one of the algorithms expects and what it outputs exactly, how to chain processes, etc. They might go as far as reading the javadoc (although not everyone does that), but they won't all, as I had to and did, download the sources and try making sense of them to get the information. Fortunately the mailing list is particularly active and one can find answers if one has the time and will to search and ask kindly, which is a great strength of Mahout, but the average beginner, wanting to just try the library, can't and won't do that. I'm willing to document the parts of the code I used and began to understand, however I've been facing difficulties setting up the maven project in Eclipse for now. Also, since I'm Belgian, English is not my mother tongue, so I'm almost certain to make mistakes, but I think it would take you less time to correct these few English mistakes than to write the documentation :) I'll go ahead and try to set things up with Eclipse, and if I don't succeed I'll write a mail to the dev list for help in that matter. I also can, if I find the time, continue my efforts of reporting bugs, broken or inaccurate links and descriptions on the website, if need be, and update my JIRA entry accordingly. Kévin Moulart 2014-03-12 8:48 GMT+01:00 Pavan Kumar N pavan.naraya...@gmail.com: [...]
Re: Website, urgent help needed
Here you can see all issues (resolved and unresolved) for the next release: https://issues.apache.org/jira/browse/MAHOUT-1413?jql=project%20%3D%20MAHOUT%20AND%20fixVersion%20%3D%201.0%20ORDER%20BY%20priority%20DESC When you start to work on the cleanup of a page, make sure that no ticket exists for it yet. If there isn't one, create a jira ticket with the name of the page in the title. --sebastian On 03/12/2014 11:20 AM, pramit choudhary wrote: Hi All, I would also like to participate in cleaning up the documentation. Since I am fairly new to the Mahout infrastructure, it will in turn help me understand things better. Do we already have a Jira ticket for organizing the documentation cleanup? I just want to be sure that I am not stepping on pages someone else has already updated. Thanks Regards, Pramit On Wed, Mar 12, 2014 at 3:07 AM, Sebastian Schelter s...@apache.org wrote: [...]
Re: Few questions about SVM configuration in Mahout
Hi Quentin, Mahout does not have SVMs. Best, Sebastian On 03/10/2014 10:38 AM, Quentin-Gabriel Thurier wrote: Hi all, just a few questions about the configuration of an SVM in Mahout: - Is it possible to do multi-class classification? - Which kernels are already available (linear, polynomial, rbf)? - Where can we find details about the way the algorithm has been distributed? Many thanks, Quentin
Re: [blog post] Comparing Document Classification Functions of Lucene and Mahout
Hi Koji, I've added a link to your article to our website: https://mahout.apache.org/general/books-tutorials-and-talks.html On 03/07/2014 03:29 AM, Koji Sekiguchi wrote: Hello, I just posted an article on Comparing Document Classification Functions of Lucene and Mahout. http://soleami.com/blog/comparing-document-classification-functions-of-lucene-and-mahout.html Comments are welcome. :) Thanks! koji
Re: Heap space
I usually do trial and error. Start with some very large value and do a binary search :) --sebastian On 03/09/2014 01:30 PM, Mahmood Naderan wrote: Excuse me, I added the -Xmx option and restarted the hadoop services using sbin/stop-all.sh sbin/start-all.sh but I still get the heap space error. How can I find the correct and needed heap size? Regards, Mahmood On Sunday, March 9, 2014 1:37 PM, Mahmood Naderan nt_mahm...@yahoo.com wrote: OK I found that I have to add this property to mapred-site.xml <property> <name>mapred.child.java.opts</name> <value>-Xmx2048m</value> </property> Regards, Mahmood On Sunday, March 9, 2014 11:39 AM, Mahmood Naderan nt_mahm...@yahoo.com wrote: Hello, I ran this command ./bin/mahout wikipediaXMLSplitter -d examples/temp/enwiki-latest-pages-articles.xml -o wikipedia/chunks -c 64 but got this error: Exception in thread "main" java.lang.OutOfMemoryError: Java heap space There are many web pages regarding this and the solution is to add -Xmx2048M, for example. My question is: that option should be passed to the java command, not to Mahout. As a result, running ./bin/mahout -Xmx 2048M shows that there is no such option. What should I do? Regards, Mahmood
Re: Welcome Andrew Musselman as new committer
Hi Pavan, Committership is given for engagement with the project, like providing documentation, answering questions on the mailing list, reviewing patches, testing patches and submitting patches. We currently have a discussion ongoing about the future of Mahout, feel free to participate. --sebastian On 03/07/2014 06:41 PM, Pavan Kumar N wrote: Congratulations to Andrew. It would be nice to have some information/background on how the PMC evaluated Andrew to become a committer. It would also be nice to know which future aspects/algorithms of machine learning Mahout is going to focus on. I have been keen to maintain code for one of the projects, and I mistakenly spent time developing a map-reduce version of a weighted linear regression solution procedure. Only recently did I see that Mahout's web pages were updated. Would appreciate any advice from Andrew and other PMC members. Pavan On 7 March 2014 22:56, Frank Scholten fr...@frankscholten.nl wrote: Congratulations Andrew! On Fri, Mar 7, 2014 at 6:12 PM, Sebastian Schelter s...@apache.org wrote: [...]
Welcome Andrew Musselman as new committer
Hi, this is to announce that the Project Management Committee (PMC) for Apache Mahout has asked Andrew Musselman to become committer and we are pleased to announce that he has accepted. Being a committer enables easier contribution to the project since in addition to posting patches on JIRA it also gives write access to the code repository. That also means that now we have yet another person who can commit patches submitted by others to our repo *wink* Andrew, we look forward to working with you in the future. Welcome! It would be great if you could introduce yourself with a few words :) Sebastian
Re: Rework our website
Thank you very much! Could you create a jira ticket and post the links there? That would be awesome, then we can track that this stuff gets fixed. Best, Sebastian On 03/06/2014 02:58 PM, Kevin Moulart wrote: Hi, I also prefer the second one. While I'm at it, there are several links that point to absent pages. I just clicked on all the links present on the page: http://mahout.apache.org/users/basics/quickstart.html And those links are broken: http://mahout.apache.org/users/basics/recommender-documentation.html http://mahout.apache.org/users/classification/partial-implementation.html http://mahout.apache.org/users/basics/TasteCommandLine http://mahout.apache.org/users/recommender/recommendationexamples.html http://mahout.apache.org/users/basics/parallel-frequent-pattern-mining.html http://mahout.apache.org/users/basics/mahout.ga.tutorial.html http://hadoop.apache.org.html/ That's just the ones I found in 2 minutes on the quickstart page. Best Regards, Kevin 2014-03-05 23:43 GMT+01:00 Sebastian Schelter s...@apache.org: At the moment, only committers can change the website, unfortunately. If you have text to add, I'm happy to work it in and add your name to our contributors list in the CHANGELOG. Best, Sebastian On 03/05/2014 04:58 PM, Scott C. Cote wrote: I had recently taken the text tour of Mahout, but I couldn't decipher a way to contribute updates to the tour (some of the file names have changed, etc). How would I start? (This was part of my offer to help with the documentation of Mahout.) SCott On 3/5/14 9:47 AM, Pat Ferrel p...@occamsmachete.com wrote: What no centered text?? ;-) Love either. BTW users are no longer able to contribute content to the wiki. Most CMSs have a way to allow input that is moderated. Might this make getting documentation help easier? Allow anyone to contribute but committers can filter out the bad, sort of like submitting patches. On Mar 5, 2014, at 4:11 AM, Sebastian Schelter s...@apache.org wrote: [...]
Re: Rework our website
Could you add the missing pages to the jira issue? I'll have a look later. On 03/06/2014 03:25 PM, Suneel Marthi wrote: I fixed some of the broken links. For some of the others, e.g. TasteCommandline and Recommendationexamples, either the pages have not been migrated or the links have to be purged. On Thursday, March 6, 2014 9:07 AM, Sebastian Schelter s...@apache.org wrote: Thank you very much! Could you create a jira ticket and post the links there? [...]
Re: Recommend items not rated by any user
Hi Juan, that is a good catch. CandidateItemsStrategy is the right place to implement this. Maybe we should simply extend its interface to add a parameter that says whether to keep or remove the current user's items? We could even do this in the abstract base class then. --sebastian On 03/05/2014 10:42 AM, Juan José Ramos wrote: In case somebody runs into the same situation, the key seems to be in the CandidateItemsStrategy being passed to the constructor of GenericItemBasedRecommender. Looking into the code, if no CandidateItemsStrategy is specified in the constructor, PreferredItemsNeighborhoodCandidateItemsStrategy is used, and as the documentation says, its doGetCandidateItems method returns all items that have not been rated by the user and that were preferred by another user that has preferred at least one item that the current user has preferred too. So, a different CandidateItemsStrategy needs to be passed. For this problem, it seems to me that AllSimilarItemsCandidateItemsStrategy and AllUnknownItemsCandidateItemsStrategy are good candidates. Does anybody know where to find some documentation about the different CandidateItemsStrategy implementations? Based on the names, I would say that: 1) AllSimilarItemsCandidateItemsStrategy returns all similar items, regardless of whether they have already been rated by someone or not. 2) AllUnknownItemsCandidateItemsStrategy returns all similar items that have not been rated by anyone yet. Does anybody know if it works like that? Thanks. On Tue, Mar 4, 2014 at 9:16 AM, Juan José Ramos jjar...@gmail.com wrote: First thing is that I know this requirement would not make sense in a CF recommender. In my case, I am trying to use Mahout to create something closer to a content-based recommender. In particular, I am pre-computing a similarity matrix between all the documents (items) of my catalogue and using that matrix as the ItemSimilarity for my item-based recommender. So, when a user rates a document, how could I make the recommender output documents similar to the ones the user has already rated, even if no other user in the system has rated them yet? Is that even possible in the first place? Thanks a lot.
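A minimal sketch of wiring a non-default strategy into the recommender, using the classes named in this thread; the file paths are hypothetical placeholders, and the detail that AllUnknownItemsCandidateItemsStrategy implements both strategy interfaces (so one instance can be passed twice) is how the abstract base class works, IIRC:

import java.io.File;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.recommender.AllUnknownItemsCandidateItemsStrategy;
import org.apache.mahout.cf.taste.impl.recommender.GenericItemBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.file.FileItemSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.similarity.ItemSimilarity;

public class UnknownItemsExample {
  public static void main(String[] args) throws Exception {
    DataModel dataModel = new FileDataModel(new File("data/dataModel.txt")); // hypothetical path
    ItemSimilarity similarity = new FileItemSimilarity(new File("data/similarities")); // hypothetical path

    // One instance serves as both CandidateItemsStrategy and
    // MostSimilarItemsCandidateItemsStrategy, so it is passed for both arguments.
    AllUnknownItemsCandidateItemsStrategy strategy = new AllUnknownItemsCandidateItemsStrategy();

    GenericItemBasedRecommender recommender =
        new GenericItemBasedRecommender(dataModel, similarity, strategy, strategy);
    System.out.println(recommender.recommend(1L, 10));
  }
}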
Rework our website
Hi everyone, In our latest discussion, I argued that the lack (and errors) of documentation on our website is one of the main pain points of Mahout atm. To be honest, I'm also not very happy with the design; especially the fonts and spacing make it super hard to read long articles. This also prevents me from wanting to add articles and documentation. I think we should have a beautiful website where it is fun to add new stuff. My design skills are pretty limited, but fortunately my brother is an art director! I asked him to make our website a bit more beautiful without changing too much of the structure, so that a redesign wouldn't take too long. I really like the results and would volunteer to dig out my CSS skills and do the redesign, if people agree. Here are his drafts, I like the second one best: https://people.apache.org/~ssc/mahout/mahout.jpg https://people.apache.org/~ssc/mahout/mahout2.jpg Let me know what you think! Best, Sebastian
Re: Recommend items not rated by any user
On 03/05/2014 01:23 PM, Juan José Ramos wrote: Thanks for the reply, Sebastian. I am not sure if that should be implemented in the abstract base class though, because for instance PreferredItemsNeighborhoodCandidateItemsStrategy, by definition, returns the items not rated by the user and rated by somebody else. Good point. So we seem to need special implementations. Back to my last post, I have been playing around with AllSimilarItemsCandidateItemsStrategy and AllUnknownItemsCandidateItemsStrategy, and although they both do what I wanted (recommend items not previously rated by any user), I honestly can't tell the difference between the two strategies. In my tests the output was always the same. If the eventual output of the recommender will not include items already rated by the user, as pointed out here ( http://mail-archives.apache.org/mod_mbox/mahout-user/201403.mbox/%3CCABHkCkuv35dbwF%2B9sK88FR3hg7MAcdv0MP10v-5QWEvwmNdY%2BA%40mail.gmail.com%3E), AllSimilarItemsCandidateItemsStrategy should be equivalent to AllUnknownItemsCandidateItemsStrategy, shouldn't it? AllSimilarItems returns all items that are similar to any item that the user already knows. AllUnknownItems simply returns all items that the user has not interacted with yet. These are two different things, although they might overlap in some scenarios. Best, Sebastian Thanks. On Wed, Mar 5, 2014 at 10:23 AM, Sebastian Schelter s...@apache.org wrote: [...]
Re: Recommend items not rated by any user
So both strategies seem to be effectively the same; I don't know what the implementers had in mind when designing AllSimilarItemsCandidateItemsStrategy. It can take a long time to estimate preferences for all items a user doesn't know, especially if you have a lot of items. Traditional item-based recommenders will not recommend any item that is not similar to at least one of the items the user interacted with, so the AllSimilarItems strategy already selects the maximum set of items that could potentially be recommended to the user. --sebastian On 03/05/2014 05:38 PM, Tevfik Aytekin wrote: If the similarities between item 5 and two of the items user 1 preferred are not NaN, then it will return it, that is what I'm saying. If the similarities were all NaN then it would not return it. But surely, you might wonder: if all similarities between an item and the user's items are NaN, then AllUnknownItemsCandidateItemsStrategy probably will not end up recommending it either. On Wed, Mar 5, 2014 at 6:06 PM, Juan José Ramos jjar...@gmail.com wrote: @Tevfik, running this recommender: GenericItemBasedRecommender itemRecommender = new GenericItemBasedRecommender(dataModel, itemSimilarity, new AllSimilarItemsCandidateItemsStrategy(itemSimilarity), new AllSimilarItemsCandidateItemsStrategy(itemSimilarity)); With this dataModel: 1,1,1.0 1,2,2.0 1,3,1.0 1,4,2.0 2,1,1.0 2,2,4.0 And these similarities: 1,2,0.1 1,3,0.2 1,4,0.3 2,3,0.5 3,4,0.5 5,1,0.2 5,2,1.0 it returns item 5 for user 1. So item 5 has not been preferred by user 1, and the similarities between item 5 and two of the items user 1 preferred are not NaN, but AllSimilarItemsCandidateItemsStrategy is returning that item. So, I'm truly sorry to insist on this, but I still really do not get the difference. On Wed, Mar 5, 2014 at 2:53 PM, Tevfik Aytekin tevfik.ayte...@gmail.com wrote: Juan, you got me wrong. AllSimilarItemsCandidateItemsStrategy returns all items that have not been rated by the user and for which the similarity metric returns a non-NaN similarity value with at least one of the items preferred by the user. So, it does not simply return all items that have not been rated by the user. For example, if there is an item X which has not been rated by the user and the similarity values between X and all of the items rated (preferred) by the user are NaN, then X will not be returned by AllSimilarItemsCandidateItemsStrategy, but it will be returned by AllUnknownItemsCandidateItemsStrategy. On Wed, Mar 5, 2014 at 4:42 PM, Juan José Ramos jjar...@gmail.com wrote: Hi Tevfik, Thanks for the response. I think what you say contradicts what Sebastian pointed out before. Also, if AllSimilarItemsCandidateItemsStrategy returns all items that have not been rated by the user, what would AllUnknownItemsCandidateItemsStrategy return? On Wed, Mar 5, 2014 at 1:40 PM, Tevfik Aytekin tevfik.ayte...@gmail.com wrote: Sorry, there was a typo in the previous paragraph. If I remember correctly, AllSimilarItemsCandidateItemsStrategy returns all items that have not been rated by the user and for which the similarity metric returns a non-NaN similarity value with at least one of the items preferred by the user. On Wed, Mar 5, 2014 at 3:38 PM, Tevfik Aytekin tevfik.ayte...@gmail.com wrote: Hi Juan, If I remember correctly, AllSimilarItemsCandidateItemsStrategy returns all items that have not been rated by the user and the similarity metric returns a non-NaN similarity value that is with at least one of the items preferred by the user.
Tevfik On Wed, Mar 5, 2014 at 2:30 PM, Sebastian Schelter s...@apache.org wrote: [...]
Re: Recommend items not rated by any user
For SVD-based algorithms, you should use the AllUnknownItems strategy then, that's correct. In the majority of industry use cases that I have seen, people use pre-computed item similarities (Mahout has lots of machinery for doing this, btw), so AllSimilarItems totally makes sense there. --sebastian On 03/05/2014 06:01 PM, Tevfik Aytekin wrote: It can even make things worse in SVD-based algorithms, for which preference estimation is very fast. On Wed, Mar 5, 2014 at 7:00 PM, Tevfik Aytekin tevfik.ayte...@gmail.com wrote: Hi Sebastian, But in order not to select items that are not similar to at least one of the items the user interacted with, you have to compute the similarities with all of the user's items (which is the main task in estimating the preference of an item in the item-based method). So, it seems to me that AllSimilarItemsStrategy does not bring much advantage over AllUnknownItemsCandidateItemsStrategy. On Wed, Mar 5, 2014 at 6:46 PM, Sebastian Schelter s...@apache.org wrote: [...]
Re: Rework our website
At the moment, only committers can change the website, unfortunately. If you have text to add, I'm happy to work it in and add your name to our contributors list in the CHANGELOG. Best, Sebastian On 03/05/2014 04:58 PM, Scott C. Cote wrote: [...]
Re: Mahout-232-0.8.patch using
I think you should rather choose a different library that already offers an SVM than try to revive a 4-year-old patch. --sebastian On 03/04/2014 08:51 AM, Amol Kakade wrote: Hi, I am a new user of Mahout and want to run a sample SVM algorithm with Mahout. Can you please list the steps to use Mahout-232-0.8.patch for SVM in Mahout? I have been trying for the last 2 days but keep getting errors. -- Amol Kakade.
Re: how to recommend users already consumed items
I think we should introduce a new parameter for the recommend() method in the Recommender interface that tells whether already-known items should be recommended or not. What do you think? Best, Sebastian On 03/04/2014 05:32 PM, Pat Ferrel wrote: I’d suggest a command line option if you want to submit a patch. Most people will want that line executed, so the default should be the current behavior. But a large minority will want it your way. And please do submit a patch with the Jira; it will make your life easier when new releases come out, since you won’t have to manage a fork. On Mar 2, 2014, at 12:38 PM, Mario Levitin mariolevi...@gmail.com wrote: Juan, I don't understand your solution; if there are no ratings, how can you blend the recommendations from the system with the user's already-read news? Anyway, I think, as Pat does, that the best way is to remove the mentioned line. It should be the responsibility of the business logic to remove the user's items if needed. I will also create a Jira issue as you suggested. thanks On Sun, Mar 2, 2014 at 7:12 PM, Ted Dunning ted.dunn...@gmail.com wrote: On Sun, Mar 2, 2014 at 8:52 AM, Pat Ferrel p...@occamsmachete.com wrote: You are not the only one to see this, so I'd recommend creating an option for the Job, which will be checked before executing that line of code, then submitting it as a patch to the Jira you need to create in any case. That way it might get into the mainline and you won't have to maintain a fork. Avoiding the cost of a fork over a trivial issue like this is a grand idea.
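For concreteness, a sketch of what the proposed extension could look like; this is only the idea under discussion in this thread, not an existing Mahout API, and the parameter name includeKnownItems is made up here:

// Hypothetical addition to org.apache.mahout.cf.taste.recommender.Recommender
// (other methods of the interface omitted). Not part of the released API.
import java.util.List;
import org.apache.mahout.cf.taste.common.Refreshable;
import org.apache.mahout.cf.taste.common.TasteException;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;

public interface Recommender extends Refreshable {

  // existing behavior: items the user already knows are filtered out
  List<RecommendedItem> recommend(long userID, int howMany) throws TasteException;

  // proposed overload: passing true skips the filtering of known items,
  // passing false keeps the current behavior
  List<RecommendedItem> recommend(long userID, int howMany, boolean includeKnownItems)
      throws TasteException;
}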
Re: how to recommend users already consumed items
That's fine, I was talking about the non-distributed part only. This page has instructions on how to create patches: https://mahout.apache.org/developers/how-to-contribute.html Let me know if you need more info! Best, Sebastian On 03/05/2014 12:27 AM, Mario Levitin wrote: I have created a Jira issue already. I only use the non-hadoop part of the Mahout recommender algorithms. Maybe I can create a patch for that part. However, I have not done it before and don't know how to proceed. On Wed, Mar 5, 2014 at 1:01 AM, Sebastian Schelter s...@apache.org wrote: Would you be willing to set up a jira issue and create a patch for this? --sebastian On 03/04/2014 11:58 PM, Mario Levitin wrote: I think we should introduce a new parameter for the recommend() method in the Recommender interface that tells whether already known items should be recommended or not. I agree (if the parameter is missing, it defaults to the current behavior, as Pat suggested). On 03/04/2014 05:32 PM, Pat Ferrel wrote: [...]
Re: Issue updating a FileDataModel
Hi Juan, IIRC, FileDataModel has a parameter that determines how much time must have elapsed since the last modification of the underlying file. You can also directly append new data to the original file. If you want a DataModel that can be concurrently updated, I suggest moving your data to a database. --sebastian On 03/02/2014 11:11 PM, Juan José Ramos wrote: I am having issues refreshing my recommender, in particular with the DataModel. I am using a FileDataModel and a GenericItemBasedRecommender that also has a CachingItemSimilarity wrapping a FileItemSimilarity. But for the test I am running I am making things even simpler. By the time I instantiate the recommender, these two files are in the file system: data/datamodel.txt 0,1,0.0 data/datamodel.0.txt 0,2,1.0 And then I run the code you can find below: --- FileDataModel dataModel = new FileDataModel(new File("data/dataModel.txt")); FileItemSimilarity itemSimilarity = new FileItemSimilarity(new File("data/similarities")); GenericItemBasedRecommender itemRecommender = new GenericItemBasedRecommender(dataModel, itemSimilarity); System.out.println("Number of users in the system: " + itemRecommender.getDataModel().getNumUsers() + " and " + itemRecommender.getDataModel().getNumItems() + " items"); FileWriter writer = new FileWriter(new File("data/dataModel.1.txt")); writer.write("1,2,1.0\r"); writer.close(); writer = new FileWriter(new File("data/dataModel.2.txt")); writer.write("2,2,1.0\r"); writer.close(); writer = new FileWriter(new File("data/dataModel.3.txt")); writer.write("3,2,1.0\r"); writer.close(); writer = new FileWriter(new File("data/dataModel.4.txt")); writer.write("4,2,1.0\r"); writer.close(); writer = new FileWriter(new File("data/dataModel.5.txt")); writer.write("5,2,1.0\r"); writer.close(); writer = new FileWriter(new File("data/dataModel.6.txt")); writer.write("6,2,1.0\r"); writer.close(); itemRecommender.refresh(null); System.out.println("Number of users in the system: " + itemRecommender.getDataModel().getNumUsers() + " and " + itemRecommender.getDataModel().getNumItems() + " items"); --- The output is the same in both println calls: Number of users in the system: 2 and 2 items. So, only the information from the files that were on the system by the time I ran this test seems to get loaded into the DataModel. What can be causing that? Is there a maximum number of updates a FileDataModel can take up in every refresh? Could it be that by the time I call itemRecommender.refresh(null) the files have not been written to the file system yet? Should I be calling refresh in a different manner? Thank you for your help.
Re: classification in standalone application in Apache Mahout 0.9
If you don't want to call a shell, I assume you don't want to use a Hadoop cluster, right? In that case, you should rather try Mahout's logistic regression classifier, which is tuned for usage on a single machine. --sebastian On 03/03/2014 03:07 PM, Hollow Quincy wrote: I am looking for a simple example in Java (without any shell call) of how to use NaiveBayesClassifier in Apache Mahout 0.9. I have samples of text. I want to train the algorithm on this data and then classify a new text. class Main { public static void main(String[] args) { // train the algorithm on some data // classify some data } } There is no example of how to do this in Apache Mahout 0.9! Thanks for help
Re: classification in standalone application in Apache Mahout 0.9
It's certainly possible to run Hadoop on a single machine, but it will give you terrible performance. We don't have a single-machine implementation of naive bayes, so I'd really suggest you use the logistic regression code. --sebastian On 03/03/2014 03:15 PM, Hollow Quincy wrote: You are right. I want to run my program on a single machine as a classic public static void main() standalone application. In my opinion, Naive Bayes classification would suit my problem well. Is there a way to call it from my Java code? I cannot find any example. Thanks for help 2014-03-03 15:11 GMT+01:00 Sebastian Schelter s...@apache.org: [...]
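As a pointer in that direction, a minimal single-machine sketch with the SGD-based logistic regression; the toy features and labels are made up, and real text would first have to be turned into Vectors (e.g. with Mahout's feature encoders):

import org.apache.mahout.classifier.sgd.L1;
import org.apache.mahout.classifier.sgd.OnlineLogisticRegression;
import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.Vector;

public class Main {
  public static void main(String[] args) {
    // 2 categories, 3 features, L1 prior -- all values here are arbitrary
    OnlineLogisticRegression learner =
        new OnlineLogisticRegression(2, 3, new L1()).learningRate(0.1).lambda(1.0e-5);

    double[][] samples = { {1, 0, 1}, {0, 1, 0}, {1, 1, 1}, {0, 0, 1} };
    int[] labels = { 1, 0, 1, 0 };

    // learn: several passes of online training over the toy data
    for (int pass = 0; pass < 10; pass++) {
      for (int i = 0; i < samples.length; i++) {
        learner.train(labels[i], new DenseVector(samples[i]));
      }
    }

    // classify: classifyFull() returns a vector of per-category probabilities
    Vector p = learner.classifyFull(new DenseVector(new double[] {1, 0, 0}));
    System.out.println("p(class = 1) = " + p.get(1));
  }
}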
Re: Issue updating a FileDataModel
I think it depends on the difference between the time of the call to refresh() and the last-modified time of the file. --sebastian On 03/03/2014 04:45 PM, Juan José Ramos wrote: Thanks for the reply, Sebastian. I do not have concurrent updates, but they actually may happen very, very close in time. Would adding the new preferences to new files rather than appending to the existing one make any difference, or does everything depend on the time elapsed between two calls to recommender.refresh(null)? Many thanks. On Mon, Mar 3, 2014 at 1:18 PM, Sebastian Schelter s...@apache.org wrote: [...]
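For reference, the constructor in question looks roughly like this; IIRC the default minimum reload interval is 60 seconds, but check the javadoc of your Mahout version before relying on the exact signature:

import java.io.File;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;

public class ReloadIntervalExample {
  public static void main(String[] args) throws Exception {
    // FileDataModel(File dataFile, boolean ignoreRatings, long minReloadIntervalMS):
    // refresh() only re-reads the file if it has been modified and at least
    // minReloadIntervalMS have passed -- here 2 seconds instead of the default
    FileDataModel dataModel =
        new FileDataModel(new File("data/dataModel.txt"), false, 2000L);
    dataModel.refresh(null);
  }
}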
Re: Mahout-232-0.8.patch using
Hi Amol, SVMs are not integrated in Mahout. I'd suggest you try our logistic regression classifier instead. Best, Sebastian On 03/04/2014 08:51 AM, Amol Kakade wrote: [...]
Re: parallelALS and RMSE TEST
The output of parallelALS consists of two matrices U and M whose product is an approximation of your input matrix. The matrices are written out as sequence files with an IntWritable as key (the index of the row in the matrix) and a VectorWritable as value, which holds the contents of the row vector. --sebastian On 02/27/2014 06:30 PM, AJ Rader wrote: Sean Owen srowen at gmail.com writes: Parallel ALS is exactly an example of where you can use matrix factorization for 0/1 data. On Mon, May 6, 2013 at 9:22 PM, Tevfik Aytekin tevfik.aytekin at gmail.com wrote: Hi Sean, Aren't boolean preferences supported in the context of memory-based recommendation algorithms in Mahout? Are there matrix factorization algorithms in Mahout which can work with this kind of data (that is, the kind of data which consists of users and the movies they have seen)? On Mon, May 6, 2013 at 10:34 PM, Sean Owen srowen at gmail.com wrote: Yes, it goes by the name 'boolean prefs' in the project, since target variables don't have values -- they just exist or don't. So, yes, it's certainly supported, but the question here is how to evaluate the output. On Mon, May 6, 2013 at 8:29 PM, Tevfik Aytekin tevfik.aytekin at gmail.com wrote: This problem is called the one-class classification problem. In the domain of collaborative filtering it is called one-class collaborative filtering (since what you have are only positive preferences). You may search the web with these keywords to find papers providing solutions. I'm not sure whether Mahout has algorithms for one-class collaborative filtering. On Mon, May 6, 2013 at 1:42 PM, Sean Owen srowen at gmail.com wrote: ALS-WR weights the error on each term differently, so the average error doesn't really have meaning here, even if you are comparing the difference with 1. I think you will need to fall back to mean average precision or something. On Mon, May 6, 2013 at 11:24 AM, William icswilliam2010 at gmail.com wrote: Sean Owen srowen at gmail.com writes: If you have no ratings, how are you using RMSE? This typically measures error in reconstructing ratings. I think you are probably measuring something meaningless. I suppose the rating of seen movies is 1. Is that right? If I use collaborative filtering with ALS-WR to get some recommendations, must I have a real rating matrix? I was wondering what kind of format the output produced by parallelALS is stored in. More specifically, I am looking for a way to decode/read this information. I have been able to run the mahout parallelALS command, calculate RMSE using mahout evaluateFactorization, and generate recommendations via mahout recommendfactorized. However, I would like to take a closer look at things like the factorized products for my probeSet (stored in --tempDir from the 'mahout evaluateFactorization' command) and the actual feature vectors stored in the /out/U/ and /out/M/ directories. thanks AJ
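A sketch for inspecting those sequence files from Java; the out/U path is taken from the question above, and SequenceFileDirIterable is Mahout's helper for exactly this IntWritable/VectorWritable layout:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.mahout.common.Pair;
import org.apache.mahout.common.iterator.sequencefile.PathFilters;
import org.apache.mahout.common.iterator.sequencefile.PathType;
import org.apache.mahout.common.iterator.sequencefile.SequenceFileDirIterable;
import org.apache.mahout.math.VectorWritable;

public class DumpFactors {
  public static void main(String[] args) {
    // iterate over all part-files of the user-factor matrix U
    for (Pair<IntWritable, VectorWritable> record :
        new SequenceFileDirIterable<IntWritable, VectorWritable>(
            new Path("out/U"), PathType.LIST, PathFilters.partFilter(), new Configuration())) {
      int rowIndex = record.getFirst().get();       // row of the matrix (user index)
      System.out.println(rowIndex + " -> " + record.getSecond().get()); // feature vector
    }
  }
}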
Re: Load output of rowsimilarity to memory
Hi Juan, It would definitely be nice to have that in the API! It would be great if you could submit a patch after you implement this. Best, Sebastian On 02/25/2014 10:52 AM, Juan José Ramos wrote: Thanks for the answer. That was the approach I had in mind in the first place; the only difference is that I would write the output to a file that can later be used to create a FileItemSimilarity. I think that would be a very nice feature to have in the API. Thanks again. On Mon, Feb 24, 2014 at 9:27 PM, Sebastian Schelter s...@apache.org wrote: I overlooked that you're interested in document similarities. Sorry again :) Another way would be to read the output of RowSimilarityJob with a o.a.m.common.iterator.sequencefile.SequenceFileDirIterable You create a list of instances of o.a.m.cf.taste.impl.similarity.GenericItemSimilarity.ItemItemSimilarity e.g. for the output Key: 0: Value: {61112:0.21139380179557016,52144:0.23797846026935565,...} you would do list.add(new ItemItemSimilarity(0, 61112, 0.21139380179557016)); list.add(new ItemItemSimilarity(0, 52144, 0.23797846026935565)); ... After that you create a GenericItemSimilarity from the list of ItemItemSimilarities, which is the in-memory item similarity you asked for. Hope that helps, Sebastian On 02/24/2014 10:04 PM, Juan José Ramos wrote: Correct me if I'm wrong, but isn't the ItemSimilarityJob meant to be for item-based CF? In particular, in the documentation I can read that: Preferences in the input file should look like userID,itemID[,preferencevalue] And in my case the input I have is just text documents, and I want to pre-compute similarities between them beforehand, even before any user has expressed any preference value for any item. In order to use ItemSimilarityJob for this purpose, what should be the input I need to provide? Would it be the output of seq2sparse? Thanks again. On Mon, Feb 24, 2014 at 8:54 PM, Sebastian Schelter s...@apache.org wrote: You're right, my bad. If you don't use RowSimilarityJob directly, but org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob (which calls RowSimilarityJob under the covers), your output will be a textfile that is directly usable with FileItemSimilarity. --sebastian On 02/24/2014 09:30 PM, Juan José Ramos wrote: Thanks for the prompt reply. RowSimilarityJob produces output in the form of: Key: 0: Value: {61112:0.21139380179557016, 52144:0.23797846026935565,...} whereas FileItemSimilarity is expecting comma- or tab-separated input. I assume that you meant that the output of RowSimilarityJob can be loaded by the FileItemSimilarity after doing the appropriate parsing. Is that correct, or is there actually a way to load the raw output of RowSimilarityJob into FileItemSimilarity? Thanks. On Mon, Feb 24, 2014 at 7:41 PM, Sebastian Schelter s...@apache.org wrote: The output of RowSimilarityJob can be loaded by the FileItemSimilarity. --sebastian On 02/24/2014 08:31 PM, Juan José Ramos wrote: Is there a way to reproduce this process: https://cwiki.apache.org/confluence/display/MAHOUT/Quick+tour+of+text+analysis+using+the+Mahout+command+line inside Java code and not using the command line tool? I am not interested in the clustering part but in 'Calculate several similar docs to each doc in the data'. In particular, I am interested in loading the output of the rowsimilarity tool into memory to be used as my custom ItemSimilarity implementation for an ItemBasedRecommender.
What I exactly want is to have a matrix in memory where for every doc in my catalogue I have the similarity with the 100 (that is the threshold I am using) most similar items and an undefined similarity for the rest. Is it possible to do this with the Java API? I know it can be done by calling the commands from inside the Java code, and I guess also by using the corresponding SparseVectorsFromSequenceFiles, DistributedRowMatrix and RowSimilarityJob classes. But I still cannot see an easy way of parsing the output of RowSimilarityJob into the memory representation I intend to use. Thanks a lot.
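To tie the pieces of this thread together, here is a rough sketch of reading the rowsimilarity output into the in-memory GenericItemSimilarity described above; the output path is an assumption, and on Mahout versions before 0.9 the non-zero iteration is iterateNonZero() rather than nonZeroes():

import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.mahout.cf.taste.impl.similarity.GenericItemSimilarity;
import org.apache.mahout.cf.taste.similarity.ItemSimilarity;
import org.apache.mahout.common.Pair;
import org.apache.mahout.common.iterator.sequencefile.PathFilters;
import org.apache.mahout.common.iterator.sequencefile.PathType;
import org.apache.mahout.common.iterator.sequencefile.SequenceFileDirIterable;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;

Configuration conf = new Configuration();
List<GenericItemSimilarity.ItemItemSimilarity> sims =
    new ArrayList<GenericItemSimilarity.ItemItemSimilarity>();
for (Pair<IntWritable, VectorWritable> row
    : new SequenceFileDirIterable<IntWritable, VectorWritable>(
          new Path("output/rowsimilarity"), PathType.LIST, PathFilters.partFilter(), conf)) {
  long docID = row.getFirst().get();
  for (Vector.Element e : row.getSecond().get().nonZeroes()) {
    // e.index() is the id of a similar doc, e.get() is the similarity value
    sims.add(new GenericItemSimilarity.ItemItemSimilarity(docID, e.index(), e.get()));
  }
}
ItemSimilarity similarity = new GenericItemSimilarity(sims);

Pairs that never appear in the output stay unknown to GenericItemSimilarity, which matches the "undefined similarity for the rest" requirement above.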
Re: Load output of rowsimilarity to memory
If you iterate over the vector, you will get Vector.Element objects. elem.index() gives you the id of the similar thing, elem.get() gives you the similarity value. --sebastian On 02/25/2014 11:58 AM, Juan José Ramos wrote: Regarding the parsing of a VectorWritable object, what is the recommended approach to access the different 'DocID: similarity' pairs? I can see that if I get the String representation of the org.apache.mahout.math.Vector object it should not be hard to parse using the text representation. However, is there a way to access the individual elements of the 'DocID: similarity' pair? I tried iterating through the individual Vector.Element objects and calling get(), but that does not return what I expect. More than happy to contribute to the project once I get this working. Thanks a lot. On Tue, Feb 25, 2014 at 9:52 AM, Juan José Ramos jjar...@gmail.com wrote: Thanks for the answer. That was the approach I had in mind in the first place; the only difference would be that I would write the output to a file that can later be used to create a FileItemSimilarity. I think that would be a very nice feature to have in the API. Thanks again. On Mon, Feb 24, 2014 at 9:27 PM, Sebastian Schelter s...@apache.org wrote: I overlooked that you're interested in document similarities. Sorry again :) Another way would be to read the output of RowSimilarityJob with a o.a.m.common.iterator.sequencefile.SequenceFileDirIterable. You create a list of instances of o.a.m.cf.taste.impl.similarity.GenericItemSimilarity.ItemItemSimilarity e.g. for the output Key: 0: Value: {61112:0.21139380179557016,52144:0.23797846026935565,...} you would do list.add(new ItemItemSimilarity(0, 61112, 0.21139380179557016)); list.add(new ItemItemSimilarity(0, 52144, 0.23797846026935565)); ... After that you create a GenericItemSimilarity from the list of ItemItemSimilarities, which is the in-memory item similarity you asked for. Hope that helps, Sebastian On 02/24/2014 10:04 PM, Juan José Ramos wrote: Correct me if I'm wrong, but isn't the ItemSimilarityJob meant to be for item-based CF? In particular, in the documentation I can read that: Preferences in the input file should look like userID,itemID[,preferencevalue] And in my case the input I have is just text documents and I want to pre-compute similarities between them beforehand, even before any user has expressed any preference value for any item. In order to use ItemSimilarityJob for this purpose, what should be the input I need to provide? Would it be the output of seq2sparse? Thanks again. On Mon, Feb 24, 2014 at 8:54 PM, Sebastian Schelter s...@apache.org wrote: You're right, my bad. If you don't use RowSimilarityJob directly, but org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob (which calls RowSimilarityJob under the covers), your output will be a text file that is directly usable with FileItemSimilarity. --sebastian On 02/24/2014 09:30 PM, Juan José Ramos wrote: Thanks for the prompt reply. RowSimilarityJob produces an output in the form of: Key: 0: Value: {61112:0.21139380179557016, 52144:0.23797846026935565,...} whereas FileItemSimilarity is expecting comma- or tab-separated input. I assume that you meant that the output of RowSimilarityJob can be loaded by the FileItemSimilarity after doing the appropriate parsing. Is that correct, or is there actually a way to load the raw output of RowSimilarityJob into FileItemSimilarity? Thanks. 
On Mon, Feb 24, 2014 at 7:41 PM, Sebastian Schelter s...@apache.org wrote: The output of RowSimilarityJob can be loaded by the FileItemSimilarity. --sebastian On 02/24/2014 08:31 PM, Juan José Ramos wrote: Is there a way to reproduce this process: https://cwiki.apache.org/confluence/display/MAHOUT/Quick+tour+of+text+analysis+using+the+Mahout+command+line inside Java code and not using the command line tool? I am not interested in the clustering part but in 'Calculate several similar docs to each doc in the data'. In particular, I am interested in loading the output of the rowsimilarity tool into memory to be used as my custom ItemSimilarity implementation for an ItemBasedRecommender. What I exactly want is to have a matrix in memory where for every doc in my catalogue I have the similarity with the 100 (that is the threshold I am using) most similar items and an undefined similarity for the rest. Is it possible to do this with the Java API? I know it can be done by calling the commands from inside the Java code, and I guess also by using the corresponding SparseVectorsFromSequenceFiles, DistributedRowMatrix and RowSimilarityJob classes. But I still cannot see an easy way of parsing the output of RowSimilarityJob into the memory representation I intend to use. Thanks a lot.
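For the file-based route Juan mentions (writing the parsed output to a file and loading it with FileItemSimilarity later), the same SequenceFileDirIterable loop shown earlier in this thread can dump itemID1,itemID2,value lines instead; file names are made up and exception handling is omitted (imports for the sequence file iteration as in the earlier sketch):

import java.io.File;
import java.io.PrintWriter;
import org.apache.mahout.cf.taste.impl.similarity.file.FileItemSimilarity;
import org.apache.mahout.cf.taste.similarity.ItemSimilarity;

PrintWriter out = new PrintWriter(new File("doc-similarities.csv"), "UTF-8");
for (Pair<IntWritable, VectorWritable> row
    : new SequenceFileDirIterable<IntWritable, VectorWritable>(
          new Path("output/rowsimilarity"), PathType.LIST, PathFilters.partFilter(), conf)) {
  int docID = row.getFirst().get();
  for (Vector.Element e : row.getSecond().get().nonZeroes()) {
    out.println(docID + "," + e.index() + "," + e.get());  // itemID1,itemID2,similarity
  }
}
out.close();
ItemSimilarity similarity = new FileItemSimilarity(new File("doc-similarities.csv"));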
Re: Use Naïve Bayes on a large CSV
NaiveBayes expects a SequenceFile as input. The key is the class label as Text, the value is the features as a VectorWritable. --sebastian On 02/24/2014 11:51 AM, Kevin Moulart wrote: Hi again, I finally set my mind on going through Java to make a sequence file for the naive bayes, but I still can't manage to find anyplace stating exactly what should be in the sequence file for Mahout to process it with Naive Bayes. I tried virtually every piece of code I found related to this subject, with no luck. My CSV file is like this: label that I want to predict, feature 1, feature 2, ..., feature 1628 Could someone tell me exactly what the Naive Bayes training procedure expects? 2014-02-20 13:56 GMT+01:00 Jay Vyas jayunit...@gmail.com: This relates to a previous question I have: Does Mahout have a concept of adapters which allow us to read CSV-style data with filters to create the exact format for its various inputs (i.e. the recommender's three-column format)? If not, is it worth a JIRA? On Feb 20, 2014, at 7:50 AM, Kevin Moulart kevinmoul...@gmail.com wrote: Hi and thanks! What about the command line, is there a way to do that using the existing command line? 2014-02-20 12:02 GMT+01:00 Suneel Marthi suneel_mar...@yahoo.com: To convert input CSV to vectors, you can either: a) use CSVIterator, or b) use InputDriver. Either of the above should generate vectors from input CSV that could then be fed into Mahout classifier/clustering jobs. On Thursday, February 20, 2014 5:57 AM, Kevin Moulart kevinmoul...@gmail.com wrote: Hi I'm trying to apply a Naive Bayes classifier to a large CSV file from the command line. I know I have to feed the classifier with a seq file, so I tried to put my CSV into one using the command seqdirectory, but even when I try with a really small CSV (less than 100 MB) I instantly get an OutOfMemoryError (Java heap space): mahout seqdirectory -i /user/cacf/Echant/testSeq -o /user/cacf/resSeq -ow MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath. 
Running on hadoop, using /opt/cloudera/parcels/CDH/lib/hadoop/bin/hadoop and HADOOP_CONF_DIR=/etc/hadoop/conf
MAHOUT-JOB: /usr/lib/mahout/mahout-examples-0.7-cdh4.5.0-job.jar
14/02/20 11:47:22 INFO common.AbstractJob: Command line arguments: {--charset=[UTF-8], --chunkSize=[64], --endPhase=[2147483647], --fileFilterClass=[org.apache.mahout.text.PrefixAdditionFilter], --input=[/user/cacf/Echant/testSeq], --keyPrefix=[], --output=[/user/cacf/resSeq], --overwrite=null, --startPhase=[0], --tempDir=[temp]}
14/02/20 11:47:22 INFO common.HadoopUtil: Deleting /user/cacf/resSeq
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
    at java.util.Arrays.copyOf(Arrays.java:2367)
    at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:130)
    at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:114)
    at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:415)
    at java.lang.StringBuilder.append(StringBuilder.java:132)
    at org.apache.mahout.text.PrefixAdditionFilter.process(PrefixAdditionFilter.java:62)
    at org.apache.mahout.text.SequenceFilesFromDirectoryFilter.accept(SequenceFilesFromDirectoryFilter.java:90)
    at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1468)
    at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1502)
    at org.apache.mahout.text.SequenceFilesFromDirectory.run(SequenceFilesFromDirectory.java:98)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
    at org.apache.mahout.text.SequenceFilesFromDirectory.main(SequenceFilesFromDirectory.java:53)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:72)
    at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:144)
    at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:196)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:208)
Do you have an idea or a simple way to use Naive Bayes against my large CSV? Thanks in advance! -- Kévin Moulart GSM France : +33 7 81 06 10 10 GSM Belgique : +32 473 85 23 85 Téléphone fixe : +32 2 771 88 45 -- Kévin Moulart GSM France : +33 7 81 06 10 10 GSM Belgique : +32 473 85 23 85 Téléphone fixe : +32 2 771 88 45
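Since seqdirectory is meant for directories of whole text documents (the stack trace above shows it buffering an entire file into a StringBuilder, hence the heap error), it is the wrong tool for a numeric CSV; writing the sequence file directly is simpler. A rough sketch, with file names assumed; whether the label must be wrapped in a path-like key (for trainnb -el) depends on your Mahout version, so check before relying on it:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;

public static void csvToNaiveBayesInput(String csvFile, Path output, Configuration conf)
    throws IOException {
  FileSystem fs = FileSystem.get(conf);
  SequenceFile.Writer writer =
      SequenceFile.createWriter(fs, conf, output, Text.class, VectorWritable.class);
  BufferedReader reader = new BufferedReader(new FileReader(csvFile));
  try {
    String line;
    while ((line = reader.readLine()) != null) {
      String[] cols = line.split(",");
      // column 0 is the label; columns 1..n are the features (1628 of them here)
      Vector features = new DenseVector(cols.length - 1);
      for (int i = 1; i < cols.length; i++) {
        features.set(i - 1, Double.parseDouble(cols[i]));
      }
      // key: the class label as Text; value: the features as a VectorWritable
      writer.append(new Text("/" + cols[0] + "/"), new VectorWritable(features));
    }
  } finally {
    reader.close();
    writer.close();
  }
}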
Re: Load output of rowsimilarity to memory
The output of RowSimilarityJob can be loaded by the FileItemSimilarity. --sebastian On 02/24/2014 08:31 PM, Juan José Ramos wrote: Is there a way to reproduce this process: https://cwiki.apache.org/confluence/display/MAHOUT/Quick+tour+of+text+analysis+using+the+Mahout+command+line inside Java code and not using the command line tool? I am not interested in the clustering part but in 'Calculate several similar docs to each doc in the data'. In particular, I am interested in loading the output of the rowsimilarity tool into memory to be used as my custom ItemSimilarity implementation for an ItemBasedRecommender. What I exactly want is to have a matrix in memory where for every doc in my catalogue I have the similarity with the 100 (that is the threshold I am using) most similar items and an undefined similarity for the rest. Is it possible to do this with the Java API? I know it can be done by calling the commands from inside the Java code, and I guess also by using the corresponding SparseVectorsFromSequenceFiles, DistributedRowMatrix and RowSimilarityJob classes. But I still cannot see an easy way of parsing the output of RowSimilarityJob into the memory representation I intend to use. Thanks a lot.
Re: Load output of rowsimilarity to memory
You're right, my bad. If you don't use RowSimilarityJob directly, but org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob (which calls RowSimilarityJob under the covers), your output will be a text file that is directly usable with FileItemSimilarity. --sebastian On 02/24/2014 09:30 PM, Juan José Ramos wrote: Thanks for the prompt reply. RowSimilarityJob produces an output in the form of: Key: 0: Value: {61112:0.21139380179557016,52144:0.23797846026935565,...} whereas FileItemSimilarity is expecting comma- or tab-separated input. I assume that you meant that the output of RowSimilarityJob can be loaded by the FileItemSimilarity after doing the appropriate parsing. Is that correct, or is there actually a way to load the raw output of RowSimilarityJob into FileItemSimilarity? Thanks. On Mon, Feb 24, 2014 at 7:41 PM, Sebastian Schelter s...@apache.org wrote: The output of RowSimilarityJob can be loaded by the FileItemSimilarity. --sebastian On 02/24/2014 08:31 PM, Juan José Ramos wrote: Is there a way to reproduce this process: https://cwiki.apache.org/confluence/display/MAHOUT/Quick+tour+of+text+analysis+using+the+Mahout+command+line inside Java code and not using the command line tool? I am not interested in the clustering part but in 'Calculate several similar docs to each doc in the data'. In particular, I am interested in loading the output of the rowsimilarity tool into memory to be used as my custom ItemSimilarity implementation for an ItemBasedRecommender. What I exactly want is to have a matrix in memory where for every doc in my catalogue I have the similarity with the 100 (that is the threshold I am using) most similar items and an undefined similarity for the rest. Is it possible to do this with the Java API? I know it can be done by calling the commands from inside the Java code, and I guess also by using the corresponding SparseVectorsFromSequenceFiles, DistributedRowMatrix and RowSimilarityJob classes. But I still cannot see an easy way of parsing the output of RowSimilarityJob into the memory representation I intend to use. Thanks a lot.
Re: Load output of rowsimilarity to memory
I overlooked that you're interested in document similarities. Sorry again :) Another way would be to read the output of RowSimilarityJob with a o.a.m.common.iterator.sequencefile.SequenceFileDirIterable. You create a list of instances of o.a.m.cf.taste.impl.similarity.GenericItemSimilarity.ItemItemSimilarity e.g. for the output Key: 0: Value: {61112:0.21139380179557016,52144:0.23797846026935565,...} you would do list.add(new ItemItemSimilarity(0, 61112, 0.21139380179557016)); list.add(new ItemItemSimilarity(0, 52144, 0.23797846026935565)); ... After that you create a GenericItemSimilarity from the list of ItemItemSimilarities, which is the in-memory item similarity you asked for. Hope that helps, Sebastian On 02/24/2014 10:04 PM, Juan José Ramos wrote: Correct me if I'm wrong, but isn't the ItemSimilarityJob meant to be for item-based CF? In particular, in the documentation I can read that: Preferences in the input file should look like userID,itemID[,preferencevalue] And in my case the input I have is just text documents and I want to pre-compute similarities between them beforehand, even before any user has expressed any preference value for any item. In order to use ItemSimilarityJob for this purpose, what should be the input I need to provide? Would it be the output of seq2sparse? Thanks again. On Mon, Feb 24, 2014 at 8:54 PM, Sebastian Schelter s...@apache.org wrote: You're right, my bad. If you don't use RowSimilarityJob directly, but org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob (which calls RowSimilarityJob under the covers), your output will be a text file that is directly usable with FileItemSimilarity. --sebastian On 02/24/2014 09:30 PM, Juan José Ramos wrote: Thanks for the prompt reply. RowSimilarityJob produces an output in the form of: Key: 0: Value: {61112:0.21139380179557016,52144:0.23797846026935565,...} whereas FileItemSimilarity is expecting comma- or tab-separated input. I assume that you meant that the output of RowSimilarityJob can be loaded by the FileItemSimilarity after doing the appropriate parsing. Is that correct, or is there actually a way to load the raw output of RowSimilarityJob into FileItemSimilarity? Thanks. On Mon, Feb 24, 2014 at 7:41 PM, Sebastian Schelter s...@apache.org wrote: The output of RowSimilarityJob can be loaded by the FileItemSimilarity. --sebastian On 02/24/2014 08:31 PM, Juan José Ramos wrote: Is there a way to reproduce this process: https://cwiki.apache.org/confluence/display/MAHOUT/Quick+tour+of+text+analysis+using+the+Mahout+command+line inside Java code and not using the command line tool? I am not interested in the clustering part but in 'Calculate several similar docs to each doc in the data'. In particular, I am interested in loading the output of the rowsimilarity tool into memory to be used as my custom ItemSimilarity implementation for an ItemBasedRecommender. What I exactly want is to have a matrix in memory where for every doc in my catalogue I have the similarity with the 100 (that is the threshold I am using) most similar items and an undefined similarity for the rest. Is it possible to do this with the Java API? I know it can be done by calling the commands from inside the Java code, and I guess also by using the corresponding SparseVectorsFromSequenceFiles, DistributedRowMatrix and RowSimilarityJob classes. But I still cannot see an easy way of parsing the output of RowSimilarityJob into the memory representation I intend to use. Thanks a lot.
Re: Mahout on Spark?
Completely agree with Sean's statement. On 02/19/2014 01:52 PM, Sean Owen wrote: To set expectations appropriately, I think it's important to point out this is completely infeasible short of a total rewrite, and I can't imagine that will happen. It may not be obvious if you haven't looked at the code how completely dependent on M/R it is. You can swap out M/R for Spark if you write in terms of something like Crunch, but that is not at all the case here. On Wed, Feb 19, 2014 at 12:43 PM, Jay Vyas jayunit...@gmail.com wrote: +100 for this, different execution engines, like the direction pig and crunch take Sent from my iPhone On Feb 19, 2014, at 5:19 AM, Gokhan Capan gkhn...@gmail.com wrote: I imagine Mahout offering an option to the users to select from different execution engines (just like we currently do by giving M/R or sequential options), and starting from Spark. I am not sure what changes are needed in the codebase, though. Maybe following MLI (or alike) and implementing some more stuff, such as common interfaces for iterating over data (the M/R way and the Spark way). IMO, another effort might be porting pre-online machine learning (such as transforming text into vectors based on the dictionary generated by seq2sparse), machine learning based on mini-batches, and the streaming summarization stuff in Mahout to Spark-Streaming. Best, Gokhan On Wed, Feb 19, 2014 at 10:45 AM, Dmitriy Lyubimov dlie...@gmail.com wrote: PS I am moving along a cost optimizer for Spark-backed DRMs on some multiplicative pipelines that is capable of figuring out different cost-based rewrites, and an R-like DSL that mixes in-core and distributed matrix representations and blocks, but it is painfully slow; I am really only doing it a couple of nights a month. It does not look like I will be doing it on company time any time soon (and even if I did, the company doesn't seem to be inclined to contribute anything new I do on their time). It is all painfully slow; there's no direct funding for it anywhere with no strings attached. That will probably be the primary reason why Mahout would not be able to get much traction compared to university-based contributions. On Wed, Feb 19, 2014 at 12:27 AM, Dmitriy Lyubimov dlie...@gmail.com wrote: Unfortunately methinks the prospects of something like a Mahout/MLlib merge seem very unlikely due to vastly diverged approaches to the basics of linear algebra (and other things). Just like one cannot grow a single tree out of two trunks -- not easily, anyway. It is fairly easy to port (and subsequently beat) MLlib at this point from a collection-of-algorithms point of view. But IMO the goal should be more MLI-like first, and a port second. And be very careful with concepts. Something that I so far don't see happening with MLlib. MLlib seems to be an old-style Mahout-like rush to become a collection of basic algorithms rather than a coherent foundation. Admittedly, I haven't looked very closely. On Tue, Feb 18, 2014 at 11:41 PM, Sebastian Schelter s...@apache.org wrote: I'm also convinced that Spark is a superior platform for executing distributed ML algorithms. We've had a discussion about a change from Hadoop to another platform some time ago, but at that point in time it was not clear which of the upcoming dataflow processing systems (Spark, Hyracks, Stratosphere) would establish itself amongst the users. To me it seems pretty obvious that Spark has won the race. I concur with Ted, it would be great to have the communities work together. 
I know that at least 4 Mahout committers (including me) are already following Spark's mailing list and actively participating in the discussions. What are the ideas for how a fruitful cooperation could look? Best, Sebastian PS: I ported LLR-based cooccurrence analysis (aka item-based recommendation) to Spark some time ago, but I haven't had time to test my code on a large dataset yet. I'd be happy to see someone help with that. On 02/19/2014 08:04 AM, Nick Pentreath wrote: I know the Spark/MLlib devs can occasionally be quite set in their ways of doing certain things, but we'd welcome as many Mahout devs as possible to work together. It may be too late, but perhaps a GSoC project to look at a port of some stuff like the co-occurrence recommender and streaming k-means? N -- Sent from Mailbox for iPhone On Wed, Feb 19, 2014 at 3:02 AM, Ted Dunning ted.dunn...@gmail.com wrote: On Tue, Feb 18, 2014 at 1:58 PM, Nick Pentreath nick.pentre...@gmail.com wrote: My (admittedly heavily biased) view is Spark is a superior platform overall for ML. If the two communities can work together to leverage the strengths of Spark, and the large amount of good stuff in Mahout (as well as the fantastic depth of experience of Mahout devs), I think a lot can be achieved! It makes a lot of sense that Spark would be better than Hadoop for ML purposes given that Hadoop was intended to do web-crawl kinds of things and Spark was intentionally built to support machine learning.
Re: Mahout on Spark?
I'm also convinced that Spark is a superior platform for executing distributed ML algorithms. We've had a discussion about a change from Hadoop to another platform some time ago, but at that point in time it was not clear which of the upcoming dataflow processing systems (Spark, Hyracks, Stratosphere) would establish itself amongst the users. To me it seems pretty obvious that Spark has won the race. I concur with Ted, it would be great to have the communities work together. I know that at least 4 Mahout committers (including me) are already following Spark's mailing list and actively participating in the discussions. What are the ideas for how a fruitful cooperation could look? Best, Sebastian PS: I ported LLR-based cooccurrence analysis (aka item-based recommendation) to Spark some time ago, but I haven't had time to test my code on a large dataset yet. I'd be happy to see someone help with that. On 02/19/2014 08:04 AM, Nick Pentreath wrote: I know the Spark/MLlib devs can occasionally be quite set in their ways of doing certain things, but we'd welcome as many Mahout devs as possible to work together. It may be too late, but perhaps a GSoC project to look at a port of some stuff like the co-occurrence recommender and streaming k-means? N — Sent from Mailbox for iPhone On Wed, Feb 19, 2014 at 3:02 AM, Ted Dunning ted.dunn...@gmail.com wrote: On Tue, Feb 18, 2014 at 1:58 PM, Nick Pentreath nick.pentre...@gmail.com wrote: My (admittedly heavily biased) view is Spark is a superior platform overall for ML. If the two communities can work together to leverage the strengths of Spark, and the large amount of good stuff in Mahout (as well as the fantastic depth of experience of Mahout devs), I think a lot can be achieved! It makes a lot of sense that Spark would be better than Hadoop for ML purposes given that Hadoop was intended to do web-crawl kinds of things and Spark was intentionally built to support machine learning. Given that Spark has been announced by a majority of the Hadoop-based distribution vendors, it makes sense that maybe Mahout should jump in. I really would prefer it if the two communities (MLlib/MLI and Mahout) could work more closely together. There is a lot of good to be had on both sides.
Re: get similar items
Hi, Mahout's recommenders are based on analyzing interactions between users and items/movies, e.g. ratings or counts of how often the movie was watched. On 02/12/2014 11:34 AM, N! wrote: Hi all: Does anyone have any suggestions for the questions below? Thanks a lot. -- Original -- Sender: N!12481...@qq.com; Send time: Wednesday, Feb 12, 2014 6:17 PM To: user@mahout.apache.org; Subject: Re: get similar items Hi Sean: Thanks for the reply. Assume I have only one table named 'movie' with 1000+ records; this table has three columns: 'id', 'movieName', 'movieDescription'. Can Mahout calculate the most similar movies for a movie (based only on the 'movie' table)? Code like: List mostSimilarMovieList = recommender.mostSimilar(int movieId). If not, do you have any suggestions for this scenario?
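So with only names and descriptions there is nothing for the collaborative filtering code to work on. If you can log even simple watch events, though, the call being asked for exists almost literally; a sketch with an assumed interaction file and a made-up movie id (exception handling omitted):

import java.io.File;
import java.util.List;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.recommender.GenericItemBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.LogLikelihoodSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;

DataModel model = new FileDataModel(new File("watches.csv"));  // lines: userID,movieID[,rating]
GenericItemBasedRecommender recommender =
    new GenericItemBasedRecommender(model, new LogLikelihoodSimilarity(model));
List<RecommendedItem> similar = recommender.mostSimilarItems(42L, 10);  // 10 movies most like movie 42

For similarity computed purely from the movieDescription text, the seq2sparse/rowsimilarity route discussed elsewhere on this list is the better fit.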
Re: Mahout algorithms
That is outdated unfortunately. I will send a list of current algorithms shortly. --sebastian On 02/05/2014 11:13 AM, Chameera Wijebandara wrote: Hi Sergey, This will help: https://cwiki.apache.org/confluence/display/MAHOUT/Algorithms Thanks, Chameera On Wed, Feb 5, 2014 at 3:30 PM, Sergey Svinarchuk ssvinarc...@hortonworks.com wrote: Hi, Where can I see all the algorithms included in Mahout 0.9, and documentation for these algorithms? Thanks, Sergey!
Re: Mahout algorithms
Hi Sergey, here is the list of algorithms. We're currently in the process of reworking our wiki, that's why the documentation is unfortunately incorrect at the moment. I've added a ticket for this: https://issues.apache.org/jira/browse/MAHOUT-1413

Here's the current list of algorithms in Mahout 0.9:

Recommenders (non-distributed):
- user-based collaborative filtering
- item-based collaborative filtering
- latent-factor models (SGD, SVD++, ALS)

Recommenders (distributed):
- item-based collaborative filtering
- latent-factor models (ALS)

Classification (non-distributed):
- logistic regression solved with SGD
- Multilayer Perceptron
- Hidden Markov Models

Classification (distributed):
- Naive Bayes
- Random Forests

Clustering (distributed):
- Canopy
- k-Means
- streaming k-Means
- fuzzy k-Means
- spectral k-Means

Topic Models (distributed):
- Latent Dirichlet Allocation

Frequent Pattern Mining (distributed)

Math (distributed):
- SVD using the Lanczos algorithm
- Stochastic SVD

Hope that helps. Best, Sebastian On 02/05/2014 11:00 AM, Sergey Svinarchuk wrote: Hi, Where can I see all the algorithms included in Mahout 0.9, and documentation for these algorithms? Thanks, Sergey!
Re: SGD classifier demo app
Would be great to add this as an example to Mahout's codebase. On 02/04/2014 10:27 AM, Ted Dunning wrote: Frank, I just munched on your code and sent a pull request. In doing this, I made a bunch of changes. Hope you like them. These include massive simplification of the reading and vectorization. This wasn't strictly necessary, but it seemed like a good idea. More important was the way that I changed the vectorization. For the continuous values, I added log transforms. For the categorical values, I encoded them as they are. I also increased the feature vector size to 100 to avoid excessive collisions. In the learning code itself, I got rid of the use of index arrays in favor of shuffling the training data itself. I also tuned the learning parameters a lot. The result is that the AUC that results is just a tiny bit less than 0.9, which is pretty close to what I got in R. For everybody else, see https://github.com/tdunning/mahout-sgd-bank-marketing for my version and https://github.com/tdunning/mahout-sgd-bank-marketing/compare/frankscholten:master...master for my pull request. On Mon, Feb 3, 2014 at 3:57 PM, Ted Dunning ted.dunn...@gmail.com wrote: Johannes, Very good comments. Frank, As a benchmark, I just spent a few minutes building a logistic regression model using R. For this model AUC on 10% held-out data is about 0.9. Here is a gist summarizing the results: https://gist.github.com/tdunning/8794734 On Mon, Feb 3, 2014 at 2:41 PM, Johannes Schulte johannes.schu...@gmail.com wrote: Hi Frank, you are using the feature vector encoders which hash a combination of feature name and feature value to 2 (default) locations in the vector. The vector size you configured is 11 and this is imo very small compared to the possible combinations of values you have for your data (education, marital, campaign). You can do no harm by using a much bigger cardinality (try 1000). Second, you are using a continuous value encoder, passing in the weight you are using as a string (e.g. variable pDays). I am not quite sure about the reasons in the Mahout code right now, but the way it is implemented now, every unique value should end up in a different location because the continuous value is part of the hashing. Try adding the weight directly using a static word value encoder, addToVector(pDays,v,pDays) Last, you are also putting in the variable campaign as a continuous variable, which should probably be a categorical variable, so just add it with a StaticWordValueEncoder. And finally, and probably most important after looking at your target variable: you are using a Dictionary for mapping either yes or no to 0 or 1. This is bad. Depending on what comes first in the data set, either a positive or negative example might be 0 or 1, totally random. Make a hard mapping from the possible values (y/n?) to zero and one, with yes as 1 and no as 0. On Mon, Feb 3, 2014 at 9:33 PM, Frank Scholten fr...@frankscholten.nl wrote: Hi all, I am exploring Mahout's SGD classifier and would like some feedback because I think I didn't properly configure things. I created an example app that trains an SGD classifier on the 'bank marketing' dataset from UCI: http://archive.ics.uci.edu/ml/datasets/Bank+Marketing My app is at: https://github.com/frankscholten/mahout-sgd-bank-marketing The app reads a CSV file of telephone calls, encodes the features into a vector and tries to predict whether a customer answers yes to a business proposal. I do a few runs and measure accuracy but I don't trust the results. 
When I only use an intercept term as a feature I get around 88% accuracy, and when I add all features it drops to around 85%. Is this perhaps because the dataset is highly unbalanced? Most customers answer no. Or is the classifier biased to predict 0 as the target code when it doesn't have any data to go with? Any other comments about my code or improvements I can make in the app are welcome! :) Cheers, Frank
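For later readers, a condensed sketch of the encoding Johannes recommends; the dimension, the sample values and the yes/no mapping are illustrative, and ConstantValueEncoder is used here as the usual way to add a continuous value as a weight:

import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.vectorizer.encoders.ConstantValueEncoder;
import org.apache.mahout.vectorizer.encoders.StaticWordValueEncoder;

Vector v = new RandomAccessSparseVector(1000);  // much larger than 11, to limit hash collisions

ConstantValueEncoder intercept = new ConstantValueEncoder("intercept");
ConstantValueEncoder pDaysEncoder = new ConstantValueEncoder("pDays");
StaticWordValueEncoder educationEncoder = new StaticWordValueEncoder("education");
StaticWordValueEncoder campaignEncoder = new StaticWordValueEncoder("campaign");

intercept.addToVector("", 1.0, v);              // bias term
pDaysEncoder.addToVector("", 42.0, v);          // continuous value enters as the weight, not via hashing
educationEncoder.addToVector("tertiary", v);    // categorical: the word itself is hashed
campaignEncoder.addToVector("3", v);            // treat campaign as categorical, too

String label = "yes";                           // would come from the CSV in the real app
int target = "yes".equals(label) ? 1 : 0;       // fixed mapping instead of an order-dependent Dictionary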
Re: Mahout 0.9 Release
Hi Suneel, That's great news, thank you for driving this release! On 02/02/2014 10:22 PM, Suneel Marthi wrote: Mahout 0.9 has been pushed to the mirrors and is available for download at http://www.apache.org/dyn/closer.cgi/mahout/ On Friday, January 31, 2014 11:21 PM, Suneel Marthi suneel_mar...@yahoo.com wrote: The release has passed with the required votes from the PMC; will be pushing 0.9 to the mirrors and updating the release notes over the next day or two. On Thursday, January 30, 2014 2:16 AM, Stevo Slavić ssla...@gmail.com wrote: +1 On Wed, Jan 29, 2014 at 10:56 PM, Shannon Quinn squ...@gatech.edu wrote: LGTM On 1/29/14, 4:27 PM, peng wrote: +1, can't see a bad side. On Wed 29 Jan 2014 11:33:02 AM EST, Suneel Marthi wrote: +1 from me On Wednesday, January 29, 2014 8:58 AM, Sebastian Schelter s...@apache.org wrote: +1 On 01/29/2014 05:25 AM, Andrew Musselman wrote: Looks good. +1 On Tue, Jan 28, 2014 at 8:07 PM, Andrew Palumbo ap@outlook.com wrote: a), b), c), d) all passed here. CosineDistance of clustered points from cluster-reuters.sh -1 kmeans were within the range [0,1]. Date: Tue, 28 Jan 2014 16:45:42 -0800 From: suneel_mar...@yahoo.com Subject: Mahout 0.9 Release To: user@mahout.apache.org; d...@mahout.apache.org Fixed the issues that were reported with the clustering code this past week, and upgraded the codebase to Lucene 4.6.1, which was released today. Here's the URL for the 0.9 release in staging:- https://repository.apache.org/content/repositories/orgapachemahout-1004/org/apache/mahout/mahout-distribution/0.9/ The artifacts have been signed with the following key: https://people.apache.org/keys/committer/smarthi.asc Please:- a) Verify that you can unpack the release (tar or zip) b) Verify you are able to compile the distro c) Run through the unit tests: mvn clean test d) Run the example scripts under $MAHOUT_HOME/examples/bin. Please run through all the different options in each script. Need a minimum of 3 '+1' votes from the PMC for the release to be finalized.
Re: generic latent variable recommender question
Case 1 is fine as is. For Case 2 I would suggest simply experimenting: try different similarity measures like euclidean distance or cosine and see what gives the best results. --sebastian On 01/25/2014 04:08 AM, Koobas wrote: A generic latent variable recommender question. I passed the user-item matrix through a low-rank approximation, with either something like ALS or SVD, and now I have the feature vectors for all users and all items. Case 1: I want to recommend items to a user. I compute the dot product of the user’s feature vector with the feature vectors of all the items. I eliminate the ones that the user already has, and find the largest value among the others, right? Case 2: I want to find similar items for an item. Should I compute the dot product of the item’s feature vector against the feature vectors of all the other items? OR Should I compute the ANGLE between each pair of feature vectors? I.e., compute the cosine similarity? I.e., normalize the vectors before computing the dot products? If “yes” for case 2, is that something I should also do for case 1?
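In Mahout vector terms the two cases differ only by a normalization; a tiny sketch with made-up feature vectors:

import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.Vector;

Vector userVec = new DenseVector(new double[] {0.3, 1.2, -0.4});
Vector itemA = new DenseVector(new double[] {0.9, 0.1, 0.7});
Vector itemB = new DenseVector(new double[] {1.0, 0.0, 0.5});

// Case 1: predicted preference is the plain dot product
double score = userVec.dot(itemA);

// Case 2: cosine similarity is the dot product of the normalized vectors
double cosine = itemA.dot(itemB) / (itemA.norm(2) * itemB.norm(2));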
Re: Pig local mode issue
I think this question is better suited for the mailing list of the Pig project. On 01/23/2014 01:24 AM, Sameer Tilak wrote: Hi All, My script runs fine in map reduce mode, but I get the following error when I run it in local mode. I have made sure that the input file exists. I am not sure why map reduce is coming into the picture when it is local mode. pig -x local myscript.pig
2014-01-22 16:14:02,771 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler - File concatenation threshold: 100 optimistic? false
2014-01-22 16:14:02,805 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size before optimization: 3
2014-01-22 16:14:02,806 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - Merged 2 map-only splittees.
2014-01-22 16:14:02,806 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - Merged 2 out of total 3 MR operators.
2014-01-22 16:14:02,806 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size after optimization: 1
2014-01-22 16:14:02,845 [main] INFO org.apache.pig.tools.pigstats.ScriptState - Pig script settings are added to the job
2014-01-22 16:14:02,865 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - mapred.job.reduce.markreset.buffer.percent is not set, set to default 0.3
2014-01-22 16:14:02,876 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Using reducer estimator: org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.InputSizeReducerEstimator
2014-01-22 16:14:02,878 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.InputSizeReducerEstimator - BytesPerReducer=10 maxReducers=999 totalInputFileSize=9940865
2014-01-22 16:14:02,878 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Setting Parallelism to 1
2014-01-22 16:14:02,909 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Setting up multi store job
2014-01-22 16:14:02,918 [main] INFO org.apache.pig.data.SchemaTupleFrontend - Key [pig.schematuple] is false, will not generate code.
2014-01-22 16:14:02,918 [main] INFO org.apache.pig.data.SchemaTupleFrontend - Starting process to move generated code to distributed cache
2014-01-22 16:14:02,918 [main] INFO org.apache.pig.data.SchemaTupleFrontend - Distributed cache not supported or needed in local mode. Setting key [pig.schematuple.local.dir] with code temp directory: /tmp/1390436042918-0
2014-01-22 16:14:02,978 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 1 map-reduce job(s) waiting for submission.
2014-01-22 16:14:02,991 [JobControl] INFO org.apache.hadoop.util.NativeCodeLoader - Loaded the native-hadoop library
2014-01-22 16:14:02,994 [JobControl] ERROR org.apache.hadoop.security.UserGroupInformation - PriviledgedActionException as:userid cause:ENOENT: No such file or directory
2014-01-22 16:14:03,479 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 0% complete
2014-01-22 16:14:03,489 [main] WARN org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Ooops! Some job has failed! Specify -stop_on_failure if you want Pig to stop immediately on failure.
2014-01-22 16:14:03,489 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - job null has failed! Stop running all dependent jobs
2014-01-22 16:14:03,490 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete
2014-01-22 16:14:03,492 [main] ERROR org.apache.pig.tools.pigstats.SimplePigStats - ERROR 2997: Unable to recreate exception from backend error: ENOENT: No such file or directory
    at org.apache.hadoop.io.nativeio.NativeIO.chmod(Native Method)
    at org.apache.hadoop.fs.FileUtil.execSetPermission(FileUtil.java:699)
    at org.apache.hadoop.fs.FileUtil.setPermission(FileUtil.java:654)
    at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:509)
    at org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:344)
    at org.apache.hadoop.fs.FilterFileSystem.mkdirs(FilterFileSystem.java:189)
    at org.apache.hadoop.mapreduce.JobSubmissionFiles.getStagingDir(JobSubmissionFiles.java:116)
    at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:856)
    at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:850)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
    at
Re: Problem with ItemSimilarityJob, empty part-r-00000
Hi Quentin, Have you checked the log to ensure that you don't get any exceptions during the computation? Could you test the job with a tiny example where you can calculate the result by hand? Can you share an input file on which this job fails? --sebastian On 01/21/2014 11:22 AM, Quentin-Gabriel Thurier wrote: I've encountered a few troubles with Mahout that I can't sort out. The context is that I'm trying to calculate pairwise euclidean distances between music tracks based on 6 audio features per track. My input for the Mahout job is a text file which looks like this: feature_id,track_id,feature_value (integer,integer,double). This command works locally for fewer than 600 tracks (based on mahout-core-0.7-cdh4.5.0-job.jar): mahout itemsimilarity --input input/msd_sample/mahout --output output/mahout --similarityClassname SIMILARITY_EUCLIDEAN_DISTANCE --booleanData false --maxSimilaritiesPerItem 1 But for more tracks I get an empty file part-r-00000. I tried to decrease the --threshold parameter but I still don't have any result. I also tried to launch the job on AWS EMR with the equivalent input for 3000 tracks (based on mahout-core-0.8-job.jar): org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob --input s3n://hadoop-filrouge/input/msd-sample/mahout --output s3n://hadoop-filrouge/output/mahout/01202014-itemsimilarity --similarityClassname SIMILARITY_EUCLIDEAN_DISTANCE --booleanData false --maxSimilaritiesPerItem 1 The job runs successfully but I get 17 empty part-r-000xx files. I'm totally stuck right now and I'm running out of ideas to fix this issue. So if anybody has even a little idea of what is going on, that could really help. Many thanks,
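To make the tiny-example suggestion concrete, an input small enough to check by hand could look like this (made-up ids and values, in the same userID,itemID,value layout as the failing file):

1,101,5.0
1,102,3.0
2,101,2.0
2,102,2.5
2,103,4.0
3,103,1.0

mahout itemsimilarity --input tiny.csv --output tiny-out --similarityClassname SIMILARITY_EUCLIDEAN_DISTANCE --booleanData false --maxSimilaritiesPerItem 10

With only three items there are just three pairwise similarities to verify by hand; if part-r-00000 is empty even for this input, the problem is in the setup rather than in the data or the threshold.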