Re: [ANNOUNCE] Andrew Musselman, New Mahout PMC Chair

2018-07-19 Thread Sebastian Schelter
Congrats!

2018-07-19 9:31 GMT+02:00 Peng Zhang :

> Congrats Andrew!
>
> On Thu, Jul 19, 2018 at 04:01 Andrew Musselman
> wrote:
>
> > Thanks Andy, looking forward to it! Thank you too for your support and
> > dedication the past two years; here's to continued progress!
> >
> > Best
> > Andrew
> >
> > On Wed, Jul 18, 2018 at 1:30 PM, Andrew Palumbo 
> > wrote:
> > > Please join me in congratulating Andrew Musselman as the new Chair of
> > > the
> > > Apache Mahout Project Management Committee. I would like to thank
> > > Andrew
> > > for stepping up; all of us who have worked with him over the years
> > > know his
> > > dedication to the project to be invaluable.  I look forward to Andrew
> > > taking the project into the future.
> > >
> > > Thank you,
> > >
> > > Andy
> >
>


Re: Does mahout 0.5 fit hadoop-0.20.2?

2014-06-25 Thread Sebastian Schelter
Please use a recent version of Mahout; 0.4 and 0.5 are totally outdated.
-s

On 06/25/2014 09:05 AM, seabiscuit08 wrote:
 Hi everyone, I am new to Mahout.
 Our Hadoop cluster is hadoop-0.20.2. I tried the mahout-distribution-0.4 LDA 
 function, and it works well, but it can't infer topics for new documents once 
 the LDA estimation is over.
 I heard Mahout 0.5 has this ability, but when I try it, it can't even create a 
 sequence file on my HDFS.
 Any help is appreciated!!!
 
 
 
 
 seabiscuit08
 



Re: divide a vector (sum) by a double, error

2014-06-16 Thread Sebastian Schelter
It's also not a good idea to put the vectors into a HashSet; I don't think
we have equals and hashCode correctly implemented for that.
On 16.06.2014 18:21, Ted Dunning ted.dunn...@gmail.com wrote:

 Patrice,

 This sounds like a classpath problem more than code error.  Are you sure
 that you can run any program that use Mahout?  Do you perhaps have two
 versions of Mahout floating around?

 Regarding the code, this is a more compact idiom for the same thing:

 Matrix m = ...;
 Vector centroid = m.aggregateColumns(new VectorFunction() {
   @Override
   public double apply(Vector f) {
 return f.zSum() / f.size();
   }
 });

 This uses a matrix as a container for vectors rather than a set of Vectors.
  If you really want to use a set, then your iteration based approach should
 be fine.

 In your code, you could also be much tighter.  For instance, the last three
 lines could simply be:

 return sum.divide(vectors.size());

 None of the stuff with the Integer or casting is necessary.
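
For reference, here is a self-contained version of that aggregateColumns idiom (a sketch assuming only mahout-math on the classpath; the sample values are made up):

import org.apache.mahout.math.DenseMatrix;
import org.apache.mahout.math.Matrix;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.function.VectorFunction;

public class CentroidExample {
  public static void main(String[] args) {
    // each row is a data point; made-up values for illustration
    Matrix m = new DenseMatrix(new double[][] {{1, 2}, {3, 4}, {5, 6}});
    // aggregateColumns applies the function to every column,
    // so each coordinate of the centroid is the mean of one column
    Vector centroid = m.aggregateColumns(new VectorFunction() {
      @Override
      public double apply(Vector f) {
        return f.zSum() / f.size();
      }
    });
    System.out.println(centroid); // the centroid (3.0, 4.0)
  }
}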



 On Mon, Jun 16, 2014 at 9:01 AM, Patrice Seyed apse...@gmail.com wrote:

  Hi all,
 
 
  I have attempted to write a method centroid() that
 
  1) sums a HashSet of org.apache.mahout.math.Vector (vectors that are
  DenseVector), and
  2) (org.apache.mahout.math.Vector.divide) divides the summed vector by
  its size, as a double.
 
  I get an error:
 
  Exception in thread "main" java.lang.IncompatibleClassChangeError:
  class org.apache.mahout.math.function.Functions$1 has interface
  org.apache.mahout.math.function.DoubleFunction as super class
 
  I've tried this with a set of DenseVector and
  SequentialAccessSparseVector with the same result.
 
  Any help appreciated, the actual method is below.
  I noticed a class Centroid in the mahout distribution, but seems to
  cover a different sense of centroid than that I'm implementing here.
 
  Thanks,
  Patrice
 
 
  public Vector centroid(HashSet<Vector> vectors) {

      Iterator<Vector> it = vectors.iterator();

      Vector sum = it.next();

      while (it.hasNext()) {
          Vector aVector = it.next();
          sum = sum.plus(aVector);
          System.out.println(sum.toString());
      }

      Integer totalVectors = vectors.size();
      double dlTotalVectors = totalVectors.doubleValue();
      return sum.divide(dlTotalVectors);
  }
 



Re: Performance issues in Mahout recommendations

2014-06-06 Thread Sebastian Schelter
You should not use Hadoop for such a tiny dataset. Use the 
GenericItemBasedRecommender on a single machine in Java.


--sebastian
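
For a dataset of this size, a single-machine recommender along those lines responds in milliseconds. A minimal sketch (the file name and user ID are placeholders; the CSV has the same userID,itemID[,preference] layout the Hadoop job consumes):

import java.io.File;

import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.recommender.GenericItemBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.LogLikelihoodSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.similarity.ItemSimilarity;

public class SingleMachineRecommender {
  public static void main(String[] args) throws Exception {
    DataModel model = new FileDataModel(new File("ratings.csv"));
    // same similarity the command line above uses
    ItemSimilarity similarity = new LogLikelihoodSimilarity(model);
    GenericItemBasedRecommender recommender =
        new GenericItemBasedRecommender(model, similarity);
    // top-3 recommendations for a placeholder user ID
    System.out.println(recommender.recommend(123L, 3));
  }
}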

On 06/06/2014 11:10 AM, Warunika Ranaweera wrote:

Hi,

I am using Mahout's recommenditembased algorithm on a data set with nearly
10,000 (implicit) user ratings. This is the command I used:
*mahout recommenditembased --input ratings.csv --output recommendation
--usersFile users.dat --tempDir temp --similarityClassname
SIMILARITY_LOGLIKELIHOOD --numRecommendations 3 *

Although the output is successfully generated, this process takes nearly 7
minutes to produce recommendations for a single user. The Hadoop cluster
has 8 nodes and the machine on which Mahout is invoked is an AWS EC2
c3.2xlarge server. When I tracked the mapreduce jobs, I noticed that more
than one machine is *not* utilized at a time, and the *recommenditembased*
command takes 9 mapreduce jobs altogether with approx. 45 seconds taken per
job.

Since the performance is too slow for real time recommendations, it would
be really helpful to know whether I'm missing out any additional commands
or configurations that enables faster performance.

Thanks,
Warunika





Re: Performance issues in Mahout recommendations

2014-06-06 Thread Sebastian Schelter
1M ratings take up something like 20 megabytes. This is a data size where 
it does not make any sense to use Hadoop. Just try the single machine 
implementation.


--sebastian



On 06/06/2014 12:01 PM, Warunika Ranaweera wrote:

Hi Sebastian,

Thanks for your prompt response. It's just a sample data set from our
database and it may expand up to 6 million ratings. Since the performance
was low for a smaller data set, I thought it would be even worse for a
larger data set. As per your suggestion, I also applied the same command on
1 million user ratings for approx. 6000 users and got the same performance
level.

What is the average running time for the Mahout distributed recommendation
job on 1 million ratings? Does it usually take more than 1 minute?

Thanks in advance,
Warunika












Re: Performance issues in Mahout recommendations

2014-06-06 Thread Sebastian Schelter

Mahout has single machine and distributed recommenders.


On 06/06/2014 02:31 PM, Warunika Ranaweera wrote:

I agree with your suggestion though. I have already implemented a Java
recommender and it performed better. But, due to scalability problems that
are predicted to occur in the future, we thought of moving to Mahout.
However, it seems like, for now, it's better to go with the single machine
implementation.

Thanks for your suggestions,
Warunika


















Re: Indicator Matrix and Mahout + Solr recommender

2014-05-27 Thread Sebastian Schelter
I have added the threshold merely as a way to increase the performance 
of RowSimilarityJob. If a threshold is given, some item pairs don't need 
to be looked at. A simple example: if you use cooccurrence count as the 
similarity measure and set a threshold of n cooccurrences, then any 
pair containing an item with fewer than n interactions can be ignored. 
IIRC similar techniques are implemented for cosine and Jaccard.


Best,
Sebastian
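
A toy illustration of that pruning (hypothetical names, not the RowSimilarityJob code itself): since cooccurrence(a, b) <= min(count(a), count(b)), any item whose own interaction count is below the threshold can be skipped outright.

import java.util.HashMap;
import java.util.Map;

public class ThresholdPruningSketch {

  // interactionCount: itemID -> number of users who interacted with the item
  static boolean canSkipAllPairsOf(long item, Map<Long, Integer> interactionCount,
                                   int threshold) {
    // no pair containing this item can reach the threshold
    return interactionCount.getOrDefault(item, 0) < threshold;
  }

  public static void main(String[] args) {
    Map<Long, Integer> counts = new HashMap<>();
    counts.put(1L, 2);   // rarely seen item
    counts.put(2L, 500); // popular item
    System.out.println(canSkipAllPairsOf(1L, counts, 10)); // true
    System.out.println(canSkipAllPairsOf(2L, counts, 10)); // false
  }
}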



On 05/27/2014 07:08 PM, Pat Ferrel wrote:


On May 27, 2014, at 8:15 AM, Ted Dunning ted.dunn...@gmail.com wrote:

The threshold should not normally be used in the Mahout+Solr deployment
style.


Understood and that’s why an alternative way of specifying a cutoff may be a 
good idea.



This need is better supported by specifying the maximum number of
indicators. This is mathematically equivalent to specifying a fraction of
values, but is more meaningful to users since good values for this number
are pretty consistent across different uses (50-100 are reasonable values
for most needs; larger values are quite plausible).


Assume you mean 50-100 as the average number per item.

The total for the entire indicator matrix is what Ken was asking for. But I was 
thinking about the use with itemsimilarity where the user may not know the 
dimensionality since itemsimilarity assembles the matrix from individual prefs. 
The user probably knows the number of items in their catalog but the indicator 
matrix dimensionality is arbitrarily smaller.

Currently the help reads:
--maxSimilaritiesPerItem (-m) maxSimilaritiesPerItem    try to cap the number 
of similar items per item to this number (default: 100)

If this were actually the average # per item it would do what you describe, but 
it looks like it's a literal cutoff per vector in the code.

A cutoff based on the highest scores in the entire matrix seems to imply a sort 
when the total is larger than the average would allow and I don’t see an 
obvious sort being done in the MR.

Anyway, it looks like we could do this by
1) total number of values in the matrix (what Ken was asking for) This requires 
that the user know the dimensionality of the indicator matrix to be very useful.
2) average number per item (what Ted describes) This seems the most intuitive 
and does not require the dimensionality be known
3) fraction of the values. This might be useful if you are more interested in 
downsampling by score; at least it seems more useful than --threshold as it is 
today, but maybe I'm missing some use cases? Is there really a need for a hard 
score threshold?





On Tue, May 27, 2014 at 8:08 AM, Pat Ferrel pat.fer...@gmail.com wrote:


I was talking with Ken Krugler off list about the Mahout + Solr
recommender and he had an interesting request.

When calculating the indicator/item similarity matrix using
ItemSimilarityJob there is a  --threshold option. Wouldn’t it be better to
have an option that specified the fraction of values kept in the entire
matrix based on their similarity strength? This is very difficult to do
with --threshold. It would be like expressing the threshold as a fraction
of total number of values rather than a strength value. Seems like this
would have the effect of tossing the least interesting similarities where
limiting per item (--maxSimilaritiesPerItem) could easily toss some of the
most interesting.

At very least it seems like a better way of expressing the threshold,
doesn’t it?






Re: Theory behind LogisticRegression in Mahout

2014-05-23 Thread Sebastian Schelter

We should add these links to the LR page on the website.

--s
On 05/23/2014 03:20 PM, Ted Dunning wrote:

Ahh... my error then.

Happily, Dmitriy and others have provided the requisite links.


On Thu, May 22, 2014 at 11:50 PM, namit maheshwari 
namitmaheshwa...@gmail.com wrote:


No, I didn't find any links in the comments.


On Fri, May 23, 2014 at 2:44 AM, Ted Dunning ted.dunn...@gmail.com
wrote:



I thought that there were links in comments to documentation.

Are there not?

Sent from my iPhone


On May 22, 2014, at 2:29, namit maheshwari namitmaheshwa...@gmail.com


wrote:


Hello Everyone,

Could anyone please let me know the algorithm used behind
LogisticRegression in Mahout. Also, AdaptiveLogisticRegression mentions
an *annealing* schedule.

I would be grateful if someone could guide me towards the theory behind it.


Thanks
Namit










Re: Setting mahout heapsize for rowsimilarity job

2014-05-23 Thread Sebastian Schelter
I don't think you should use RowSimilarityJob for that case if you 
only have 6 columns.


Can you tell us a little bit about the data and what problem you are 
trying to solve?


--sebastian


On 05/23/2014 09:03 PM, Suneel Marthi wrote:

I had seen this issue too with RSJ until 0.8. Switch to using Mahout 0.9;
downsampling was introduced in RSJ, which should avoid this error.


On Fri, May 23, 2014 at 2:59 PM, Mohit Singh mohit1...@gmail.com wrote:


Hi,
I have a 1M x 6 dimensional matrix stored as a sequence file and I am
trying to use RowSimilarityJob for it.
But when I try to run the job, I see a Java heap space error for the second
step (RowSimilarityJob-CooccurrencesMapper-Reducer).
My raw sequence file is around 700 MB, and I have already set
MAHOUT_OPTS to (say) 7 GB, but I am still seeing that error.
My command line args are:

hadoop jar /usr/lib/mahout/mahout-examples-0.8-cdh5.0.0-job.jar
org.apache.mahout.math.hadoop.similarity.cooccurrence.RowSimilarityJob -i
$INPUT -o $OUTPUT *-r 6 *-s SIMILARITY_COSINE -m 15 --tempDir $TEMP -ess

Also, is this -r a typo? The help file says that this is the column length.
Is it the column or the row dimension?

Thanks

--
Mohit

When you want success as badly as you want the air, then you will get it.
There is no other secret of success.
-Socrates







Re: Mahout recommendation in implicit feedback situation

2014-05-05 Thread Sebastian Schelter

Alessandro,

which version of Mahout are you using? I had a look at the current 
implementation of GenericBooleanPrefUserBasedRecommender and its
doEstimatePreference method returns the sum of similarities of users 
that have also interacted with the item. So that should be different 
from either 0 or 1.


--sebastian
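
Roughly, the estimate described above behaves like the following sketch (illustrative only, not the actual GenericBooleanPrefUserBasedRecommender source):

import org.apache.mahout.cf.taste.common.TasteException;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

class BooleanPrefEstimateSketch {
  static double estimate(DataModel model, UserNeighborhood neighborhood,
                         UserSimilarity similarity, long userID, long itemID)
      throws TasteException {
    double sum = 0.0;
    for (long neighbor : neighborhood.getUserNeighborhood(userID)) {
      // only neighbors that also interacted with the item contribute
      if (model.getPreferenceValue(neighbor, itemID) != null) {
        sum += similarity.userSimilarity(userID, neighbor);
      }
    }
    return sum; // an unbounded sum of similarities, not confined to [0, 1]
  }
}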

On 05/03/2014 05:00 PM, Alessandro Suglia wrote:

Sorry Sebastian, maybe you didn't have a chance to read the post on
SO, so I'll report the code here.
I've already used the GenericBooleanPrefUserBasedRecommender in order to
generate the recommendation and the results are the same.

DataModel trainModel = new FileDataModel(new File(String.valueOf(Main.class.getResource("/binarized/u1.base").getFile())));
DataModel testModel = new FileDataModel(new File(String.valueOf(Main.class.getResource("/binarized/u1.test").getFile())));

UserSimilarity similarity = new TanimotoCoefficientSimilarity(trainModel);
UserNeighborhood neighborhood = new NearestNUserNeighborhood(35, similarity, trainModel);

GenericBooleanPrefUserBasedRecommender userBased = new GenericBooleanPrefUserBasedRecommender(trainModel, neighborhood, similarity);

long firstUser = testModel.getUserIDs().nextLong(); // get the first user

// try to recommend items for the first user
for (LongPrimitiveIterator iterItem = testModel.getItemIDsFromUser(firstUser).iterator(); iterItem.hasNext(); ) {
    long currItem = iterItem.nextLong();
    // estimate the preference for the current item for the first user
    System.out.println("Estimated preference for item " + currItem + " is " + userBased.estimatePreference(firstUser, currItem));
}

Can you explain to me where the error in this code is?

Thank you.

On 05/03/14 16:42, Sebastian Schelter wrote:

You should try the

org.apache.mahout.cf.taste.impl.recommender.GenericBooleanPrefUserBasedRecommender


which has been built to handle such data.

Best,
Sebastian
















Re: Fwd: Mahout Naive Bayes CSV Classification

2014-05-04 Thread Sebastian Schelter

Hi Jossef,

You have to vectorize and normalize your data. The input for Naive Bayes 
is a SequenceFile containing a Text object as key (your label) and a 
VectorWritable that holds a vector with the data.


Instructions to run NaiveBayes can be found here:

https://mahout.apache.org/users/classification/bayesian.html

--sebastian
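
As a rough illustration of that input format, something along these lines can turn each CSV row into one SequenceFile entry. The paths, the "/label/" key convention, and the assumption that the class sits in the last column are mine, not a fixed Mahout API, and the normalization step is omitted:

import java.io.BufferedReader;
import java.io.FileReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.VectorWritable;

public class CsvToVectors {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    SequenceFile.Writer writer = new SequenceFile.Writer(
        fs, conf, new Path("train-vectors/part-r-00000"),
        Text.class, VectorWritable.class);
    try (BufferedReader reader = new BufferedReader(new FileReader("train-set.csv"))) {
      String line;
      while ((line = reader.readLine()) != null) {
        String[] cols = line.split(",");
        // all columns except the last one are numeric features
        double[] features = new double[cols.length - 1];
        for (int i = 0; i < features.length; i++) {
          features[i] = Double.parseDouble(cols[i]);
        }
        // the last column is the class label, used as the Text key
        writer.append(new Text("/" + cols[cols.length - 1] + "/"),
            new VectorWritable(new DenseVector(features)));
      }
    }
    writer.close();
  }
}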


On 05/03/2014 07:40 PM, Jossef Harush wrote:

I have these 2 CSV files:

1. train-set.csv
2. test-set.csv

Both of them are in the same structure (with different content) and similar
to this example (http://i.stack.imgur.com/jsckr.png) :


Each column is a feature, and the last column, class, is the name of the
class to predict.


*Can anyone please provide a sample code for:*

1. Initializing Naive Bayes with a CSV file (model creation, training,
required pre-processing, etc...)
2. For a given CSV row - predicting a class

Thanks!

BTW -

I'm using Mahout 0.9 and Hadoop 2.4 and I've already tried to follow these
links:

http://web.archiveorange.com/archive/v/y0uRZw9Q4iHdjrm4Rfsu
http://chimpler.wordpress.com/2013/03/13/using-the-mahout-naive-bayes-classifier-to-automatically-classify-twitter-messages/





Re: Mahout recommendation in implicit feedback situation

2014-05-03 Thread Sebastian Schelter

Hi Alessandro,

what result do you expect and what do you get? Can you give a concrete 
example?


--sebastian

On 05/03/2014 12:11 PM, Alessandro Suglia wrote:

Good morning,
I've tried to create a recommender system using Mahout in an implicit
feedback situation. What I'm trying to do is explained exactly in this
post on stack overflow:
http://stackoverflow.com/questions/23077735/mahout-recommendation-in-implicit-feedback-situation

As you can see, I'm having some problem with it simply because I cannot
get the result that I expect (a value between 0 and 1) when I try to
predict a score for a specific item.

Someone here can help me, please?

Thank you in advance.

Alessandro Suglia





Re: Mahout recommendation in implicit feedback situation

2014-05-03 Thread Sebastian Schelter

You should try the

org.apache.mahout.cf.taste.impl.recommender.GenericBooleanPrefUserBasedRecommender

which has been built to handle such data.

Best,
Sebastian


On 05/03/2014 04:34 PM, Alessandro Suglia wrote:

I have described it in the SO's post:
When I execute this code, the result is a list of 0.0 or 1.0 which are
not useful in the context of top-n recommendation in implicit feedback
context. Simply because I have to obtain, for each item, an estimated
rate which stays in the range [0, 1] in order to rank the list in
decreasing order and construct the top-n recommendation appropriately.









Re: Future of Frequent Pattern Mining

2014-05-01 Thread Sebastian Schelter
I don't think we have to extract the code, people can pull it out of the 
0.9 release's sources, which are in svn.

We have not heard any opposition from production users of this code 
here, nor has anyone stepped up to maintain this code (and we've asked 
for the second time), so let's finish what we already aimed for in the 
0.9 release and remove it.


I'll prepare a patch.

--sebastian

On 04/28/2014 10:52 AM, Ted Dunning wrote:

One thought is to extract the code, publish on github with warnings about
no support.  Then if there are requests, we can point them to the GH
archive and tell them to go for it.




On Mon, Apr 28, 2014 at 10:03 AM, Suneel Marthi smar...@apache.org wrote:


+100 to purging this from the codebase. This stuff uses the old MR API and
would have to be upgraded, not to mention that this was removed from 0.9 and
was restored only because one user wanted it, who promised to maintain it
and has not been heard from.




On Mon, Apr 28, 2014 at 2:19 AM, Sebastian Schelter s...@apache.org
wrote:


Hi,

I'm resending this mail to also include the users list. To wrap up: We
currently have a discussion whether our frequent pattern mining package
should stay in the codebase. The original author suggested to remove the
original implementation and maybe retain the FPGrowth2 implementation.

I'd like to ask our users here for their opinion: is anybody opposed to
removing the frequent pattern mining code from Mahout? Please shout out.

--sebastian









Future of Frequent Pattern Mining

2014-04-28 Thread Sebastian Schelter

Hi,

I'm resending this mail to also include the users list. To wrap up: We 
currently have a discussion whether our frequent pattern mining package 
should stay in the codebase. The original author suggested to remove the 
original implementation and maybe retain the FPGrowth2 implementation.


I'd like to ask our users here for their opinion: is anybody opposed to 
removing the frequent pattern mining code from Mahout? Please shout out.


--sebastian


Re: Future of Frequent Pattern Mining

2014-04-28 Thread Sebastian Schelter

Hi Michael,

the problem is that currently nobody is maintaining the fpgrowth code 
anymore or working on documentation for it; that's why we consider it to 
be a candidate for removal. I don't see much value in keeping algorithms 
in the codebase if nobody is maintaining them, answering questions and 
providing documentation. If someone who has that code in production 
opposes here, that could be a reason to retain it, however.


People wanting to use the code in the future can always download Mahout 
0.9 which has the current implementation.


--sebastian


On 04/28/2014 08:23 AM, Michael Wechner wrote:

what is the alternative? And if one still wanted to use the frequent
pattern mining code in the future, how
would this be possible otherwise?

Thanks

Michael

Am 28.04.14 08:19, schrieb Sebastian Schelter:

Hi,

I'm resending this mail to also include the users list. To wrap up: We
currently have a discussion whether our frequent pattern mining
package should stay in the codebase. The original author suggested to
remove the original implementation and maybe retain the FPGrowth2
implementation.

I'd like to ask our users here for their opinion: is anybody opposed
to removing the frequent pattern mining code from Mahout? Please shout
out.

--sebastian






Re: Reading the wiki

2014-04-28 Thread Sebastian Schelter
Would someone be willing to open a jira ticket for this issue and fix 
the problem?


--sebastian

On 04/28/2014 01:05 AM, Ted Dunning wrote:

Mathjax is both static content and server.

There is an FAQ about this https problem.  I think that part of the issue
is that they don't use the same URL for both http and https connections.

http://www.mathjax.org/resources/faqs/#problem-https

The URL that they suggest to use for getting mathjax.js is

https://c328740.ssl.cf1.rackcdn.com/mathjax/latest/MathJax.js

This is required because the rackspace cdn won't allow the http address to
be used with https.  Perversely, this https address also breaks when used
with http.

My guess is that if we update our css/headers/templates to use this https
address then things will work.


On Sun, Apr 27, 2014 at 11:59 PM, Dmitriy Lyubimov dlie...@gmail.com wrote:


I think we would have to host MathJax to appease the browsers under the https
handshake. I am not sure what would be involved in that; I am not sure
whether MathJax is solely static content or an actual server doing
something.


On Sun, Apr 27, 2014 at 12:41 AM, Sebastian Schelter s...@apache.org
wrote:


What if we store a copy of the js file on our site and also serve it via
https?


On 04/27/2014 05:34 AM, Pat Ferrel wrote:


Often CMSs have a way to configure https access to be used only for
password or other secure areas of the site. No idea if the Apache CMS
does this but worth asking. If there is no https fix, seems like Mathjax
should be discontinued.


On Apr 26, 2014, at 8:03 PM, Dmitriy Lyubimov dlie...@gmail.com

wrote:


I have no solution for https. It is most likely a security thing.

I just asked that whoever writes the blog fix https links to simple
unsecured ones.
On Apr 26, 2014 6:24 PM, Andrew Musselman andrew.mussel...@gmail.com



wrote:

There was chat last week about this breaking, something about https vs
http in the link to Mathjax as I recall.

Dmitriy was dealing with it last I saw.

  On Apr 26, 2014, at 6:04 PM, Pat Ferrel p...@occamsmachete.com

wrote:


I probably missed some announcement but why is the math markup coming
out raw? Do I need a plugin or something?




\[\mathbf{G}=\mathbf{B}\mathbf{B}^{\top}-\mathbf{C}-\mathbf{C}^{\top}+\mathbf{s}_{q}\mathbf{s}_{q}^{\top}\boldsymbol{\xi}^{\top}\boldsymbol{\xi}\]














Re: Future of Frequent Pattern Mining

2014-04-28 Thread Sebastian Schelter

I'm very much in favor of this idea.

On 04/28/2014 10:52 AM, Ted Dunning wrote:

One thought is to extract the code, publish on github with warnings about
no support.  Then if there are requests, we can point them to the GH
archive and tell them to go for it.




On Mon, Apr 28, 2014 at 10:03 AM, Suneel Marthi smar...@apache.org wrote:


+100 to purging this from the codebase. This stuff uses the old MR API and
would have to be upgraded, not to mention that this was removed from 0.9 and
was restored only because one user wanted it, who promised to maintain it
and has not been heard from.




On Mon, Apr 28, 2014 at 2:19 AM, Sebastian Schelter s...@apache.org
wrote:


Hi,

I'm resending this mail to also include the users list. To wrap up: We
currently have a discussion whether our frequent pattern mining package
should stay in the codebase. The original author suggested to remove the
original implementation and maybe retain the FPGrowth2 implementation.

I'd like to ask our users here for their opinion: is anybody opposed to
removing the frequent pattern mining code from Mahout? Please shout out.

--sebastian









Re: Reading the wiki

2014-04-27 Thread Sebastian Schelter
What if we store a copy of the js file on our site and also serve it via 
https?


On 04/27/2014 05:34 AM, Pat Ferrel wrote:

Often CMSs have a way to configure https access to be used only for password or 
other secure areas of the site. No idea if the Apache CMS does this but worth 
asking. If there is no https fix, seems like Mathjax should be discontinued.


On Apr 26, 2014, at 8:03 PM, Dmitriy Lyubimov dlie...@gmail.com wrote:

I have no solution for https. It is most likely a security thing.

I just asked that whoever writes the blog fix https links to simple
unsecured ones.
On Apr 26, 2014 6:24 PM, Andrew Musselman andrew.mussel...@gmail.com
wrote:


There was chat last week about this breaking, something about https vs
http in the link to Mathjax as I recall.

Dmitriy was dealing with it last I saw.


On Apr 26, 2014, at 6:04 PM, Pat Ferrel p...@occamsmachete.com wrote:

I probably missed some announcement but why is the math markup coming
out raw? Do I need a plugin or something?




\[\mathbf{G}=\mathbf{B}\mathbf{B}^{\top}-\mathbf{C}-\mathbf{C}^{\top}+\mathbf{s}_{q}\mathbf{s}_{q}^{\top}\boldsymbol{\xi}^{\top}\boldsymbol{\xi}\]







Welcome Pat Ferrel as new committer on Mahout

2014-04-24 Thread Sebastian Schelter

Hi,

this is to announce that the Project Management Committee (PMC) for 
Apache Mahout has asked Pat Ferrel to become committer and we are 
pleased to announce that he has accepted.


Being a committer enables easier contribution to the project since in 
addition to posting patches on JIRA it also gives write access to the 
code repository. That also means that now we have yet another person who 
can commit patches submitted by others to our repo *wink*


Pat, we look forward to working with you in the future. Welcome! It 
would be great if you could introduce yourself with a few words.


-s


Re: Spark Mahout with a CLI?

2014-04-20 Thread Sebastian Schelter

I'll create a jira ticket for this, as I have a little time to work on it.

On 04/16/2014 08:15 PM, Pat Ferrel wrote:

bug in the pseudo code, should use columnIds:

val hashedCrossIndicatorMatrix = new 
HashedSparseMatrix(indicatorMatrices(1), hashedDrms(0).columnIds(), 
hashedDrms(1).columnIds())
RecommendationExamplesHelper.saveIndicatorMatrix(hashedCrossIndicatorMatrix, 
"hdfs://some/path/for/output")

On Apr 16, 2014, at 10:00 AM, Pat Ferrel p...@occamsmachete.com wrote:

Great, and an excellent example is at hand. In it I will play the user and 
contributor role, Sebastian and Dmitriy the committer/scientist role.

I have a web site that uses a Mahout+Solr recommender—the video recommender 
demo site. This creates logfiles of the form

timestamp, userId, itemId, action
timestamp1, userIdString1, itemIdString1, "view"
timestamp2, userIdString2, itemIdString1, "like"

These are currently processed using the Solr-recommender example code and 
Hadoop Mahout. The input is split and accumulated into two matrices which could 
then be input to the new Spark cooccurrence analysis code (see the patch here: 
https://issues.apache.org/jira/browse/MAHOUT-1464)

val indicatorMatrices = cooccurrences(drmB, randomSeed = 0xdeadbeef,
maxInterestingItemsPerThing = 100, maxNumInteractions = 500, 
Array(drmA))

What I propose to do is replace my Hadoop Mahout impl by creating a new Scala (or 
maybe Java) class, call it HashedSparseMatrix for now. There will be a CLI-accessible 
job that takes the above logfile input and creates a HashedSparseMatrix. Inside the 
HashedSparseMatrix will be a drm SparseMatrix and two hashed dictionaries for row and 
column external-Id-to-Mahout-Id lookup.

The ‘cooccurrences' call would be identical and the data it deals with would 
also be identical. But the HashedSparseMatrix would be able to deliver two 
dictionaries, which store the dimensions length and are used to lookup string 
Ids from internal mahout ordinal integer Ids. These could be created with a 
helper function to read from logfiles.

val hashedDrms = readHashedSparseMatrices("hdfs://path/to/input/logfiles", 
"^actions-.*", "\t", 1, 2, "like", "view")

Here hashedDrms(0) is a HashedSparseMatrix corresponding to drmA, (1) = drmB.

When the output is written to a text file it will be creating a new 
HashedSparseMatrix from the cooccurrence indicator matrix and the original 
itemId dictionaries:

val hashedCrossIndicatorMatrix = new 
HashedSparseMatrix(indicatorMatrices(1), hashedDrms(0).rowIds(), 
hashedDrms(1).rowIds())
RecommendationExamplesHelper.saveIndicatorMatrix(hashedCrossIndicatorMatrix, 
"hdfs://some/path/for/output")

Here the two Id dictionaries are used to create output file(s) with external 
Ids.

Since I already have to do this for the demo site using Hadoop Mahout I’ll have 
to create a Spark impl of the wrapper for the new cross-cooccurrence indicator 
matrix. And since my scripting/web app language is not Scala the format for the 
output needs to be text.

I think this meets all issues raised here. No unnecessary import/export. 
Dmitriy doesn't need to write a CLI. Sebastian doesn't need to write a 
HashedSparseMatrix. The internal calculations are done on RDDs and the drms are 
never written to disk. AND the logfiles can be consumed directly, producing data 
that any language can consume directly, with external Ids used and preserved.


BTW: in the MAHOUT-1464 example the drms are read in serially single threaded 
but written out using Spark (unless I missed something). In the proposed impl 
the read and write would be Sparkified.

BTW2: Since this is a CLI interface to Spark Mahout it can be scheduled using 
cron directly with no additional processing pipeline and by people unfamiliar 
with Scala, the Spark shell, or internal Mahout Ids. Just as is done now on the 
demo site but with a lot of non-Mahout code.

BTW3: This type of thing IMO must be done for any Mahout job we want to be 
widely used. Otherwise we leave all of this wrapper code to be duplicated over 
and over again by users and expect them to know too much about Spark Mahout 
internals.



On Apr 15, 2014, at 6:45 PM, Ted Dunning ted.dunn...@gmail.com wrote:

Well... I think it is an issue that has to do with figuring out how to
*avoid* import and export as much as possible.


On Tue, Apr 15, 2014 at 6:36 PM, Pat Ferrel p...@occamsmachete.com wrote:


Which is why it’s an import/export issue.

On Apr 15, 2014, at 5:48 PM, Ted Dunning ted.dunn...@gmail.com wrote:

On Tue, Apr 15, 2014 at 10:58 AM, Pat Ferrel p...@occamsmachete.com
wrote:


The statement "There is not, nor do I think there will be, a way to
run this stuff with CLI" seems unduly misleading. Really, does anyone
second this?

There will be Scala scripts to drive this stuff, and yes, even from the CLI.

Do you imagine that every Mahout USER will be a Scala + Mahout DSL
programmer? That may be fine for committers but users will be PHP devs,
Ruby


Re: org.apache.mahout.math.IndexException

2014-04-20 Thread Sebastian Schelter
Yes, it should give you the necessary information. The important part is 
this:


Apply the patch with "patch -p 0 -i <path to patch>". Throw a --dry-run on 
there if you want to see what happens w/o screwing up your checkout.


On 04/20/2014 09:47 PM, Mario Levitin wrote:

Thanks Sebastian,

I have not applied a patch before. I found the following page
http://mahout.apache.org/developers/patch-check-list.html

is that description enough for applying a patch?


















Re: simple idea for improving mahout docs over the next month?

2014-04-18 Thread Sebastian Schelter

Hm,

I'm not so sure whether introducing another source for documentation 
other than the webpage would be so helpful (there's still lots of work to do on 
the website...). How do others see this?


--sebastian

On 04/17/2014 05:06 PM, Jay Vyas wrote:

Hi Sebastian: theoretically, one could extract all the information from a
mailing list search, but I think a rolling FAQ would be much more likely to
(1) evolve into real documentation and (2) be easily refined. Is
that a little convincing? If not, I guess we can table the idea; just a
thought.


On Thu, Apr 17, 2014 at 1:38 AM, Sebastian Schelter s...@apache.org wrote:


Hi Jay,

I'm not sure what the benefit of this approach is, people can already post
their questions to the mailinglist and get answers here, why would a google
doc be helpful?

--sebastian


On 04/16/2014 09:31 PM, Jay Vyas wrote:


Hi Mahout... I finally thought of a really easy way of ad-hoc improvement
of Mahout docs that can feed into the efforts to get formal docs
improved.

Any interest in creating a shared Mahout FAQ file in a Google doc?

We can easily start adding questions to it that point to obvious missing
documentation parts, and Mahout committers can add responses below inline.
Then over time we can take those questions/answers and turn them directly
into real docs.

I think this will make it easier for a broader range of people to rapidly
improve Mahout docs in an ad hoc sort of way. I for one will volunteer to
help translate the Q&A stream into real documentation / JIRAs etc.











Re: Performance Issue using item-based approach!

2014-04-18 Thread Sebastian Schelter

You can, but you shouldn't :)

On 04/18/2014 07:23 PM, Ted Dunning wrote:

You can always run Hadoop in a local mode.  Nothing prevents a single node
from being a cluster.  :-)


On Thu, Apr 17, 2014 at 7:43 AM, Najum Ali naju...@googlemail.com wrote:


Ted,

Is it also possible to use ItemSimilarityJob in a non-distributed
environment?

On 17.04.2014 at 16:22, Ted Dunning ted.dunn...@gmail.com wrote:


Najum,

You should also be able to use the ItemSimilarityJob to compute a limited
indicator set.

This is stepping off of the path you have been on, but it would allow you
to deploy the recommender via a search engine.

That makes a lot of code simply vanish. This is also a well-trod
production path.




On Thu, Apr 17, 2014 at 3:57 AM, Najum Ali naju...@googlemail.com

wrote:



@Sebastian

wow … you are right. The original CSV file is about 21 MB and the
corresponding precomputed item-item similarity file is about 260 MB!!
And yes, there are way more than 50 "most similar items" for an item.

Trying to restrict this to 50 (or something like that) most similar
items for an item could do the trick as you said.
OK, I will give it a try and reply later.

By the way, what about the SamplingCandidateItemsStrategy or something
like this, by using this constructor:

GenericItemBasedRecommender(DataModel dataModel, ItemSimilarity similarity,
CandidateItemsStrategy candidateItemsStrategy,
MostSimilarItemsCandidateItemsStrategy mostSimilarItemsCandidateItemsStrategy)

(javadoc: https://builds.apache.org/job/mahout-quality/javadoc/org/apache/mahout/cf/taste/impl/recommender/GenericItemBasedRecommender.html)


On 17.04.2014 at 12:41, Sebastian Schelter s...@apache.org wrote:

Hi Najum,

I think I found the problem. Remember: Two items are similar whenever at
least one user interacted with both of them (the items co-occur).

In the movielens dataset this is true for almost all pairs of items,
unfortunately. From 3076 items, more than 11 million similarities are
created. A common approach for that (which is not yet implemented in our
precomputation unfortunately) is to only retain the top-k similar items
per item.

A solution would be to take the csv file that is created by the
MultithreadedBatchItemSimilarities and postprocess it so that only the
50 most similar items per item are retained. That should help with your
problem.

Unfortunately, we don't have code for that yet, maybe you want to try to
write that yourself?

Best,
Sebastian

PS: The user-based recommender restricts the number of similar users, I
guess that's why it is so fast here.
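
A sketch of that postprocessing, assuming the precomputed CSV has rows of the form itemA,itemB,similarity (check the actual format your similarities writer produces):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.PrintWriter;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TopKPostprocessor {
  public static void main(String[] args) throws Exception {
    int k = 50;
    // group the similarity rows by their first item
    Map<Long, List<String[]>> rowsPerItem = new HashMap<>();
    try (BufferedReader in = new BufferedReader(new FileReader("similarities.csv"))) {
      String line;
      while ((line = in.readLine()) != null) {
        String[] cols = line.split(",");
        rowsPerItem.computeIfAbsent(Long.parseLong(cols[0]), i -> new ArrayList<>())
            .add(cols);
      }
    }
    try (PrintWriter out = new PrintWriter("similarities-top50.csv")) {
      for (List<String[]> rows : rowsPerItem.values()) {
        // keep only the k entries with the highest similarity per item
        rows.sort(Comparator.comparingDouble(
            (String[] r) -> Double.parseDouble(r[2])).reversed());
        for (String[] r : rows.subList(0, Math.min(k, rows.size()))) {
          out.println(String.join(",", r));
        }
      }
    }
  }
}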


On 04/17/2014 12:18 PM, Najum Ali wrote:

Ok, here you go:

I have created a simple class with a main method (no server and other stuff):

public class RecommenderTest {
    public static void main(String[] args) throws IOException, TasteException {
        DataModel dataModel = new FileDataModel(new File("/Users/najum/Documents/recommender-console/src/main/webapp/resources/preference_csv/1mil.csv"));
        ItemSimilarity similarity = new LogLikelihoodSimilarity(dataModel);
        ItemBasedRecommender recommender = new GenericItemBasedRecommender(dataModel, similarity);

        String pathToPreComputedFile = preComputeSimilarities(recommender, dataModel.getNumItems());

        InputStream inputStream = new FileInputStream(new File(pathToPreComputedFile));
        BufferedReader bufferedReader = new BufferedReader(new InputStreamReader(inputStream));
        Collection<GenericItemSimilarity.ItemItemSimilarity> correlations =
            bufferedReader.lines().map(mapToItemItemSimilarity).collect(Collectors.toList());

        ItemSimilarity precomputedSimilarity = new GenericItemSimilarity(correlations);
        ItemBasedRecommender recommenderWithPrecomputation = new GenericItemBasedRecommender(dataModel, precomputedSimilarity);

        recommend(recommender);
        recommend(recommenderWithPrecomputation);
    }

    private static String preComputeSimilarities(ItemBasedRecommender recommender, int simItemsPerItem) throws TasteException {
        String pathToAbsolutePath = "";
        try {
            File resultFile = new File(System.getProperty("java.io.tmpdir"), "similarities.csv");
            if (resultFile.exists()) {
                resultFile.delete();
            }
            BatchItemSimilarities batchJob = new MultithreadedBatchItemSimilarities(recommender, simItemsPerItem);
            int numSimilarities

Re: Installation on Ubuntu

2014-04-18 Thread Sebastian Schelter

Which version do you use? It shouldn't be a problem with Oracle Java.

--sebastian

On 04/18/2014 09:39 PM, Christopher Eugene wrote:

Hello,
I want to install Mahout on Ubuntu 14.04. I had previously tried in vain to
install on 13.10. Could the version of Java be the problem? I am compiling
from source. Any help will be appreciated.





Re: Installation on Ubuntu

2014-04-18 Thread Sebastian Schelter
That is wrong, but you could use a server such as PredictionIO (which 
uses Mahout internally) with PHP.


--sebastian

On 04/18/2014 09:49 PM, Christopher Eugene wrote:

@Sebastian I have version 1.7. @Andrew I plan on using Mahout with PHP,
since I heard that there is a new API, or am I wrong?


On Fri, Apr 18, 2014 at 10:45 PM, Andrew Musselman 
andrew.mussel...@gmail.com wrote:



I would say if you want to get started, just grab the pre-built version via
the download button on the home page of http://mahout.apache.org

E.g., following those links you would end up here:
http://apache.cs.utah.edu/mahout/0.9 and then get either the -src or
non--src version and use the pre-built jars and examples.













Re: Installation on Ubuntu

2014-04-18 Thread Sebastian Schelter

You can, but I'm not sure how much we can help you. Give it a try :)

On 04/18/2014 10:11 PM, Christopher Eugene wrote:

sorry I thought I replied to it :). I can ask predictionio related
questions on the list too?


On Fri, Apr 18, 2014 at 11:06 PM, Sebastian Schelter s...@apache.org wrote:


Please reply to the list, not to me in person :)


On 04/18/2014 10:05 PM, Christopher Eugene wrote:


Thank you Sebastian, I could've sworn I saw something involving Mahout and
PHP not so long ago. Quick question: are all the methods available in
Mahout available in PredictionIO?


On Fri, Apr 18, 2014 at 10:53 PM, Sebastian Schelter s...@apache.org
wrote:



That is wrong, but you could use a server such as PredictionIO (which
uses
Mahout internally) with PHP.

--sebastian

























Re: org.apache.mahout.math.IndexException

2014-04-18 Thread Sebastian Schelter

Hi Mario,

this is indeed a bug. The problem is that the CF code (taste) uses long 
ids, while our math library internally uses int keys.


I'll open a jira and post a patch that will hopefully help you.

--sebastian
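
To see why large hashed IDs trigger this, note what the narrowing cast does. A small sketch; the dense-index workaround at the bottom is a suggestion, not existing Mahout API:

import java.util.HashMap;
import java.util.Map;

public class IdOverflowSketch {
  public static void main(String[] args) {
    // a large hashed ID, like the ones MemoryIDMigrator produces
    long hashedId = 8_589_934_593L;  // greater than Integer.MAX_VALUE
    int cast = (int) hashedId;       // the narrowing cast keeps only the low 32 bits
    System.out.println(cast);        // prints 1 here; often negative for real hashes
  }

  // Suggested workaround until the patch lands: index external IDs densely,
  // so that every assigned long fits safely into an int.
  static final Map<String, Long> INDEX = new HashMap<>();

  static long toMahoutId(String externalId) {
    return INDEX.computeIfAbsent(externalId, k -> (long) INDEX.size());
  }
}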

On 04/18/2014 11:03 PM, Mario Levitin wrote:

In my dataset ID's are strings so I use MemoryIDMigrator. This migrator
produces large longs.
I'm not doing any translation.

I could not understand why there is a cast to int in the Mahout code. This
will produce errors for large long values.


On Fri, Apr 18, 2014 at 8:06 PM, Ted Dunning ted.dunn...@gmail.com wrote:


Are you translating the ID's down into a range that will fit into int's?




On Thu, Apr 17, 2014 at 3:02 PM, Mario Levitin mariolevi...@gmail.com

wrote:



Hi,

I'm trying to run the ALS algorithm. However, I get the following error:

Exception in thread "pool-1-thread-3"
org.apache.mahout.math.IndexException: Index -691877539 is outside
allowable range of [0,2147483647)
at org.apache.mahout.math.AbstractVector.set(AbstractVector.java:395)
at org.apache.mahout.cf.taste.impl.recommender.svd.ALSWRFactorizer.sparseUserRatingVector(ALSWRFactorizer.java:305)


At line 305 in ALSWRFactorizer.java, there is the following code

ratings.set((int) preference.getItemID(), preference.getValue());

My suspicion is that the error results from the casting to int in the
above line. Item IDs in Mahout are long, so if you cast a long (which does not
fit into an int) then you will get negative numbers and hence the error.

However, this explanation also seems to me implausible since I don't
think such an error exists in Mahout code.

Any help will be appreciated.
Thanks









Re: org.apache.mahout.math.IndexException

2014-04-18 Thread Sebastian Schelter

Mario,

could you check whether the patch from 
https://issues.apache.org/jira/browse/MAHOUT-1517 fixes your problem?


Best,
Sebastian

On 04/18/2014 11:03 PM, Mario Levitin wrote:

In my dataset ID's are strings so I use MemoryIDMigrator. This migrator
produces large longs.
I'm not doing any translation.

I could not understand why there is a cast to int in the Mahout code. This
will produce errors for large long values.











Re: simple idea for improving mahout docs over the next month?

2014-04-17 Thread Sebastian Schelter

Hi Najum,

please write a new mail to ask a question and don't reply to an 
unrelated thread -- https://people.apache.org/~hossman/#threadhijack


If you write a new mail, I'm sure we can help you with your recommender 
problem. Can you give us a few more details, such as the similarity that 
you used, how you did the precomputation and how you exactly measure the 
response time?


--sebastian



On 04/17/2014 10:49 AM, Najum Ali wrote:

Hi guys,

I'm pretty much new to Mahout and I'm working on this problem here:

I have created a precomputed item-item-similarity collection for a 
GenericItemBasedRecommender.
Using the 1M MovieLens data, my item-based recommender is only 40-50% faster 
than without precomputation (like 589.5ms instead of 1222.9ms).
But the user-based recommender is really fast, like 24.2ms? How 
can this happen?

Why is item-based so slow?





Re: Performance Issue using item-based approach!

2014-04-17 Thread Sebastian Schelter
Could you take the output of the precomputation, feed it into a 
standalone recommender and test it there?
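
Such a standalone test could look roughly like this (the file names and the user ID are placeholders, and it assumes the precomputed file has itemA,itemB,similarity rows):

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.util.ArrayList;
import java.util.List;

import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.recommender.GenericItemBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.GenericItemSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;

public class StandaloneTimingTest {
  public static void main(String[] args) throws Exception {
    DataModel model = new FileDataModel(new File("1mil.csv"));
    // load the precomputed similarities written by the batch job
    List<GenericItemSimilarity.ItemItemSimilarity> sims = new ArrayList<>();
    try (BufferedReader in = new BufferedReader(new FileReader("similarities.csv"))) {
      String line;
      while ((line = in.readLine()) != null) {
        String[] cols = line.split(",");
        sims.add(new GenericItemSimilarity.ItemItemSimilarity(
            Long.parseLong(cols[0]), Long.parseLong(cols[1]),
            Double.parseDouble(cols[2])));
      }
    }
    GenericItemBasedRecommender recommender =
        new GenericItemBasedRecommender(model, new GenericItemSimilarity(sims));
    long start = System.nanoTime();
    recommender.recommend(1L, 10); // placeholder user ID
    System.out.println((System.nanoTime() - start) / 1_000_000 + " ms");
  }
}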



On 04/17/2014 11:37 AM, Najum Ali wrote:

@sebastian


Are you sure that the precomputation is done only once and not in every request?

Yes, a @Bean annotated object is in Spring by default a singleton instance.
I also just tested it using a System.out.println().
Here is my log:

System.out.println("precomputation done!") is called before returning the
GenericItemSimilarity.

The first two recommendations are item-based with Pearson similarity.
The third and 4th logs are also item-based using the precomputed similarity.
The last log is the user-based recommender using Pearson.

Look at the huge time difference!

On 17.04.2014 at 11:23, Sebastian Schelter s...@apache.org wrote:


Najum,

this is really strange, feeding an ItemBased Recommender with precomputed
similarities should give you superfast recommendations.

Are you sure that the precomputation is done only once and not in every request?

--sebastian

On 04/17/2014 11:17 AM, Najum Ali wrote:

Hi guys,

I have created a precomputed item-item-similarity collection for a
GenericItemBasedRecommender.
Using the 1M MovieLens data, my item-based recommender is only 40-50% faster
than without precomputation (like 589.5ms instead of 1222.9ms).
But the user-based recommender instead is really fast, it's like 24.2ms? How can
this happen?

Here are more details to my Implementation:

CSV File: 1M pref, 6040 Users, 3706 Items

For my implementation I'm using screenshots, because they give good
highlighting.
My recommender runs inside a webserver (Jetty) using Spring 4 and Java 8. I
receive recommendations as a webservice (JSON).

For the DataModel, I'm using FileDataModel.


The code below creates a precomputed ItemSimilarity when I start the
webserver and the property isItemPreComputationEnabled is set to true:


For time measuring I'm using AOP. I'm measuring the whole time from entering my
controller to sending the response, based on System.nanoTime() and taking the
diff. It's the same time measurement for user-based.

I have tried caching the recommender and the similarity, with no big
difference. I also tried to use CandidateItemsStrategy and
MostSimilarItemsCandidateItemsStrategy, but also got no performance boost.

public RecommenderBuilder createRecommenderBuilder(ItemSimilarity similarity)
    throws TasteException {
  final int numberOfUsers = dataModel.getNumUsers();
  final int numberOfItems = dataModel.getNumItems();
  // sample candidate items to bound the work done per request
  CandidateItemsStrategy candidateItemsStrategy =
      new SamplingCandidateItemsStrategy(numberOfUsers, numberOfItems);
  MostSimilarItemsCandidateItemsStrategy mostSimilarStrategy =
      new SamplingCandidateItemsStrategy(numberOfUsers, numberOfItems);
  return model -> new GenericItemBasedRecommender(model,
      similarity, candidateItemsStrategy, mostSimilarStrategy);
}

I don't know why item-based is taking so much longer than user-based. User-based
is fast as hell. I even tried a dataset with 100k prefs, and 10 million
(MovieLens). Every time, user-based is so much faster, for any similarity.

Hope anyone can help me to understand this. Maybe I'm doing something wrong.

Thanks!! :))











Re: Performance Issue using item-based approach!

2014-04-17 Thread Sebastian Schelter
Yes, just to make sure the problem is in the mahout code and not in the 
surrounding environment.


On 04/17/2014 11:43 AM, Najum Ali wrote:

@Sebastian
What do you mean with a standalone recommender? A simple offline Java main
program?

On 17.04.2014 at 11:41, Sebastian Schelter s...@apache.org wrote:


Could you take the output of the precomputation, feed it into a standalone 
recommender and test it there?


On 04/17/2014 11:37 AM, Najum Ali wrote:

@sebastian


Are you sure that the precomputation is done only once and not in every request?

Yes, a @Bean-annotated object is by default a singleton instance in Spring.
I also just tested it out using a System.out.println().
Here is my log:

System.out.println("precomputation done!") is called before returning the
GenericItemSimilarity.

The first two recommendations are item-based - Pearson similarity.
The third and 4th logs are also item-based, using the precomputed similarity.
The last log is the user-based recommender using Pearson.

Look at the huge time difference!

On 17.04.2014 at 11:23, Sebastian Schelter s...@apache.org wrote:


Najum,

this is really strange, feeding an ItemBased Recommender with precomputed
similarities should give you superfast recommendations.

Are you sure that the precomputation is done only once and not in every request?

--sebastian

On 04/17/2014 11:17 AM, Najum Ali wrote:

Hi guys,

I have created a precomputed item-item-similarity collection for a
GenericItemBasedRecommender.
Using the 1M MovieLens data, my item-based recommender is only 40-50% faster
than without precomputation (like 589.5ms instead of 1222.9ms).
But the user-based recommender instead is really fast, it's like 24.2ms? How can
this happen?

Here are more details to my Implementation:

CSV File: 1M pref, 6040 Users, 3706 Items

For my implementation I'm using screenshots, because they give good
highlighting.
My recommender runs inside a webserver (Jetty) using Spring 4 and Java 8. I
receive recommendations as a webservice (JSON).

For the DataModel, I'm using FileDataModel.


The code below creates a precomputed ItemSimilarity when I start the
webserver and the property isItemPreComputationEnabled is set to true:


For time measuring I'm using AOP. I'm measuring the whole time from entering my
controller to sending the response, based on System.nanoTime() and taking the
diff. It's the same time measurement for user-based.

I have tried caching the recommender and the similarity, with no big
difference. I also tried to use CandidateItemsStrategy and
MostSimilarItemsCandidateItemsStrategy, but also got no performance boost.

public RecommenderBuilder createRecommenderBuilder(ItemSimilarity similarity)
    throws TasteException {
  final int numberOfUsers = dataModel.getNumUsers();
  final int numberOfItems = dataModel.getNumItems();
  // sample candidate items to bound the work done per request
  CandidateItemsStrategy candidateItemsStrategy =
      new SamplingCandidateItemsStrategy(numberOfUsers, numberOfItems);
  MostSimilarItemsCandidateItemsStrategy mostSimilarStrategy =
      new SamplingCandidateItemsStrategy(numberOfUsers, numberOfItems);
  return model -> new GenericItemBasedRecommender(model,
      similarity, candidateItemsStrategy, mostSimilarStrategy);
}

I don't know why item-based is taking so much longer than user-based. User-based
is fast as hell. I even tried a dataset with 100k prefs, and 10 million
(MovieLens). Every time, user-based is so much faster, for any similarity.

Hope anyone can help me to understand this. Maybe I'm doing something wrong.

Thanks!! :))







Re: Is there any website documentation repository or tool for Apache Mahout?

2014-04-17 Thread Sebastian Schelter
The templates for the individual pages are in the svn under site/ in
markdown format. You can use an online markdown editor to approximately see
how they look like.

We don't have a better solution yet, unfortunately.

--sebastian
On 17.04.2014 at 20:09, Andrew Musselman andrew.mussel...@gmail.com wrote:

 The content of the main part of each page is written in markdown and
 parsed by the CMS to render the HTML.  I'm not aware of a way to submit
 pages except as patches.

  On Apr 17, 2014, at 1:52 PM, Pat Ferrel p...@occamsmachete.com wrote:
 
  +1
 
  the project uses Confluence for the wiki. All but committers are blocked
 from editing pages.
 
  This is getting increasingly frustrating. How many tickets and patches
 are being passed around now? I can’t follow them all. I haven’t used
 Confluence for 4-5 years now but there must be some way to allow edits and
 new pages from anyone pending approval to publish?
 
  On Apr 17, 2014, at 4:47 AM, tuxdna tux...@gmail.com wrote:
 
  I have seen the instructions here[1], but I am not sure if there is
  any source-code for the documentation for website.
 
  So here are my questions:
 
  * Does Apache Mahout project use any tool to generate website
  documentation as it is now http://mahout.apache.org ?
 
  * Suppose I want to add some correction or addition to the current Apache
  Mahout documentation. Can I get read-only access to the source of the
  website, so that I can immediately see how the edits will be reflected once
  they are accepted?
 
  I was thinking in terms of the way GitHub pages work. For example if I
  use Jekyll, I can view the changes on my machine, exactly as they will
  appear on the final website.
 
 
  Regards,
  Saleem
 
 
  [1] http://mahout.apache.org/developers/how-to-update-the-website.html
  [2] https://pages.github.com/
  [3] http://jekyllrb.com/
 



Re: simple idea for improving mahout docs over the next month?

2014-04-16 Thread Sebastian Schelter

Hi Jay,

I'm not sure what the benefit of this approach is; people can already 
post their questions to the mailing list and get answers here, why would 
a google doc be helpful?


--sebastian

On 04/16/2014 09:31 PM, Jay Vyas wrote:

hi mahout... i finally thought of a really easy way of ad-hoc improvement
of mahout docs, that can feed into the efforts to get formal docs improved.

Any interest in creating a shared mahout FAQ file in a google doc.?

we can easily start adding questions into it that point to obvious missing
documentation parts, and mahout committers can add responses below inline.
then over time we can take those questions/answers and turn them directly
into real docs.

I think this will make it easier for a broader range of people to rapidly
improve mahout docs in an ad hoc sort of way.  I for one will volunteer to
help translate the Q&A stream into real documentation / JIRAs etc.





Documentation, Documentation, Documentation

2014-04-13 Thread Sebastian Schelter

Hi,

this is another reminder that we still have to finish our documentation 
improvements! The website looks shiny now and there have been lots of 
discussions about new directions, but we still have some work to do in 
cleaning up webpages. We should especially make sure that the examples work.


Please help with that; anyone who is willing to sacrifice some time to go 
through a webpage and try out the steps described is of great help to 
the project. It would also be awesome to get some help in creating a few 
new pages, especially for the recommenders.


Here's the list of documentation related jira's for 1.0:

https://issues.apache.org/jira/browse/MAHOUT-1441?jql=project%20%3D%20MAHOUT%20AND%20component%20%3D%20Documentation%20AND%20resolution%20%3D%20Unresolved%20ORDER%20BY%20due%20ASC%2C%20priority%20DESC%2C%20created%20ASC

Best,
Sebastian


Re: PreferenceArray userID uniqeness?

2014-04-11 Thread Sebastian Schelter

Yes, it's a unique identifier for a user.

--sebastian

On 04/11/2014 04:41 PM, Mike Summers wrote:

Does the userID of a PreferenceArray need to be unique across all entries
in a FastByIDMap?

I'm comparing two types of objects that contain the same set of traits;
however, it's possible that the userID (primary key) is not unique, as it
comes from two DB tables.

Thanks.





Re: Best practice for partial cartesian product

2014-04-08 Thread Sebastian Schelter
I don't know a good name for that. The problem is that a quadratic 
number of pairs needs to be emitted here. In our collaborative filtering 
code, we solve this through downsampling.


--sebastian

On 04/08/2014 10:08 AM, Reinis Vicups wrote:

Hi,

this is not mahout question directly, but I figured that you guys most
likely can answer it.

Actually I have two questions:

1. This: {(1,2); (1,3); (2,3)} is not a full Cartesian product, right? It
is missing (1,1); (2,2); (3,3); (2,1). My question is: what is it
called? Partial Cartesian? Asymmetric Cartesian?

2. If I try to build the product I described above in reducer, what
would be the best practice? My current code look like this:

 @Override
 public void reduce(final VarLongWritable key,
     final Iterable<VarLongWritable> values, final Context context)
     throws IOException, InterruptedException {

     final VarLongWritable[] valueArray = Iterables.toArray(values,
         VarLongWritable.class);

     // emit every unordered pair (i, j) with i < j
     for (int i = 0; i < valueArray.length; i++) {
         for (int j = i + 1; j < valueArray.length; j++) {
             context.write(new PairWritable(valueArray[i].get(),
                 valueArray[j].get()), customerPreferenceWritable);
         }
     }
 }

I don't feel quite right about this solution, since I make a copy of
the values in valueArray and believe that it will cost me
OutOfMemoryExceptions with larger data sets.

thanks and br
reinis




Re: Best practice for partial cartesian product

2014-04-08 Thread Sebastian Schelter

Have a look at the sampleDown method in RowSimilarityJob:

https://svn.apache.org/viewvc/mahout/trunk/core/src/main/java/org/apache/mahout/math/hadoop/similarity/cooccurrence/RowSimilarityJob.java?view=markup
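
The gist of it, as a simplified sketch (this is not the actual sampleDown
code, just the idea: cap how many values any key contributes, so the pair
emission above stays bounded at MAX_PER_KEY * (MAX_PER_KEY - 1) / 2 pairs):

import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class DownSampling {

  static final int MAX_PER_KEY = 500;

  // classic reservoir sampling: keep a uniform random subset of at most
  // MAX_PER_KEY values without materializing the whole iterable
  static <T> List<T> sample(Iterable<T> values, Random random) {
    List<T> reservoir = new ArrayList<T>(MAX_PER_KEY);
    int seen = 0;
    for (T value : values) {
      if (reservoir.size() < MAX_PER_KEY) {
        reservoir.add(value);
      } else {
        int pos = random.nextInt(seen + 1);
        if (pos < MAX_PER_KEY) {
          reservoir.set(pos, value);
        }
      }
      seen++;
    }
    return reservoir;
  }
}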

On 04/08/2014 10:33 AM, Reinis Vicups wrote:

Sebastian, thank your very much for your response.

Could you or anyone point me to the mahout classes where this is being
solved?

thank you guys
reinis

On 08.04.2014 10:27, Sebastian Schelter wrote:

I don't know a good name for that. The problem is that a quadratic
number of pairs needs to be emitted here. In our collaborative
filtering code, we solve this through downsampling.

--sebastian

On 04/08/2014 10:08 AM, Reinis Vicups wrote:

Hi,

this is not mahout question directly, but I figured that you guys most
likely can answer it.

Actually I have two questions:

1. This: {(1,2); (1,3); (2,3)} is not a full Cartesian product, right? It
is missing (1,1); (2,2); (3,3); (2,1). My question is: what is it
called? Partial Cartesian? Asymmetric Cartesian?

2. If I try to build the product I described above in reducer, what
would be the best practice? My current code look like this:

 @Override
 public void reduce(final VarLongWritable key,
     final Iterable<VarLongWritable> values, final Context context)
     throws IOException, InterruptedException {

     final VarLongWritable[] valueArray = Iterables.toArray(values,
         VarLongWritable.class);

     // emit every unordered pair (i, j) with i < j
     for (int i = 0; i < valueArray.length; i++) {
         for (int j = i + 1; j < valueArray.length; j++) {
             context.write(new PairWritable(valueArray[i].get(),
                 valueArray[j].get()), customerPreferenceWritable);
         }
     }
 }

I don't feel quite right about this solution, since I make a copy of
the values in valueArray and believe that it will cost me
OutOfMemoryExceptions with larger data sets.

thanks and br
reinis






Re: Can any one help

2014-04-08 Thread Sebastian Schelter

It seems there is a problem with your HDFS; how did you configure that?

--sebastian

On 04/08/2014 07:23 PM, Neetha wrote:

Hi,


I am trying to run Mahout k-means clustering on Hadoop, but I am getting
this error:


hduser3@ubuntu:/usr/local/hadoop-1.0.1/mahout3$ bin/mahout seqdirectory \
  -i mahout-work/reuters-out \
  -o mahout-work/reuters-out-seqdir \
  -c UTF-8 -chunk 5
Warning: $HADOOP_HOME is deprecated.

Running on hadoop, using /usr/local/hadoop-1.0.1/bin/hadoop and
HADOOP_CONF_DIR=
MAHOUT-JOB: /usr/local/hadoop-1.0.1/mahout3/examples/target/
mahout-examples-0.7-job.jar
Warning: $HADOOP_HOME is deprecated.

14/04/07 12:10:14 INFO common.AbstractJob: Command line arguments:
{--charset=[UTF-8], --chunkSize=[5], --endPhase=[2147483647],
--fileFilterClass=[org.apache.mahout.text.PrefixAdditionFilter],
--input=[mahout-work/reuters-out], --keyPrefix=[],
--output=[mahout-work/reuters-out-seqdir], --startPhase=[0],
--tempDir=[temp]}
14/04/07 12:10:15 WARN hdfs.DFSClient: DataStreamer Exception:
org.apache.hadoop.ipc.RemoteException: java.io.IOException: File
/user/hduser3/mahout-work/reuters-out-seqdir/chunk-0 could only be
replicated to 0 nodes, instead of 1
 at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.
getAdditionalBlock(FSNamesystem.java:1556)
 at org.apache.hadoop.hdfs.server.namenode.NameNode.addBlock(
NameNode.java:696)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at sun.reflect.NativeMethodAccessorImpl.invoke(
NativeMethodAccessorImpl.java:57)
 at sun.reflect.DelegatingMethodAccessorImpl.invoke(
DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:616)
 at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:563)
 at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1388)
 at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1384)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:416)
 at org.apache.hadoop.security.UserGroupInformation.doAs(
UserGroupInformation.java:1093)
 at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1382)

 at org.apache.hadoop.ipc.Client.call(Client.java:1066)
 at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:225)
 at $Proxy1.addBlock(Unknown Source)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at sun.reflect.NativeMethodAccessorImpl.invoke(
NativeMethodAccessorImpl.java:57)
 at sun.reflect.DelegatingMethodAccessorImpl.invoke(
DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:616)
 at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(
RetryInvocationHandler.java:82)
 at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(
RetryInvocationHandler.java:59)
 at $Proxy1.addBlock(Unknown Source)
 at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.
locateFollowingBlock(DFSClient.java:3507)
 at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.
nextBlockOutputStream(DFSClient.java:3370)
 at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.
access$2700(DFSClient.java:2586)
 at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$
DataStreamer.run(DFSClient.java:2826)

14/04/07 12:10:15 WARN hdfs.DFSClient: Error Recovery for block null bad
datanode[0] nodes == null
14/04/07 12:10:15 WARN hdfs.DFSClient: Could not get block locations.
Source file /user/hduser3/mahout-work/reuters-out-seqdir/chunk-0 -
Aborting...
Apr 7, 2014 12:10:15 PM com.google.common.io.Closeables close
WARNING: IOException thrown while closing Closeable.
org.apache.hadoop.ipc.RemoteException: java.io.IOException: File
/user/hduser3/mahout-work/reuters-out-seqdir/chunk-0 could only be
replicated to 0 nodes, instead of 1
 at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.
getAdditionalBlock(FSNamesystem.java:1556)
 at org.apache.hadoop.hdfs.server.namenode.NameNode.addBlock(
NameNode.java:696)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at sun.reflect.NativeMethodAccessorImpl.invoke(
NativeMethodAccessorImpl.java:57)
 at sun.reflect.DelegatingMethodAccessorImpl.invoke(
DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:616)
 at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:563)
 at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1388)
 at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1384)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:416)
 at org.apache.hadoop.security.UserGroupInformation.doAs(
UserGroupInformation.java:1093)
 at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1382)

 at 

Re: Solr+Mahout Recommender Demo Site

2014-04-06 Thread Sebastian Schelter

The top 3 recommendations based on videos you liked are very good!

Nice job.


On 04/06/2014 07:26 PM, Pat Ferrel wrote:

After having integrated several versions of the Mahout and Myrrix recommenders 
at fairly large scale, I was interested in solving three problems that these 
did not directly provide for:
1) realtime queries for recs using data not yet incorporated into the training 
set. Myrrix allows this but Mahout using the hadoop mr version does not.
2) cross-recommendations from two or more action types (say purchase and 
detail-view)
3) blending metadata and user preference data to return recs (for example category 
+ user preferences = recs)

Using Solr + Mahout provided an amazingly flexible and performant way to do 
this. Ted wrote about his experience with this basic approach in his recent 
book. Take user preferences, run them through RowSimilarityJob and you get an 
item by item similarity Matrix. This is the core of an item-based cooccurrence 
recommender. If you take the similarity matrix, and convert it into a list of 
tokens per row, you have something Solr can index. If you then use a user’s 
history as a query on the indexed data you get an ordered list of 
recommendations.
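
As a toy illustration of those two steps (all IDs and the field layout are
invented for this sketch; the real pipeline reads the RowSimilarityJob output):

import java.util.Arrays;
import java.util.List;

public class IndicatorFieldSketch {

  // Index side: one Solr document per item; its "indicators" field holds the
  // IDs from that item's row of the similarity matrix, space-delimited, so
  // Solr's analyzer treats each ID as a token.
  static String toIndicatorTokens(List<Long> similarItems) {
    StringBuilder tokens = new StringBuilder();
    for (long similar : similarItems) {
      if (tokens.length() > 0) {
        tokens.append(' ');
      }
      tokens.append(similar);
    }
    return tokens.toString();
  }

  public static void main(String[] args) {
    // item 7's row says it co-occurs with items 3, 12 and 99
    System.out.println(toIndicatorTokens(Arrays.asList(3L, 12L, 99L)));
    // Query side: a user's history becomes a query over the same field,
    // e.g. indicators:(3 12); Solr's relevance ranking orders the recs.
  }
}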

When I set out to do #1 and #3 the need for CF data AND metadata was the first 
problem. So I mined the web for video reviews and video metadata. Then logging 
any users who visit the site will lead to data for #2 and #1.

The demo site is https://guide.finderbots.com and instructions are at the end 
of this for anyone who would like to test it out. As a crude user test there is 
a procedure we ask you to follow to help gather quality of recommendations 
data. It’s running out of my closet over Comcast so if it’s down I may have 
tripped over a cord, sorry try again later.

There are a bunch of different methods for making recs illustrated on the site. 
One method that illustrates blending metadata uses preference data from you, 
and metadata to bias and filter recs. Imagine that you have trained the system 
with your preferences by making some video picks. Now imagine you’d like to get 
recommendations for Comedies from Neflix based on your previous video 
preferences. This is done with a single Solr query on indexed video fields that 
hold genre, similar videos (from the similarity matrix), and sources. The query 
finds similar videos to the ones you have liked, with the genre “Comedy” 
boosted by some amount, but only those that have at least one source = 
“Netflix”.

I’ll be doing some blog posts covering the specifics of how each rec type is 
done, the site and DB architecture, and Solr setup.

The project uses the Solr recommender prep code here: 
https://github.com/pferrel/solr-recommender

BTW I plan to publish obfuscated usage data in the github repo.

begin form letter ===

Please use a very newly updated browser (latest Firefox, Chrome, Safari, and 
nothing older than IE10) the site doesn’t yet check browser compatibility but 
relies on HTML5 and CSS3 rather heavily.

1) go to https://guide.finderbots.com/users/sign_up to create an account
2) go to https://guide.finderbots.com/trainers to 'train' the recommender: hit 
thumbs up on videos you like. There are 20 pages of training videos; you can 
leave at any time, but if you can go through them all it would be appreciated.
3) go to https://guide.finderbots.com/guides/recommend to immediately get 
personalized recs from your training data. If you completed the trainer check 
the top line of recs, count how many are videos you liked or would like to see. 
Scroll right or left to see a total of 24 in four batches of 6. If you could 
report to me the total you thought were good recs it would be greatly 
appreciated.
4) browse videos by various criteria here: https://guide.finderbots.com/guides 
These are not recommendations, they are simply a catalog.
5) control how you browse videos by clicking the gears icon. You can set all 
videos to be from one or more sources here. If you choose Netflix alone (don’t 
forget to uncheck ‘all’) then recs and browsed videos will all be available on 
Netflix.







Re: Number of features for ALS

2014-03-30 Thread Sebastian Schelter
Use k-fold cross-validation or hold-out tests for estimating the quality 
of different parameter combinations.
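
For example, a hold-out grid search over the two knobs with the Taste
evaluator (the grids, iteration count and file name are just illustrative):

import java.io.File;
import org.apache.mahout.cf.taste.eval.RecommenderEvaluator;
import org.apache.mahout.cf.taste.impl.eval.RMSRecommenderEvaluator;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.recommender.svd.ALSWRFactorizer;
import org.apache.mahout.cf.taste.impl.recommender.svd.SVDRecommender;
import org.apache.mahout.cf.taste.model.DataModel;

public class AlsGridSearch {
  public static void main(String[] args) throws Exception {
    DataModel model = new FileDataModel(new File("ratings.csv"));
    RecommenderEvaluator evaluator = new RMSRecommenderEvaluator();
    for (int numFeatures : new int[] { 10, 20, 50, 100 }) {
      for (double lambda : new double[] { 0.01, 0.05, 0.1 }) {
        final int f = numFeatures;
        final double l = lambda;
        // train on a random 90% of the data, compute RMSE on the held-out 10%
        double rmse = evaluator.evaluate(
            dataModel -> new SVDRecommender(dataModel,
                new ALSWRFactorizer(dataModel, f, l, 20)),
            null, model, 0.9, 1.0);
        System.out.printf("features=%d lambda=%.2f rmse=%.4f%n", f, l, rmse);
      }
    }
  }
}

Pick the cheapest combination whose error is within noise of the best one;
repeating the evaluation with several random splits approximates k-fold
cross-validation.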


--sebastian

On 03/30/2014 11:53 AM, Niklas Ekvall wrote:

Hi,

My name is Niklas Ekvall and I have an implementation of the recommender
algorithm Large-scale Parallel Collaborative Filtering for the Netflix
Prize, and now I'm wondering how to choose the number of features and
lambda. Could any of you guys explain a stepwise strategy to choose
or optimize these two parameters?

Best regards, Niklas


2014-03-27 19:07 GMT+01:00 j.barrett Strausser 
j.barrett.straus...@gmail.com:


Thanks Ted,

Yes for the time problem. We tend to use aggregations of session data. So
instead of asking for user recommendations we do things like user+sessions
recommendations.

Of course, deciding when sessions start and stop isn't trivial. Ideally,
what I would want to do is time-weight views using a kernel or convolution.
That's a bit heavy, so we typically have a global model, that is,
basically all preferences over time. Then these user+session type models.
We can then combine these at another level to give recommendations based on
what you like throughout time versus what you have been doing recently.



-b


On Thu, Mar 27, 2014 at 1:59 PM, Ted Dunning ted.dunn...@gmail.com
wrote:


For the poly-syllable challenged,

heteroscedasticity - degree of variation changes.  This is common with
counts because you expect the standard deviation of count data to be
proportional to sqrt(n).

time inhomogeneity - changes in behavior over time.  One way to handle this
(roughly) is to first remove variation in personal and item means over time
(if using ratings) and then to segment user histories into episodes.  By
including both short and long episodes you get some repair for changes in
personal preference.  A great example of how this works/breaks is Christmas
music.  On December 26th, you want to *stop* recommending this music, so it
really pays to limit histories at this point.  By having an episodic user
session that starts around November and runs to Christmas, you can get good
recommendations for seasonal songs and not pollute the rest of the universe.



On Thu, Mar 27, 2014 at 8:30 AM, j.barrett Strausser 
j.barrett.straus...@gmail.com wrote:


For my team it has usually been heteroscedasticity and time
inhomogeneity.





On Thu, Mar 27, 2014 at 10:18 AM, Tevfik Aytekin
tevfik.ayte...@gmail.com wrote:


Interesting topic.
Ted, can you give examples of those mathematical assumptions
underpinning ALS which are violated by the real world?

On Thu, Mar 27, 2014 at 3:43 PM, Ted Dunning ted.dunn...@gmail.com
wrote:

How can there be any other practical method?  Essentially all of the
mathematical assumptions underpinning ALS are violated by the real world.
Why would any mathematical consideration of the number of features be much
more than heuristic?

That said, you can make an information content argument.  You can also make
the argument that if you take too many features, it doesn't hurt much, so
you should always take as many as you can compute.



On Thu, Mar 27, 2014 at 6:33 AM, Sebastian Schelter 

s...@apache.org

wrote:



Hi,

does anyone know of a principled approach of choosing the number of
features for ALS (other than cross-validation)?

--sebastian







--


https://github.com/bearrito
@deepbearrito







--


https://github.com/bearrito
@deepbearrito







Re: (help!) Can someone scan this

2014-03-29 Thread Sebastian Schelter

Jay,

Which version of Mahout are you using? Have you tried to explicitly set 
the temp path?


--sebastian

On 03/29/2014 01:52 AM, Jay Vyas wrote:

Hi again mahout:

I'm wrapping a distributed recommender like this:

https://raw.githubusercontent.com/jayunit100/bigpetstore/master/src/main/java/org/bigtop/bigpetstore/clustering/BPSRecommnder.java

And it's not working.

Any thoughts on why?  The error message is simply that intermediate data
sets don't exist (i.e. numUsers.bin or /tmp/preparePreferencesMatrix...).

Basically it's clear that the intermediate jobs are failing, but I can't see
any reason why they would fail, and I don't see any meaningful stack
traces.

I've found a lot of good whitepapers and stuff on how the algorithms work,
but it's not clear what is really done for me by mahout, and what I have to
do on my own for the distributed recommender APIs.





Re: The 3 distributed recommenders

2014-03-28 Thread Sebastian Schelter

Hi Jay,

there's not much documentation unfortunately. We're in the process of 
creating that however. We removed the pseudo-distributed recommender, 
mainly because nobody ever used it. There are two research papers that 
could help you with understanding the other two distributed recommenders:


For ALS:

Distributed Matrix Factorization with MapReduce using a series of 
Broadcast-Joins, RecSys'13


http://ssc.io/wp-content/uploads/2011/12/sys024-schelter.pdf

For item-based:

Scalable Similarity-Based Neighborhood Methods with MapReduce, RecSys'12

http://ssc.io/wp-content/uploads/2012/06/rec11-schelter.pdf


On 03/28/2014 02:04 PM, Jay Vyas wrote:

Hi mahout:

Looking through the source code there are 3 distributed recommenders...

the als recommender
the item recommender
the pseudo recommender

Any docs differentiating these?





Number of features for ALS

2014-03-27 Thread Sebastian Schelter

Hi,

does anyone know of a principled approach of choosing the number of 
features for ALS (other than cross-validation)?


--sebastian


Re: Does Recommender System Overview Demo work?

2014-03-24 Thread Sebastian Schelter

Hi Bhargav,

you are right, the content on the page is outdated and contains some 
errors. I've created a jira ticket to fix this [1].


Thank you for reporting the problem!

[1] https://issues.apache.org/jira/browse/MAHOUT-1485


On 03/24/2014 04:41 AM, Bhargav Golla wrote:

Hi

I was wondering if the demo existing at
https://mahout.apache.org/users/recommender/recommender-documentation.html still
works. I don't find a webapp directory in integration/, and hence even
after I add the jetty plugin to the pom.xml in integration/, it is throwing an
exception.

Bhargav Golla
Committer, ASF
Github http://www.github.com/bhargavgolla |
LinkedINhttp://www.linkedin.com/in/bhargavgolla
  | Website http://www.bhargavgolla.com/





Re: Does Recommender System Overview Demo work?

2014-03-24 Thread Sebastian Schelter
The webapp in Mahout does not offer much functionality. If you'd like to 
use Mahout via a web interface, I suggest you either use predictionIO [1] 
or kornakapi [2].


Best,
Sebastian


[1] http://prediction.io
[2] http://ssc.io/a-recommendation-webservice-in-10-minutes/

On 03/24/2014 02:29 PM, Bhargav Golla wrote:

Hi Sebastian

Thanks for letting me know. I was wondering if it was removed only in the 0.9
version. Can I check out the 0.8 branch and use the webapp in that branch?

Bhargav Golla
Developer. Freelancer.
Github http://www.github.com/bhargavgolla |
LinkedINhttp://www.linkedin.com/in/bhargavgolla
  | Website http://www.bhargavgolla.com/


On Mon, Mar 24, 2014 at 2:12 AM, Sebastian Schelter s...@apache.org wrote:


Hi Bhargav,

you are right, the content on the page is outdated and contains some
errors. I've created a jira ticket to fix this [1].

Thank you for reporting the problem!

[1] https://issues.apache.org/jira/browse/MAHOUT-1485



On 03/24/2014 04:41 AM, Bhargav Golla wrote:


Hi

I was wondering if the demo existing at
https://mahout.apache.org/users/recommender/recommender-documentation.html still
works. I don't find a webapp directory in integration/, and hence even
after I add the jetty plugin to the pom.xml in integration/, it is throwing an
exception.

Bhargav Golla
Committer, ASF
Github http://www.github.com/bhargavgolla |
LinkedINhttp://www.linkedin.com/in/bhargavgolla
   | Website http://www.bhargavgolla.com/










Re: Does Recommender System Overview Demo work?

2014-03-24 Thread Sebastian Schelter

Would be great to have such an overview on the mahout website.

On 03/24/2014 03:18 PM, Jay Vyas wrote:

I've tried to start disambiguating the difference between mahout
distributed vs local tutorials here, because I've found it causes problems
for a lot of people (including me):

http://jayunit100.blogspot.com/2014/02/a-few-nice-posts-about-distirbuted.html

anyone want to collaborate on a two table wiki page which links to
tutorials about distributed vs single node implementations of all
algorithms?


On Mon, Mar 24, 2014 at 10:00 AM, Suneel Marthi suneel_mar...@yahoo.comwrote:


It was removed in 0.9, and I am not sure if it was there in 0.8. I vaguely
remember removing it in 0.9 based on a conversation with Manuel on user@.
Manuel, if you could chime in here.





On Monday, March 24, 2014 9:44 AM, Sebastian Schelter s...@apache.org
wrote:

The webapp in Mahout does not offer much functionality. If you'd like to
use Mahout via a webinterface, I suggest you either use predictionIO [1]
or kornakapi [2].

Best,
Sebastian


[1] http://prediction.io
[2] http://ssc.io/a-recommendation-webservice-in-10-minutes/


On 03/24/2014 02:29 PM, Bhargav Golla wrote:

Hi Sebastian

Thanks for letting me know. I was wondering if it was removed only in the 0.9
version. Can I check out the 0.8 branch and use the webapp in that branch?

Bhargav Golla
Developer. Freelancer.
Github http://www.github.com/bhargavgolla |
LinkedINhttp://www.linkedin.com/in/bhargavgolla
   | Website http://www.bhargavgolla.com/


On Mon, Mar 24, 2014 at 2:12 AM, Sebastian Schelter s...@apache.org

wrote:



Hi Bhargav,

you are right, the content on the page is outdated and contains some
errors. I've created a jira ticket to fix this [1].

Thank you for reporting the problem!

[1] https://issues.apache.org/jira/browse/MAHOUT-1485



On 03/24/2014 04:41 AM, Bhargav Golla wrote:


Hi

I was wondering if the demo existing at
https://mahout.apache.org/users/recommender/recommender-documentation.html still
works. I don't find a webapp directory in integration/, and hence even
after I add the jetty plugin to the pom.xml in integration/, it is
throwing an exception.

Bhargav Golla
Committer, ASF
Github http://www.github.com/bhargavgolla |
LinkedINhttp://www.linkedin.com/in/bhargavgolla
| Website http://www.bhargavgolla.com/
















Re: Problem with K-Means clustering on Amazon EMR

2014-03-23 Thread Sebastian Schelter

Hi Konstantin,

Great to see that you located the error. Could you open a jira issue and 
submit a patch that contains an updated error message?


Thank you,
Sebastian

On 03/23/2014 02:57 PM, Konstantin Slisenko wrote:

Hi!

I investigated the situation. RandomSeedGenerator (
http://grepcode.com/file/repo1.maven.org/maven2/org.apache.mahout/mahout-core/0.8/org/apache/mahout/clustering/kmeans/RandomSeedGenerator.java?av=f)
has the following code:

FileSystem fs = FileSystem.get(output.toUri(), conf);

...

fs.getFileStatus(input).isDir()

The FileSystem object was created from the output path, which was not specified
correctly by me (I didn't use the s3:// prefix for this path). Afterwards,
getFileStatus took the input path as its parameter, which was correct. This
caused the misunderstanding.

To prevent this misunderstanding, I propose to improve the error message by
adding the following details:
1. Specify which file system type is used (DistributedFileSystem,
NativeS3FileSystem, etc., using fs.getClass().getName())
2. Then specify which path cannot be processed correctly.

This can be done by a validation utility which can be applied in many places
in Mahout. When we use Mahout we need to specify many paths, and we can also
use many types of file systems: local for debugging, distributed on Hadoop,
and S3 on Amazon. In this case better error messages can save much time. I
think that refactoring is not needed for this case.
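
A sketch of what such a helper could look like (the class name and message
format are just a proposal):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PathValidator {

  // Resolve the file system from the path itself (instead of the default
  // FileSystem.get(conf)) and report which implementation was used.
  public static void validate(Path path, Configuration conf) throws IOException {
    FileSystem fs = FileSystem.get(path.toUri(), conf);
    if (!fs.exists(path)) {
      throw new IOException("Path " + path + " not found on file system "
          + fs.getClass().getName() + " (" + fs.getUri() + ")");
    }
  }
}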

2014-03-16 22:19 GMT+03:00 Jay Vyas jayunit...@gmail.com:


I agree it's best to be explicit when creating filesystem instances by using
the two-argument get(...); it's time to update to the FileSystem 2.0 APIs. Can
you file a Jira for this? If not I will :)


On Mar 16, 2014, at 12:37 PM, Sebastian Schelter s...@apache.org wrote:

I've also encountered a similar error once. It's really just the

FileSystem.get call that needs to be modified. I think it's a good idea to
walk through the codebase and refactor this where necessary.


--sebastian



On 03/16/2014 05:16 PM, Andrew Musselman wrote:
Another wild guess, I've had issues trying to use the 's3' protocol

from Hadoop and got things working by using the 's3n' protocol instead.



On Mar 16, 2014, at 8:41 AM, Jay Vyas jayunit...@gmail.com wrote:

I specifically have fixed mapreduce jobs by doing what the error message
suggests.

But maybe (hopefully) there is another workaround that is configuration
driven.

Just a hunch, but maybe mahout needs to be refactored to create fs objects
using the get(uri, conf) calls?

As hadoop evolves to support different flavors of HCFS, using API calls that
are more flexible (i.e. like the fs.get(uri, conf) one) will probably be a
good thing to keep in mind.



On Mar 16, 2014, at 9:22 AM, Frank Scholten fr...@frankscholten.nl

wrote:


Hi Konstantin,

Good to hear from you.

The link you mentioned points to EigenSeedGenerator not
RandomSeedGenerator. The problem seems to be with the call to

fs.getFileStatus(input).isDir()


It's been a while and I don't remember, but perhaps you have to set
additional Hadoop fs properties to use S3. See
https://wiki.apache.org/hadoop/AmazonS3. Perhaps you can isolate the cause of
this by creating a small Java main app with that line of code and run it in
the debugger.

Cheers,

Frank



On Sun, Mar 16, 2014 at 12:07 PM, Konstantin Slisenko
kslise...@gmail.com wrote:


Hello!

I run a text-documents clustering on Hadoop cluster in Amazon

Elastic Map

Reduce. As input and output I use S3 Amazon file system. I specify

all

paths as s3://bucket-name/folder-name.

SparceVectorsFromSequenceFile works correctly with S3
but when I start K-Means clustering job, I get this error:

Exception in thread "main" java.lang.IllegalArgumentException: This
file system object (hdfs://172.31.41.65:9000) does not support access
to the request path
's3://by.kslisenko.bigdata/stackovweflow-small/out_new/sparse/tfidf-vectors'
You possibly called FileSystem.get(conf) when you should have called
FileSystem.get(uri, conf) to obtain a file system supporting your path.

   at

org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:375)

   at


org.apache.hadoop.hdfs.DistributedFileSystem.checkPath(DistributedFileSystem.java:106)

   at


org.apache.hadoop.hdfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:162)

   at


org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:530)

   at


org.apache.mahout.clustering.kmeans.RandomSeedGenerator.buildRandom(RandomSeedGenerator.java:76)

   at


org.apache.mahout.clustering.kmeans.KMeansDriver.run(KMeansDriver.java:93)

   at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
   at


bbuzz2011.stackoverflow.runner.RunnerWithInParams.cluster(RunnerWithInParams.java:121)

   at


bbuzz2011.stackoverflow.runner.RunnerWithInParams.run(RunnerWithInParams.java:52)
   at


bbuzz2011.stackoverflow.runner.RunnerWithInParams.main(RunnerWithInParams.java:41

Documentation, documentation, documentation

2014-03-22 Thread Sebastian Schelter

Hi,

It's great to see a lot of work being spent on cleaning up the website. 
I think we have already done a great job here, but there are still a few 
more pages that need work.


I created a jira issue for every single page that needs some work, would 
be awesome if we could find enough volunteers to finish this quickly.


If you wanna take a ticket, write a comment that you start work on it, 
go through the website, check it for dead links and formatting errors 
and try out examples that are listed with the current release to see if 
everything still works. Either attach a textfile containing a new 
version of the page to the issue or add a comment on the issue that 
details the fix that you want to see (e.g. remove link ... because it 
is dead.)


Here's an overview of the tickets:

MAHOUT-1471 Clean up website on Canopy Clustering
MAHOUT-1472 Clean up website on Fuzzy k-Means
MAHOUT-1473 Clean up website on Spectral Clustering
MAHOUT-1474 Add Seinfeld clustering example

MAHOUT-1475 Clean up website on Naive Bayes
MAHOUT-1476 Clean up website on Hidden Markov Models
MAHOUT-1477 Clean up website on Logistic Regression
MAHOUT-1478 Clean up website on Random Forests
MAHOUT-1479 Clean up website on wikipedia example
MAHOUT-1480 Clean up website on 20 newsgroups
MAHOUT-1481 Clean up website on breiman example

MAHOUT-1482 Rework quickstart website

I would kindly ask Shannon to take 1473, Frank 1474 and Frank or Ted 1477.

Let's quickly finish the work on documenting what we have, so we can 
move on to new and exciting developments in Mahout!


--sebastian




Re: Documentation, documentation, documentation

2014-03-22 Thread Sebastian Schelter

Sorry, I seem to have overlooked this.
Could you move the canopy cleanup to 1471?

Thank you.

On 03/22/2014 04:54 PM, Pavan Kumar N wrote:

I have already added the canopy clustering cleanup as part of jira 1450, and
also created a new issue for adding streaming k-means.
On Mar 22, 2014 8:37 PM, Sebastian Schelter s...@apache.org wrote:


Hi,

It's great to see a lot of work being spent on cleaning up the website. I
think we have already done a great job here, but there are still a few more
pages that need work.

I created a jira issue for every single page that needs some work, would
be awesome if we could find enough volunteers to finish this quickly.

If you wanna take a ticket, write a comment that you start work on it, go
through the website, check it for dead links and formatting errors and try
out examples that are listed with the current release to see if everything
still works. Either attach a textfile containing a new version of the page
to the issue or add a comment on the issue that details the fix that you
want to see (e.g. remove link ... because it is dead.)

Here's an overview of the tickets:

MAHOUT-1471 Clean up website on Canopy Clustering
MAHOUT-1472 Clean up website on Fuzzy k-Means
MAHOUT-1473 Clean up website on Spectral Clustering
MAHOUT-1474 Add Seinfeld clustering example

MAHOUT-1475 Clean up website on Naive Bayes
MAHOUT-1476 Clean up website on Hidden Markov Models
MAHOUT-1477 Clean up website on Logistic Regression
MAHOUT-1478 Clean up website on Random Forests
MAHOUT-1479 Clean up website on wikipedia example
MAHOUT-1480 Clean up website on 20 newsgroups
MAHOUT-1481 Clean up website on breiman example

MAHOUT-1482 Rework quickstart website

I would kindly ask Shannon to take 1473, Frank 1474 and Frank or Ted 1477.

Let's quickly finish the work on documenting what we have, so we can move
on to new and exciting developments in Mahout!

--sebastian









Re: Problem with K-Means clustering on Amazon EMR

2014-03-16 Thread Sebastian Schelter
I've also encountered a similar error once. It's really just the 
FileSystem.get call that needs to be modified. I think it's a good idea 
to walk through the codebase and refactor this where necessary.


--sebastian


On 03/16/2014 05:16 PM, Andrew Musselman wrote:

Another wild guess, I've had issues trying to use the 's3' protocol from Hadoop 
and got things working by using the 's3n' protocol instead.


On Mar 16, 2014, at 8:41 AM, Jay Vyas jayunit...@gmail.com wrote:

I specifically have fixed mapreduce jobs by doing what the error message 
suggests.

But maybe (hopefully) there is another workaround that is configuration driven.

Just a hunch, but maybe mahout needs to be refactored to create fs objects 
using the get(uri, conf) calls?

As hadoop evolves to support different flavors of HCFS, using API 
calls that are more flexible (i.e. like the fs.get(uri, conf) one) will 
probably be a good thing to keep in mind.


On Mar 16, 2014, at 9:22 AM, Frank Scholten fr...@frankscholten.nl wrote:

Hi Konstantin,

Good to hear from you.

The link you mentioned points to EigenSeedGenerator not
RandomSeedGenerator. The problem seems to be with the call to

fs.getFileStatus(input).isDir()


It's been a while and I don't remember but perhaps you have to set
additional Hadoop fs properties to use S3. See
https://wiki.apache.org/hadoop/AmazonS3. Perhaps you can isolate the cause of
this by creating a small Java main app with that line of code and run it in
the debugger.

Cheers,

Frank



On Sun, Mar 16, 2014 at 12:07 PM, Konstantin Slisenko
kslise...@gmail.com wrote:


Hello!

I run a text-documents clustering on Hadoop cluster in Amazon Elastic Map
Reduce. As input and output I use S3 Amazon file system. I specify all
paths as s3://bucket-name/folder-name.

SparceVectorsFromSequenceFile works correctly with S3
but when I start K-Means clustering job, I get this error:

Exception in thread "main" java.lang.IllegalArgumentException: This
file system object (hdfs://172.31.41.65:9000) does not support access
to the request path

's3://by.kslisenko.bigdata/stackovweflow-small/out_new/sparse/tfidf-vectors'
You possibly called FileSystem.get(conf) when you should have called
FileSystem.get(uri, conf) to obtain a file system supporting your
path.

   at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:375)
   at
org.apache.hadoop.hdfs.DistributedFileSystem.checkPath(DistributedFileSystem.java:106)
   at
org.apache.hadoop.hdfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:162)
   at
org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:530)
   at
org.apache.mahout.clustering.kmeans.RandomSeedGenerator.buildRandom(RandomSeedGenerator.java:76)
   at
org.apache.mahout.clustering.kmeans.KMeansDriver.run(KMeansDriver.java:93)
   at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
   at
bbuzz2011.stackoverflow.runner.RunnerWithInParams.cluster(RunnerWithInParams.java:121)
   at
bbuzz2011.stackoverflow.runner.RunnerWithInParams.run(RunnerWithInParams.java:52)
   at
bbuzz2011.stackoverflow.runner.RunnerWithInParams.main(RunnerWithInParams.java:41)
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
   at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
   at java.lang.reflect.Method.invoke(Method.java:597)
   at org.apache.hadoop.util.RunJar.main(RunJar.java:156)


I checked RandomSeedGenerator.buildRandom
(
http://grepcode.com/file/repo1.maven.org/maven2/org.apache.mahout/mahout-core/0.8/org/apache/mahout/clustering/kmeans/EigenSeedGenerator.java?av=f
)
and I assume it has correct code:

FileSystem fs = FileSystem.get(output.toUri(), conf);


I cannot run clustering because of this error. Do you have any
ideas how to fix this?





Re: Compiling Mahout with maven in Eclipse

2014-03-13 Thread Sebastian Schelter

Maven should generate the classes automatically. Have you tried running

mvn -DskipTests clean install

on the commandline?



On 03/13/2014 09:50 AM, Kevin Moulart wrote:

How can I generate them to make these errors go away then? Or don't I have
to?

Kévin Moulart


2014-03-13 9:17 GMT+01:00 Sebastian Schelter ssc.o...@googlemail.com:


Those are autogenerated.


On 03/13/2014 09:05 AM, Kevin Moulart wrote:


Ok it does compile with maven in eclipse as well, but still, many imports
are not recognized in the sources:

- import org.apache.mahout.math.function.IntObjectProcedure;
- import org.apache.mahout.math.map.OpenIntLongHashMap;
- import org.apache.mahout.math.map.OpenIntObjectHashMap;
- import org.apache.mahout.math.set.OpenIntHashSet;
- import org.apache.mahout.math.list.DoubleArrayList;
...

Pretty much all the problems come from the OpenInt... classes that it
doesn't seem to find. Is there a jar or a pom entry I need to add here?
Or do I have the wrong version of org.apache.mahout.math, because I can't
find those maps/sets/lists in the math package?

(I have the same problem on both my windows, centos and mac os)

Kévin Moulart


2014-03-12 17:00 GMT+01:00 Kevin Moulart kevinmoul...@gmail.com:

 Never mind, I found where the problem lay: I deleted the full content of
.m2 and retried it as a non-root user, and it worked. Trying in Eclipse now,
with tests; I'll let you know if it doesn't work.

Kévin Moulart


2014-03-12 16:45 GMT+01:00 Kevin Moulart kevinmoul...@gmail.com:

Hi,



I tried to fix all the problems I had configuring eclipse in order to
compile mahout in it, using maven clean package as the goal.

First I had to make a change in mahout core in the class GroupTree.java,
line 171:

  stack = new ArrayDeque<GroupTree>();





Then I tried compiling with eclipse (I already had the plugin and all
imported and I'm working on the trunk version).

  From eclipse it runs until it tries compiling the examples:

  [INFO] Building jar:

/home/myCompany/Workspace_eclipse/mahout-trunk/examples/
target/mahout-examples-1.0-SNAPSHOT-job.jar
[INFO]


[INFO] Reactor Summary:
[INFO]
[INFO] Mahout Build Tools  SUCCESS [
   1.173 s]
[INFO] Apache Mahout . SUCCESS [
   0.307 s]
[INFO] Mahout Math ... SUCCESS [
   8.041 s]
[INFO] Mahout Core ... SUCCESS [
   8.378 s]
[INFO] Mahout Integration  SUCCESS [
   1.030 s]
[INFO] Mahout Examples ... FAILURE [
   5.325 s]
[INFO] Mahout Release Package  SKIPPED
[INFO] Mahout Math/Scala wrappers  SKIPPED
[INFO] Mahout Spark bindings . SKIPPED
[INFO]


[INFO] BUILD FAILURE
[INFO]


[INFO] Total time: 24.630 s
[INFO] Finished at: 2014-03-12T16:38:08+01:00
[INFO] Final Memory: 101M/1430M
[INFO]


[ERROR] Failed to execute goal
org.apache.maven.plugins:maven-assembly-plugin:2.4:single (job) on
project
mahout-examples: Failed to create assembly: Error creating assembly
archive
job: IOException when zipping com/ibm/icu/ICUConfig.properties:
invalid LOC
header (bad signature) - [Help 1]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with
the
-e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions,
please read the following articles:
[ERROR] [Help 1]
http://cwiki.apache.org/confluence/display/MAVEN/
MojoExecutionException
[ERROR]
[ERROR] After correcting the problems, you can resume the build with
the
command
[ERROR]   mvn goals -rf :mahout-examples




It does the exact same thing when I try typing mvn clean package in the
terminal, but when I try it as root, it works. So it might be an issue with
the permissions; however, I fail to see where (I did a chown -R on my entire
home folder just to be on the safe side and it still fails).

Anyone had the same problem? Any idea about how to fix it?

Kévin Moulart















Re: Compiling Mahout with maven in Eclipse

2014-03-13 Thread Sebastian Schelter

Are you executing maven in the topmost directory?

On 03/13/2014 10:09 AM, Kevin Moulart wrote:

I did, but then it fails because of these missing files:
https://gist.github.com/kmoulart/9524828

Kévin Moulart


2014-03-13 9:57 GMT+01:00 Sebastian Schelter s...@apache.org:


Maven should generate the classes automatically. Have you tried running

mvn -DskipTests clean install

on the commandline?




On 03/13/2014 09:50 AM, Kevin Moulart wrote:


How can I generate them to make these errors go away then? Or don't I
have to?

Kévin Moulart


2014-03-13 9:17 GMT+01:00 Sebastian Schelter ssc.o...@googlemail.com:

  Those are autogenerated.



On 03/13/2014 09:05 AM, Kevin Moulart wrote:

Ok it does compile with maven in eclipse as well, but still, many imports
are not recognized in the sources:

- import org.apache.mahout.math.function.IntObjectProcedure;
- import org.apache.mahout.math.map.OpenIntLongHashMap;
- import org.apache.mahout.math.map.OpenIntObjectHashMap;
- import org.apache.mahout.math.set.OpenIntHashSet;
- import org.apache.mahout.math.list.DoubleArrayList;
...

Pretty much all the problems come from the OpenInt... classes that it
doesn't seem to find. Is there a jar or a pom entry I need to add here?
Or do I have the wrong version of org.apache.mahout.math, because I
can't find those maps/sets/lists in the math package?

(I have the same problem on both my windows, centos and mac os)

Kévin Moulart


2014-03-12 17:00 GMT+01:00 Kevin Moulart kevinmoul...@gmail.com:

   Never mind, I found where the problem lay: I deleted the full content of
.m2 and retried it as a non-root user, and it worked. Trying in Eclipse now,
with tests; I'll let you know if it doesn't work.

Kévin Moulart


2014-03-12 16:45 GMT+01:00 Kevin Moulart kevinmoul...@gmail.com:

Hi,



I tried to fix all the problems I had configuring eclipse in order to
compile mahout in it, using maven clean package as the goal.

First I had to make a change in mahout core in the class
GroupTree.java, line 171:

   stack = new ArrayDeque<GroupTree>();






Then I tried compiling with eclipse (I already had the plugin and all
imported and I'm working on the trunk version).

   From eclipse it runs until it tries compiling the examples:

   [INFO] Building jar:


/home/myCompany/Workspace_eclipse/mahout-trunk/examples/
target/mahout-examples-1.0-SNAPSHOT-job.jar
[INFO]


[INFO] Reactor Summary:
[INFO]
[INFO] Mahout Build Tools  SUCCESS [
1.173 s]
[INFO] Apache Mahout . SUCCESS [
0.307 s]
[INFO] Mahout Math ... SUCCESS [
8.041 s]
[INFO] Mahout Core ... SUCCESS [
8.378 s]
[INFO] Mahout Integration  SUCCESS [
1.030 s]
[INFO] Mahout Examples ... FAILURE [
5.325 s]
[INFO] Mahout Release Package  SKIPPED
[INFO] Mahout Math/Scala wrappers  SKIPPED
[INFO] Mahout Spark bindings . SKIPPED
[INFO]


[INFO] BUILD FAILURE
[INFO]


[INFO] Total time: 24.630 s
[INFO] Finished at: 2014-03-12T16:38:08+01:00
[INFO] Final Memory: 101M/1430M
[INFO]


[ERROR] Failed to execute goal
org.apache.maven.plugins:maven-assembly-plugin:2.4:single (job) on
project
mahout-examples: Failed to create assembly: Error creating assembly
archive
job: IOException when zipping com/ibm/icu/ICUConfig.properties:
invalid LOC
header (bad signature) - [Help 1]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with
the
-e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug
logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions,
please read the following articles:
[ERROR] [Help 1]
http://cwiki.apache.org/confluence/display/MAVEN/
MojoExecutionException
[ERROR]
[ERROR] After correcting the problems, you can resume the build with
the
command
[ERROR]   mvn goals -rf :mahout-examples




It does the exact same thing when I try typing mvn clean package in the
terminal, but when I try it as root, it works. So it might be an issue with
the permissions; however, I fail to see where (I did a chown -R on my entire
home folder just to be on the safe side and it still fails).

Anyone had the same problem? Any idea about how to fix it?

Kévin Moulart




















Re: verbose output

2014-03-13 Thread Sebastian Schelter
To my knowledge, there is no such flag for mahout. You can check 
hadoop's logs for further information, however.


On 03/13/2014 10:21 AM, Mahmood Naderan wrote:

Hi,
Is there any verbosity flag for hadoop and mahout commands? I cannot find such
a thing in the command line.


Regards,
Mahmood





Re: Website, urgent help needed

2014-03-13 Thread Sebastian Schelter

Hi Scott,

Create a jira ticket and attach your scripts and a text version of the 
page there.


Best,
Sebastian


On 03/12/2014 03:27 PM, Scott C. Cote wrote:

I took the tour of the text analysis and pushed through despite the
problems on the page.  Committers helped me over the hump where others
might have just given up (to your point).
When I did it, I made shell scripts so that my steps would be repeatable
with an anticipation of updating the page.

Unfortunately, I gave up on trying to figure out how to update the page
(there were links indicating that I could do it), and I didn't want to
appear to be stupid asking how to update the documentation (my bad - not
anyone else).  Now I know that it was not possible unless I was a committer.

Who should I send my scripts to, or how should I proceed with a current
form of the page?

SCott

On 3/12/14, 5:02 AM, Sebastian Schelter s...@apache.org wrote:


Hi Pavan,

Awesome that you're willing to help. The documentation consists of the pages
listed under Clustering in the navigation bar at mahout.apache.org.

If you start working on one of the pages listed there (e.g. the k-Means
doc), please create a jira ticket in our issue tracker with a title along
the lines of "Cleaning up the documentation for k-Means on the website".

Put a list of errors and corrections into the jira and I (or some other
committer) will make sure to fix the website.

Thanks,
Sebastian


On 03/12/2014 08:48 AM, Pavan Kumar N wrote:

I'll help with the clustering algorithms documentation. Do send me the old
documentation and I will check and remove errors, or better, let me know
how to proceed.

Pavan
On Mar 12, 2014 12:35 PM, Sebastian Schelter s...@apache.org wrote:


Hi,

As you've probably noticed, I've put in a lot of effort over the last
days
to kickstart cleaning up our website. I've thrown out a lot of stuff
and
have been startled by the amount of outdated and incorrect information
on
our website, as well as links pointing to nowhere.

I think our lack of documentation makes it superhard to use Mahout for
new
people. A crucial next step is to clean up the documentation on
classification and clustering. I cannot do this alone, because I don't
have
the time and I'm not so familiar with the background of the algorithms.

I need volunteers to go through all the pages under Classification
and
Clustering on the website. For the algorithms, the content and
claims of
the articles need to be checked, for the examples we need to make sure
that
everything still works as described. It would also be great to move
articles from personal blogs to our website.

Imagine that some developer wants to try out Mahout and takes one hour
for
that in the evening. She will go to our website, download Mahout, read
the
description of an algorithm and try to run an example. In the current
state
of the documentation, I'm afraid that most people will walk away
frustrated, because the website does not help them as it should.

Best,
Sebastian

PS: I will make my standpoint on whether Mahout should do a 1.0 release
depend on whether we manage to clean up and maintain our documentation.












Re: Problem with FileSystem in Kmeans

2014-03-12 Thread Sebastian Schelter

Hi Bikash,

Have you tried adding hdfs:// to your input path? Maybe that helps.
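
Something along these lines, as a rough sketch -- the namenode host/port is a
placeholder, and fs.default.name is the Hadoop 1.x key (on Hadoop 2.x it is
fs.defaultFS):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.Path;

  Configuration conf = new Configuration();
  // make HDFS the default filesystem, so relative paths resolve there too
  conf.set("fs.default.name", "hdfs://namenode:8020");
  // or fully qualify every path explicitly:
  Path input = new Path("hdfs://namenode:8020/2/sequence");
  Path clustersIn = new Path("hdfs://namenode:8020/3/clusters-0-final");
  Path output = new Path("hdfs://namenode:8020/5");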

--sebastian

On 03/11/2014 11:22 AM, Bikash Gupta wrote:

Hi,

I am running Kmeans in a cluster where I am setting the configuration of
fs.hdfs.impl and fs.file.impl beforehand, as mentioned below:

conf.set(fs.hdfs.impl,org.apache.hadoop.hdfs.DistributedFileSystem.class.getName());
conf.set(fs.file.impl,org.apache.hadoop.fs.LocalFileSystem.class.getName());

The problem is that the cluster-0 directory is getting created in the local file system
and cluster-1 is getting created in HDFS, and the Kmeans map-reduce job is
unable to find cluster-0. Please see the stacktrace below:

2014-03-11 14:52:15 o.a.m.c.AbstractJob [INFO] Command line arguments:
{--clustering=null, --clusters=[/3/clusters-0-final],
--convergenceDelta=[0.1],
--distanceMeasure=[org.apache.mahout.common.distance.EuclideanDistanceMeasure],
--endPhase=[2147483647], --input=[/2/sequence], --maxIter=[100],
--method=[mapreduce], --output=[/5], --overwrite=null, --startPhase=[0],
--tempDir=[temp]}
2014-03-11 14:52:15 o.a.h.u.NativeCodeLoader [WARN] Unable to load
native-hadoop library for your platform... using builtin-java classes where
applicable
2014-03-11 14:52:15 o.a.m.c.k.KMeansDriver [INFO] Input: /2/sequence
Clusters In: /3/clusters-0-final Out: /5
2014-03-11 14:52:15 o.a.m.c.k.KMeansDriver [INFO] convergence: 0.1 max
Iterations: 100
2014-03-11 14:52:16 o.a.h.m.JobClient [WARN] Use GenericOptionsParser for
parsing the arguments. Applications should implement Tool for the same.
2014-03-11 14:52:17 o.a.h.m.l.i.FileInputFormat [INFO] Total input paths to
process : 3
2014-03-11 14:52:19 o.a.h.m.JobClient [INFO] Running job:
job_201403111332_0011
2014-03-11 14:52:20 o.a.h.m.JobClient [INFO]  map 0% reduce 0%
2014-03-11 14:52:28 o.a.h.m.JobClient [INFO] Task Id :
attempt_201403111332_0011_m_00_0, Status : FAILED
2014-03-11 14:52:28 STDIO [ERROR] java.lang.IllegalStateException:
/5/clusters-0
 at
org.apache.mahout.common.iterator.sequencefile.SequenceFileDirValueIterable.iterator(SequenceFileDirValueIterable.java:78)
 at
org.apache.mahout.clustering.classify.ClusterClassifier.readFromSeqFiles(ClusterClassifier.java:208)
 at
org.apache.mahout.clustering.iterator.CIMapper.setup(CIMapper.java:44)
 at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:138)
 at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:672)
 at org.apache.hadoop.mapred.MapTask.run(MapTask.java:330)
 at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:415)
 at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1438)
 at org.apache.hadoop.mapred.Child.main(Child.java:262)
Caused by: java.io.FileNotFoundException: File /5/clusters-0

Please suggest!!!






Website, urgent help needed

2014-03-12 Thread Sebastian Schelter

Hi,

As you've probably noticed, I've put in a lot of effort over the last 
days to kickstart cleaning up our website. I've thrown out a lot of 
stuff and have been startled by the amount of outdated and incorrect
information on our website, as well as links pointing to nowhere.


I think our lack of documentation makes it super hard to use Mahout for
new people. A crucial next step is to clean up the documentation on 
classification and clustering. I cannot do this alone, because I don't 
have the time and I'm not so familiar with the background of the algorithms.


I need volunteers to go through all the pages under Classification and 
Clustering on the website. For the algorithms, the content and claims 
of the articles need to be checked, for the examples we need to make 
sure that everything still works as described. It would also be great to 
move articles from personal blogs to our website.


Imagine that some developer wants to try out Mahout and takes one hour 
for that in the evening. She will go to our website, download Mahout, 
read the description of an algorithm and try to run an example. In the 
current state of the documentation, I'm afraid that most people will 
walk away frustrated, because the website does not help them as it should.


Best,
Sebastian

PS: I will make my standpoint on whether Mahout should do a 1.0 release 
depend on whether we manage to clean up and maintain our documentation.


Re: Website, urgent help needed

2014-03-12 Thread Sebastian Schelter
We don't exactly have that page, but we have pages that touch parts of 
it, such as 
https://mahout.apache.org/users/basics/creating-vectors-from-text.html


It would be great if you could create a jira ticket which lists the 
errors. I'll fix them then.


Best,
Sebastian

On 03/12/2014 08:42 AM, Juan José Ramos wrote:

Hi Sebastian,
I am afraid I am only familiar with the recommendation part.

In previous posts, I pointed out a couple of errors in this wiki page:
https://cwiki.apache.org/confluence/display/MAHOUT/Quick+tour+of+text+analysis+using+the+Mahout+command+line

If you are planning to keep it on the new website, I can help point them out
again.

Thanks a lot for your effort.


On Wed, Mar 12, 2014 at 7:03 AM, Sebastian Schelter s...@apache.org wrote:


Hi,

As you've probably noticed, I've put in a lot of effort over the last days
to kickstart cleaning up our website. I've thrown out a lot of stuff and
have been startled by the amount of outdated and incorrect information on
our website, as well as links pointing to nowhere.

I think our lack of documentation makes it super hard to use Mahout for new
people. A crucial next step is to clean up the documentation on
classification and clustering. I cannot do this alone, because I don't have
the time and I'm not so familiar with the background of the algorithms.

I need volunteers to go through all the pages under Classification and
Clustering on the website. For the algorithms, the content and claims of
the articles need to be checked, for the examples we need to make sure that
everything still works as described. It would also be great to move
articles from personal blogs to our website.

Imagine that some developer wants to try out Mahout and takes one hour for
that in the evening. She will go to our website, download Mahout, read the
description of an algorithm and try to run an example. In the current state
of the documentation, I'm afraid that most people will walk away
frustrated, because the website does not help them as it should.

Best,
Sebastian

PS: I will make my standpoint on whether Mahout should do a 1.0 release
depend on whether we manage to clean up and maintain our documentation.







Re: Website, urgent help needed

2014-03-12 Thread Sebastian Schelter

Hi Pavan,

Awesome that you're willing to help. The documentation are the pages 
listed under Clustering in the navigation bar under mahout.apache.org


If you start working on one of the pages listed there (e.g. the k-Means 
doc), please create a jira ticket in our issue tracker with a title along
the lines of Cleaning up the documentation for k-Means on the website.


Put a list of errors and corrections into the jira and I (or some other 
committer) will make sure to fix the website.


Thanks,
Sebastian


On 03/12/2014 08:48 AM, Pavan Kumar N wrote:

I'll help with clustering algorithms documentation. Do send me the old
documentation and I will check and remove errors, or better, let me know
how to proceed.

Pavan
On Mar 12, 2014 12:35 PM, Sebastian Schelter s...@apache.org wrote:


Hi,

As you've probably noticed, I've put in a lot of effort over the last days
to kickstart cleaning up our website. I've thrown out a lot of stuff and
have been startled by the amount of outdated and incorrect information on
our website, as well as links pointing to nowhere.

I think our lack of documentation makes it super hard to use Mahout for new
people. A crucial next step is to clean up the documentation on
classification and clustering. I cannot do this alone, because I don't have
the time and I'm not so familiar with the background of the algorithms.

I need volunteers to go through all the pages under Classification and
Clustering on the website. For the algorithms, the content and claims of
the articles need to be checked, for the examples we need to make sure that
everything still works as described. It would also be great to move
articles from personal blogs to our website.

Imagine that some developer wants to try out Mahout and takes one hour for
that in the evening. She will go to our website, download Mahout, read the
description of an algorithm and try to run an example. In the current state
of the documentation, I'm afraid that most people will walk away
frustrated, because the website does not help them as it should.

Best,
Sebastian

PS: I will make my standpoint on whether Mahout should do a 1.0 release
depend on whether we manage to clean up and maintain our documentation.







Re: Website, urgent help needed

2014-03-12 Thread Sebastian Schelter

Hi Manoj,

Awesome that you're willing to help.

I suggest we proceed analogously to the clustering cleanup:

The documentation are the pages listed under Classification in the 
navigation bar under mahout.apache.org


If you start working on one of the pages listed there (e.g. the Naive 
Bayes doc), please create a jira ticket in our issue tracker with a title
along the lines of Cleaning up the documentation for Naive Bayes on the 
website.


Put a list of errors and corrections into the jira and I (or some other 
committer) will make sure to fix the website.


Best,
Sebastian

On 03/12/2014 09:05 AM, Manoj Awasthi wrote:

Thanks, Sebastian, to you and the others, for the effort in cleaning up the website
interface. It looks much better (fonts & layout) and much more usable, if I
may say.

I will be happy to volunteer for the pages under classification in whatever
ways I can. I would want to contribute especially on verifying that the
examples provided work in the form they exist on the website and will be
happy to do any corrections wherever possible.

If there is an initial backlog list which provides tasks at a granular level,
that would be great; or I can start looking at the pages myself.

Manoj



On Wed, Mar 12, 2014 at 12:33 PM, Sebastian Schelter s...@apache.org wrote:


Hi,

As you've probably noticed, I've put in a lot of effort over the last days
to kickstart cleaning up our website. I've thrown out a lot of stuff and
have been startled by the amount of outdated and incorrect information on
our website, as well as links pointing to nowhere.

I think our lack of documentation makes it super hard to use Mahout for new
people. A crucial next step is to clean up the documentation on
classification and clustering. I cannot do this alone, because I don't have
the time and I'm not so familiar with the background of the algorithms.

I need volunteers to go through all the pages under Classification and
Clustering on the website. For the algorithms, the content and claims of
the articles need to be checked, for the examples we need to make sure that
everything still works as described. It would also be great to move
articles from personal blogs to our website.

Imagine that some developer wants to try out Mahout and takes one hour for
that in the evening. She will go to our website, download Mahout, read the
description of an algorithm and try to run an example. In the current state
of the documentation, I'm afraid that most people will walk away
frustrated, because the website does not help them as it should.

Best,
Sebastian

PS: I will make my standpoint on whether Mahout should do a 1.0 release
depend on whether we manage to clean up and maintain our documentation.







Re: Website, urgent help needed

2014-03-12 Thread Sebastian Schelter

Hi Kevin,

Thank you for the offer to help! Feel free to ask questions here about how to
set up the sources in Eclipse. If you succeed, you could write up what you
did and we could add this to the website, as I'm sure a lot of others 
will have the same problem.


It would be great if you could start improving the javadoc; it's totally
fine if your English is not perfect, we can always ask a native speaker
to read over it. If you start working on the javadoc, please create a 
jira issue for that work before you start.


Best,
Sebastian



On 03/12/2014 09:30 AM, Kevin Moulart wrote:

I can confirm what Sebastian said. I'm fairly new at this and I found
myself so desperate at some point that I almost gave up on Mahout due to
the lack of documentation, but my feeling is that it doesn't only concern the
website: the API is too sparsely documented as well. At this point there is no
simple way for a beginner to know what kind of format any one of the
algorithms expects and what it outputs exactly, how to chain processes
etc... They might go as far as reading the javadoc (although not everyone
does that), but they won't all, as I had to and did, download the sources
and try making sense of them to get the information.

Fortunately the mailing list is particularly active and one can find the
answers if he has the time and will to search and ask kindly, which is a
great strength of Mahout, but the average beginner, wanting to just
try the library, can't and won't do that.

I'm willing to document the parts of the code I used and began to
understand; however, I've been facing difficulties setting up the maven
project in Eclipse for now. Also, since I'm Belgian, English is not my
mother tongue, so I'm almost certain to make mistakes, but I think it would
take you less time to correct these few English mistakes than to write
the documentation :)
I'll go ahead and try to set things up with Eclipse, and if I don't succeed
I'll write a mail to the dev list for help in that matter.

I also can, if I find the time, continue my efforts of reporting bugs and
broken or inaccurate links and descriptions on the website, if need be,
and update my JIRA entry accordingly.

Kévin Moulart


2014-03-12 8:48 GMT+01:00 Pavan Kumar N pavan.naraya...@gmail.com:


I'll help with clustering algorithms documentation. Do send me the old
documentation and I will check and remove errors, or better, let me know
how to proceed.

Pavan
On Mar 12, 2014 12:35 PM, Sebastian Schelter s...@apache.org wrote:


Hi,

As you've probably noticed, I've put in a lot of effort over the last

days

to kickstart cleaning up our website. I've thrown out a lot of stuff and
have been startled by the amount of outdated and incorrect information on
our website, as well as links pointing to nowhere.

I think our lack of documentation makes it super hard to use Mahout for

new

people. A crucial next step is to clean up the documentation on
classification and clustering. I cannot do this alone, because I don't

have

the time and I'm not so familiar with the background of the algorithms.

I need volunteers to go through all the pages under Classification and
Clustering on the website. For the algorithms, the content and claims

of

the articles need to be checked, for the examples we need to make sure

that

everything still works as described. It would also be great to move
articles from personal blogs to our website.

Imagine that some developer wants to try out Mahout and takes one hour

for

that in the evening. She will go to our website, download Mahout, read

the

description of an algorithm and try to run an example. In the current

state

of the documentation, I'm afraid that most people will walk away
frustrated, because the website does not help them as it should.

Best,
Sebastian

PS: I will make my standpoint on whether Mahout should do a 1.0 release
depend on whether we manage to clean up and maintain our documentation.









Re: Website, urgent help needed

2014-03-12 Thread Sebastian Schelter

Here you can see all issues (resolved and unresolved) for the next release:

https://issues.apache.org/jira/browse/MAHOUT-1413?jql=project%20%3D%20MAHOUT%20AND%20fixVersion%20%3D%201.0%20ORDER%20BY%20priority%20DESC

When you start to work on the cleanup of a page, make sure that there is 
no ticket existing for that. If there isn't, create a jira ticket with the
name of the page in the title.


--sebastian


On 03/12/2014 11:20 AM, pramit choudhary wrote:

Hi All,
 I would also like to participate in cleaning up the documentation.
Since I am fairly new to the Mahout infrastructure, it will in turn help
me understand things better. Do we already have a Jira ticket for
organizing the cleanup of the documentation?
Just want to be sure that I am not stepping on pages someone else has already
updated.

Thanks
Regards,
Pramit


On Wed, Mar 12, 2014 at 3:07 AM, Sebastian Schelter s...@apache.org wrote:


Hi Kevin,

Thank you for the offer to help! Feel free to ask questions here about how to set up
the sources in Eclipse. If you succeed, you could write up what you did and
we could add this to the website, as I'm sure a lot of others will have the
same problem.

It would be great if you could start improving the javadoc; it's totally
fine if your English is not perfect, we can always ask a native speaker to
read over it. If you start working on the javadoc, please create a jira
issue for that work before you start.

Best,
Sebastian




On 03/12/2014 09:30 AM, Kevin Moulart wrote:


I can confirm what Sebastian said. I'm fairly new at this and I found
myself so desperate at some point that I almost gave up on Mahout due to
the lack of documentation, but my feeling is that it doesn't only concern the
website: the API is too sparsely documented as well. At this point there is no
simple way for a beginner to know what kind of format any one of the
algorithms expects and what it outputs exactly, how to chain processes
etc... They might go as far as reading the javadoc (although not everyone
does that), but they won't all, as I had to and did, download the sources
and try making sense of them to get the information.

Fortunately the mailing list is particularly active and one can find the
answers if he has the time and will to search and ask kindly, which is a
great strength of Mahout, but the average beginner, wanting to just
try the library, can't and won't do that.

I'm willing to document the parts of the code I used and began to
understand; however, I've been facing difficulties setting up the maven
project in Eclipse for now. Also, since I'm Belgian, English is not my
mother tongue, so I'm almost certain to make mistakes, but I think it would
take you less time to correct these few English mistakes than to write
the documentation :)
I'll go ahead and try to set things up with Eclipse, and if I don't succeed
I'll write a mail to the dev list for help in that matter.

I also can, if I find the time, continue my efforts of reporting bugs and
broken or inaccurate links and descriptions on the website, if need be,
and update my JIRA entry accordingly.

Kévin Moulart


2014-03-12 8:48 GMT+01:00 Pavan Kumar N pavan.naraya...@gmail.com:

I'll help with clustering algorithms documentation. Do send me the old
documentation and I will check and remove errors, or better, let me know
how to proceed.

Pavan
On Mar 12, 2014 12:35 PM, Sebastian Schelter s...@apache.org wrote:

  Hi,


As you've probably noticed, I've put in a lot of effort over the last


days


to kickstart cleaning up our website. I've thrown out a lot of stuff and
have been startled by the amount of outdated and incorrect information on
our website, as well as links pointing to nowhere.

I think our lack of documentation makes it super hard to use Mahout for


new


people. A crucial next step is to clean up the documentation on
classification and clustering. I cannot do this alone, because I don't


have


the time and I'm not so familiar with the background of the algorithms.

I need volunteers to go through all the pages under Classification and
Clustering on the website. For the algorithms, the content and claims


of


the articles need to be checked, for the examples we need to make sure


that


everything still works as described. It would also be great to move
articles from personal blogs to our website.

Imagine that some developer wants to try out Mahout and takes one hour


for


that in the evening. She will go to our website, download Mahout, read


the


description of an algorithm and try to run an example. In the current


state


of the documentation, I'm afraid that most people will walk away
frustrated, because the website does not help them as it should.

Best,
Sebastian

PS: I will make my standpoint on whether Mahout should do a 1.0 release
depend on whether we manage to clean up and maintain our documentation.















Re: Few questions about SVM configuration in Mahout

2014-03-10 Thread Sebastian Schelter

Hi Quentin,

Mahout does not have SVMs.

Best,
Sebastian

On 03/10/2014 10:38 AM, Quentin-Gabriel Thurier wrote:

Hi all,

Just few questions about the configuration of an SVM in Mahout :

- Is it possible to do a multi-class classification ?
- Which kernels are already available (linear, polynomial, rbf) ?
- Where can we find details about the way the algorithm has been
distributed ?

Many thanks,

Quentin





Re: [blog post] Comparing Document Classification Functions of Lucene and Mahout

2014-03-09 Thread Sebastian Schelter
Hi Koji,

I've added a link to your article to our website:

https://mahout.apache.org/general/books-tutorials-and-talks.html

On 03/07/2014 03:29 AM, Koji Sekiguchi wrote:
 Hello,
 
 I just posted an article on Comparing Document Classification Functions
 of Lucene and Mahout.
 
 http://soleami.com/blog/comparing-document-classification-functions-of-lucene-and-mahout.html
 
 Comments are welcome. :)
 
 Thanks!
 
 koji
 



Re: Heap space

2014-03-09 Thread Sebastian Schelter
I usually do trial and error. Start with some very large value and do a
binary search :)


--sebastian

On 03/09/2014 01:30 PM, Mahmood Naderan wrote:

Excuse me, I added the -Xmx option and restarted the hadoop services using
sbin/stop-all.sh && sbin/start-all.sh

however I still get the heap size error. How can I find the correct and needed heap
size?


Regards,
Mahmood



On Sunday, March 9, 2014 1:37 PM, Mahmood Naderan nt_mahm...@yahoo.com wrote:

OK, I found that I have to add this property to mapred-site.xml:


<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx2048m</value>
</property>



Regards,
Mahmood




On Sunday, March 9, 2014 11:39 AM, Mahmood Naderan nt_mahm...@yahoo.com wrote:

Hello,
I ran this command

 ./bin/mahout wikipediaXMLSplitter -d 
examples/temp/enwiki-latest-pages-articles.xml -o wikipedia/chunks -c 64

but got this error
  Exception in thread main java.lang.OutOfMemoryError: Java heap space

There are many web pages regarding this and the solution is to add -Xmx 2048M for
example. My question is: that option should be passed to the java command and not to Mahout. As a result,
running ./bin/mahout -Xmx 2048M shows that there is no such option. What should I do?


Regards,
Mahmood





Re: Welcome Andrew Musselman as new committer

2014-03-08 Thread Sebastian Schelter

Hi Pavan,

Committership is given for engagement with the project like providing 
documentation, answering questions on the mailing list, reviewing
patches, testing patches and submitting patches.


We currently have a discussion ongoing about the future of Mahout, feel
free to participate.


--sebastian


On 03/07/2014 06:41 PM, Pavan Kumar N wrote:

Congratulations to Andrew. It would be nice to have some
information/background on how the PMC evaluated Andrew to become a committer.
Also, it would be nice to know what future aspects/algorithms of machine
learning Mahout is going to focus on.

I have been keen to maintain code for one of the projects, and mistakenly I
spent time on developing a map-reduce version of a weighted linear regression
solution procedure. Only recently I saw that Mahout's webpages were updated.
I would appreciate any advice from Andrew and the other PMC members.

Pavan


On 7 March 2014 22:56, Frank Scholten fr...@frankscholten.nl wrote:


Congratulations Andrew!


On Fri, Mar 7, 2014 at 6:12 PM, Sebastian Schelter s...@apache.org wrote:


Hi,

this is to announce that the Project Management Committee (PMC) for

Apache

Mahout has asked Andrew Musselman to become committer and we are pleased

to

announce that he has accepted.

Being a committer enables easier contribution to the project since in
addition to posting patches on JIRA it also gives write access to the

code

repository. That also means that now we have yet another person who can
commit patches submitted by others to our repo *wink*

Andrew, we look forward to working with you in the future. Welcome! It
would be great if you could introduce yourself with a few words :)

Sebastian









Welcome Andrew Musselman as new committer

2014-03-07 Thread Sebastian Schelter

Hi,

this is to announce that the Project Management Committee (PMC) for 
Apache Mahout has asked Andrew Musselman to become committer and we are 
pleased to announce that he has accepted.


Being a committer enables easier contribution to the project since in 
addition to posting patches on JIRA it also gives write access to the 
code repository. That also means that now we have yet another person who 
can commit patches submitted by others to our repo *wink*


Andrew, we look forward to working with you in the future. Welcome! It 
would be great if you could introduce yourself with a few words :)


Sebastian


Re: Rework our website

2014-03-06 Thread Sebastian Schelter
Thank you very much! Could you create a jira ticket and post the links 
there? That would be awesome, then we can track that this stuff gets fixed.


Best,
Sebastian

On 03/06/2014 02:58 PM, Kevin Moulart wrote:

Hi I also prefer the second one.

While I'm at it, there are several links that point to absent pages. I just
clicked on all the links present on the page:
http://mahout.apache.org/users/basics/quickstart.html

And those links are broken :
http://mahout.apache.org/users/basics/recommender-documentation.html
http://mahout.apache.org/users/classification/partial-implementation.html
http://mahout.apache.org/users/basics/TasteCommandLine
http://mahout.apache.org/users/recommender/recommendationexamples.html
http://mahout.apache.org/users/basics/parallel-frequent-pattern-mining.html
http://mahout.apache.org/users/basics/mahout.ga.tutorial.html
http://hadoop.apache.org.html/

That's just the ones I found in 2 minutes on the quickstart page.

Best Regards,
Kevin


2014-03-05 23:43 GMT+01:00 Sebastian Schelter s...@apache.org:


At the moment, only committers can change the website unfortunately. If
you have a text to add, I'm happy to work it in and add your name to our
contributors list in the CHANGELOG.

Best,
Sebastian



On 03/05/2014 04:58 PM, Scott C. Cote wrote:


I had recently taken the text tour of mahout, but I couldn't decipher a
way to contribute updates to the tour (some of the file names have
changed, etc).

How would I start?   (this was part of my offer to help with the
documentation of Mahout).

SCott

On 3/5/14 9:47 AM, Pat Ferrel p...@occamsmachete.com wrote:

  What no centered text??


;-)

Love either.

BTW users are no longer able to contribute content to the wiki. Most CMSs
have a way to allow input that is moderated. Might this make getting
documentation help easier? Allow anyone to contribute but committers can
filter out the bad, sort of like submitting patches.

On Mar 5, 2014, at 4:11 AM, Sebastian Schelter s...@apache.org wrote:

Hi everyone,

In our latest discussion, I argued that the lack (and errors) of
documentation on our website is one of the main pain points of Mahout
atm. To be honest, I'm also not very happy with the design, especially
fonts and spacing make it super hard to read long articles. This also
prevents me from wanting to add articles and documentation.

I think we should have a beautiful website, where it is fun to add new
stuff.

My design skills are pretty limited, but fortunately my brother is an art
director! I asked him to make our website a bit more beautiful without
changing to much of the structure, so that a redesign wouldn't take too
long.

I really like the results and would volunteer to dig out my CSS skills
and do the redesign, if people agree.

Here are his drafts, I like the second one best:

https://people.apache.org/~ssc/mahout/mahout.jpg
https://people.apache.org/~ssc/mahout/mahout2.jpg

Let me know what you think!

Best,
Sebastian














Re: Rework our website

2014-03-06 Thread Sebastian Schelter

Could you add the missing pages to the jira issue? I'll have a look later.

On 03/06/2014 03:25 PM, Suneel Marthi wrote:

I fixed some of the broken links. For some of the others, e.g. TasteCommandline and
Recommendationexamples, either the pages have not been migrated or the links
have to be purged.






On Thursday, March 6, 2014 9:07 AM, Sebastian Schelter s...@apache.org wrote:

Thank you very much! Could you create a jira ticket and post the links
there? That would be awesome, then we can track that this stuff gets fixed.

Best,
Sebastian


On 03/06/2014 02:58 PM, Kevin Moulart wrote:

Hi I also prefer the second one.

While I'm at it, there are several links that point to absent pages. I just
clicked on all the links present on the page:
http://mahout.apache.org/users/basics/quickstart.html

And those links are broken :
http://mahout.apache.org/users/basics/recommender-documentation.html
http://mahout.apache.org/users/classification/partial-implementation.html
http://mahout.apache.org/users/basics/TasteCommandLine
http://mahout.apache.org/users/recommender/recommendationexamples.html
http://mahout.apache.org/users/basics/parallel-frequent-pattern-mining.html
http://mahout.apache.org/users/basics/mahout.ga.tutorial.html
http://hadoop.apache.org.html/

That's just the ones I found in 2 minutes on the quickstart page.

Best Regards,
Kevin


2014-03-05 23:43 GMT+01:00 Sebastian Schelter s...@apache.org:


At the moment, only committers can change the website unfortunately. If
you have a text to add, I'm happy to work it in and add your name to our
contributors list in the CHANGELOG.

Best,
Sebastian



On 03/05/2014 04:58 PM, Scott C. Cote wrote:


I had recently taken the text tour of mahout, but I couldn't decipher a
way to contribute updates to the tour (some of the file names have
changed, etc).

How would I start?   (this was part of my offer to help with the
documentation of Mahout).

SCott

On 3/5/14 9:47 AM, Pat Ferrel p...@occamsmachete.com wrote:

What no centered text??


;-)

Love either.

BTW users are no longer able to contribute content to the wiki. Most CMSs
have a way to allow input that is moderated. Might this make getting
documentation help easier? Allow anyone to contribute but committers can
filter out the bad, sort of like submitting patches.

On Mar 5, 2014, at 4:11 AM, Sebastian Schelter s...@apache.org wrote:

Hi everyone,

In our latest discussion, I argued that the lack (and errors) of
documentation on our website is one of the main pain points of Mahout
atm. To be honest, I'm also not very happy with the design, especially
fonts and spacing make it super hard to read long articles. This also
prevents me from wanting to add articles and documentation.

I think we should have a beautiful website, where it is fun to add new
stuff.

My design skills are pretty limited, but fortunately my brother is an art
director! I asked him to make our website a bit more beautiful without
changing too much of the structure, so that a redesign wouldn't take too
long.

I really like the results and would volunteer to dig out my CSS skills
and do the redesign, if people agree.

Here are his drafts, I like the second one best:

https://people.apache.org/~ssc/mahout/mahout.jpg
https://people.apache.org/~ssc/mahout/mahout2.jpg

Let me know what you think!

Best,
Sebastian














Re: Recommend items not rated by any user

2014-03-05 Thread Sebastian Schelter

Hi Juan,

that is a good catch. CandidateItemsStrategy is the right place to 
implement this. Maybe we should simply extend its interface to add a 
parameter that says whether to keep or remove the current user's items?


We could even do this in the abstract base class then.

--sebastian

On 03/05/2014 10:42 AM, Juan José Ramos wrote:

In case somebody runs into the same situation, the key seems to be in the
CandidateItemStrategy being passed to the constructor
of GenericItemBasedRecommender. Looking into the code, if no
CandidateItemStrategy is specified in the
constructor, PreferredItemsNeighborhoodCandidateItemsStrategy is used and
as the documentation says, the doGetCandidateItems method: returns all
items that have not been rated by the user and that were preferred by
another user that has preferred at least one item that the current user has
preferred too.

So, a different CandidateItemStrategy needs to be passed. For this problem,
it seems to me that AllSimilarItemsCandidateItemsStrategy,
AllUnknownItemsCandidateItemsStrategy are good candidates. Does anybody
know where to find some documentation about the different
CandidateItemStrategy? Based on the name I would say that:
1) AllSimilarItemsCandidateItemsStrategy returns all similar items
regardless of whether they have been already rated by someone or not.
2) AllUnknownItemsCandidateItemsStrategy returns all similar items that
have not been rated by anyone yet.

Does anybody know if it works like that?
Thanks.
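
For what it's worth, here is a minimal sketch of wiring one of them in (the
classes are from org.apache.mahout.cf.taste.impl.recommender and the file
paths are the ones from my setup; imports omitted):

  DataModel dataModel = new FileDataModel(new File("data/dataModel.txt"));
  ItemSimilarity similarity = new FileItemSimilarity(new File("data/similarities"));
  // candidates for recommend(): every item the user has not interacted with,
  // including items that nobody has rated yet
  CandidateItemsStrategy candidates = new AllUnknownItemsCandidateItemsStrategy();
  // candidates for mostSimilarItems(): everything similar to the given items
  MostSimilarItemsCandidateItemsStrategy mostSimilar =
      new AllSimilarItemsCandidateItemsStrategy(similarity);
  Recommender recommender =
      new GenericItemBasedRecommender(dataModel, similarity, candidates, mostSimilar);
  List<RecommendedItem> topTen = recommender.recommend(1L, 10);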


On Tue, Mar 4, 2014 at 9:16 AM, Juan José Ramos jjar...@gmail.com wrote:


First thing is that I know this requirement would not make sense in a CF
Recommender. In my case, I am trying to use Mahout to create something
closer to a Content-Based Recommender.

In particular, I am pre-computing a similarity matrix between all the
documents (items) of my catalogue and using that matrix as the
ItemSimilarity for my Item-Based Recommender.

So, when a user rates a document, how could I make the recommender output
similar documents to the ones the user has already rated, even if no other
user in the system has rated them yet? Is that even possible in the first
place?

Thanks a lot.







Rework our website

2014-03-05 Thread Sebastian Schelter

Hi everyone,

In our latest discussion, I argued that the lack (and errors) of 
documentation on our website is one of the main pain points of Mahout 
atm. To be honest, I'm also not very happy with the design, especially 
fonts and spacing make it super hard to read long articles. This also 
prevents me from wanting to add articles and documentation.


I think we should have a beautiful website, where it is fun to add new 
stuff.


My design skills are pretty limited, but fortunately my brother is an 
art director! I asked him to make our website a bit more beautiful 
without changing too much of the structure, so that a redesign wouldn't
take too long.


I really like the results and would volunteer to dig out my CSS skills 
and do the redesign, if people agree.


Here are his drafts, I like the second one best:

https://people.apache.org/~ssc/mahout/mahout.jpg
https://people.apache.org/~ssc/mahout/mahout2.jpg

Let me know what you think!

Best,
Sebastian


Re: Recommend items not rated by any user

2014-03-05 Thread Sebastian Schelter

On 03/05/2014 01:23 PM, Juan José Ramos wrote:

Thanks for the reply, Sebastian.

I am not sure if that should be implemented in the Abstract base class
though because for
instance PreferredItemsNeighborhoodCandidateItemsStrategy, by definition,
it returns the item not rated by the user and rated by somebody else.


Good point. So we seem to need special implementations.



Back to my last post, I have been playing around with
AllSimilarItemsCandidateItemsStrategy
and AllUnknownItemsCandidateItemsStrategy, and although they both do what I
wanted (recommend items not previously rated by any user), I honestly can't
tell the difference between the two strategies. In my tests the output was
always the same. If the eventual output of the recommender will not include
items already rated by the user as pointed out here (
http://mail-archives.apache.org/mod_mbox/mahout-user/201403.mbox/%3CCABHkCkuv35dbwF%2B9sK88FR3hg7MAcdv0MP10v-5QWEvwmNdY%2BA%40mail.gmail.com%3E),
AllSimilarItemsCandidateItemsStrategy should be equivalent to
AllUnkownItemsCandidateItemsStrategy, shouldn't it?


AllSimilarItems returns all items that are similar to any item that the 
user already knows. AllUnknownItems simply returns all items that the 
user has not interacted with yet.


These are two different things, although they might overlap in some 
scenarios.


Best,
Sebastian




Thanks.

On Wed, Mar 5, 2014 at 10:23 AM, Sebastian Schelter s...@apache.org wrote:


Hi Juan,

that is a good catch. CandidateItemsStrategy is the right place to

implement this. Maybe we should simply extend its interface to add a
parameter that says whether to keep or remove the current user's items?


We could even do this in the abstract base class then.

--sebastian


On 03/05/2014 10:42 AM, Juan José Ramos wrote:


In case somebody runs into the same situation, the key seems to be in the
CandidateItemStrategy being passed to the constructor
of GenericItemBasedRecommender. Looking into the code, if no
CandidateItemStrategy is specified in the
constructor, PreferredItemsNeighborhoodCandidateItemsStrategy is used and
as the documentation says, the doGetCandidateItems method: returns all
items that have not been rated by the user and that were preferred by
another user that has preferred at least one item that the current user

has

preferred too.

So, a different CandidateItemStrategy needs to be passed. For this

problem,

it seems to me that AllSimilarItemsCandidateItemsStrategy,
AllUnknownItemsCandidateItemsStrategy are good candidates. Does anybody
know where to find some documentation about the different
CandidateItemStrategy? Based on the name I would say that:
1) AllSimilarItemsCandidateItemsStrategy returns all similar items
regardless of whether they have been already rated by someone or not.
2) AllUnknownItemsCandidateItemsStrategy returns all similar items that
have not been rated by anyone yet.

Does anybody know if it works like that?
Thanks.


On Tue, Mar 4, 2014 at 9:16 AM, Juan José Ramos jjar...@gmail.com

wrote:



First thing is that I know this requirement would not make sense in a CF
Recommender. In my case, I am trying to use Mahout to create something
closer to a Content-Based Recommender.

In particular, I am pre-computing a similarity matrix between all the
documents (items) of my catalogue and using that matrix as the
ItemSimilarity for my Item-Based Recommender.

So, when a user rates a document, how could I make the recommender output
similar documents to the ones the user has already rated, even if no other
user in the system has rated them yet? Is that even possible in the first
place?

Thanks a lot.











Re: Recommend items not rated by any user

2014-03-05 Thread Sebastian Schelter

 So both strategies seem to be effectively the same, I don't know what
 the implementers had in mind when designing
 AllSimilarItemsCandidateItemsStrategy.

It can take a long time to estimate preferences for all items a user 
doesn't know. Especially if you have a lot of items. Traditional 
item-based recommenders will not recommend any item that is not similar 
to at least one of the items the user interacted with, so 
AllSimilarItemsStrategy already selects the maximum set of items that 
could be potentially recommended to the user.


--sebastian



On 03/05/2014 05:38 PM, Tevfik Aytekin wrote:

If the similarities between item 5 and two of the items user 1 preferred are not
NaN, then it will return it; that is what I'm saying. If the
similarities were all NaN, then
it will not return it.

But surely, you might wonder if all similarities between an item and
user's items are NaN, then
AllUnknownItemsCandidateItemsStrategy probably will not return it.




On Wed, Mar 5, 2014 at 6:06 PM, Juan José Ramos jjar...@gmail.com wrote:

@Tevfik, running this recommender:

GenericItemBasedRecommender itemRecommender = new
GenericItemBasedRecommender(dataModel, itemSimilarity, new
AllSimilarItemsCandidateItemsStrategy(itemSimilarity), new
AllSimilarItemsCandidateItemsStrategy(itemSimilarity));


With this dataModel:
1,1,1.0
1,2,2.0
1,3,1.0
1,4,2.0
2,1,1.0
2,2,4.0


And these similarities
1,2,0.1
1,3,0.2
1,4,0.3
2,3,0.5
3,4,0.5
5,1,0.2
5,2,1.0

Returns item 5 for User 1. So item 5 has not been preferred by user 1, and
the similarities between item 5 and two of the items user 1 preferred are not
NaN, but AllSimilarItemsCandidateItemsStrategy is returning that item. So,
I'm truly sorry to insist on this, but I still really do not get the
difference.


On Wed, Mar 5, 2014 at 2:53 PM, Tevfik Aytekin tevfik.ayte...@gmail.comwrote:


Juan,
You got me wrong,

AllSimilarItemsCandidateItemsStrategy

returns all items that have not been rated by the user and the
similarity metric returns a non-NaN similarity value with at
least one of the items preferred by the user.

So, it does not simply return all items that have not been rated by
the user. For example, if there is an item X which has not been rated
by the user and the similarity values between X and all of
the items rated (preferred) by the user are NaN, then X will not
be returned by AllSimilarItemsCandidateItemsStrategy, but it will be
returned by AllUnknownItemsCandidateItemsStrategy.



On Wed, Mar 5, 2014 at 4:42 PM, Juan José Ramos jjar...@gmail.com wrote:

Hi Tefik,

Thanks for the response. I think what you say contradicts what Sebastian
pointed out before. Also, if AllSimilarItemsCandidateItemsStrategy

returns

all items that have not been rated by the user, what would
AllUnknownItemsCandidateItemsStrategy return?


On Wed, Mar 5, 2014 at 1:40 PM, Tevfik Aytekin tevfik.ayte...@gmail.com
wrote:


Sorry there was a typo in the previous paragraph.

If I remember correctly, AllSimilarItemsCandidateItemsStrategy

returns all items that have not been rated by the user and the
similarity metric returns a non-NaN similarity value with at
least one of the items preferred by the user.

On Wed, Mar 5, 2014 at 3:38 PM, Tevfik Aytekin 

tevfik.ayte...@gmail.com

wrote:

Hi Juan,

If I remember correctly, AllSimilarItemsCandidateItemsStrategy

returns all items that have not been rated by the user and the
similarity metric returns a non-NaN similarity value that is with at
least one of the items preferred by the user.

Tevfik

On Wed, Mar 5, 2014 at 2:30 PM, Sebastian Schelter s...@apache.org

wrote:

On 03/05/2014 01:23 PM, Juan José Ramos wrote:


Thanks for the reply, Sebastian.

I am not sure if that should be implemented in the Abstract base

class

though because for
instance PreferredItemsNeighborhoodCandidateItemsStrategy, by

definition,

it returns the item not rated by the user and rated by somebody

else.



Good point. So we seem to need special implementations.




Back to my last post, I have been playing around with
AllSimilarItemsCandidateItemsStrategy
and AllUnknownItemsCandidateItemsStrategy, and although they both do

what

I
wanted (recommend items not previously rated by any user), I

honestly

can't
tell the difference between the two strategies. In my tests the

output

was

always the same. If the eventual output of the recommender will not
include
items already rated by the user as pointed out here (





http://mail-archives.apache.org/mod_mbox/mahout-user/201403.mbox/%3CCABHkCkuv35dbwF%2B9sK88FR3hg7MAcdv0MP10v-5QWEvwmNdY%2BA%40mail.gmail.com%3E

),

AllSimilarItemsCandidateItemsStrategy should be equivalent to
AllUnkownItemsCandidateItemsStrategy, shouldn't it?



AllSimilarItems returns all items that are similar to any item that

the

user

already knows. AllUnknownItems simply returns all items that the user

has

not interacted with yet.

These are two different things, although they might overlap in some
scenarios.

Best

Re: Recommend items not rated by any user

2014-03-05 Thread Sebastian Schelter
For SVD-based algorithms, you should use the AllUnknownItems
strategy then, that's correct.


In the majority of industry use cases that I have seen, people use
pre-computed item similarities (Mahout has lots of machinery for doing 
this, btw), so AllSimilarItems totally makes sense there.


--sebastian

On 03/05/2014 06:01 PM, Tevfik Aytekin wrote:

It can even make things worse in SVD-based algorithms for which
preference estimation is very fast.

On Wed, Mar 5, 2014 at 7:00 PM, Tevfik Aytekin tevfik.ayte...@gmail.com wrote:

Hi Sebastian,
But in order not to select items that are not similar to at least one
of the items the user interacted with, you have to compute the
similarity with all of the user's items (which is the main task for estimating
the preference of an item in the item-based method). So, it seems to me
that AllSimilarItemsStrategy does not bring much advantage over
AllUnknownItemsCandidateItemsStrategy.

On Wed, Mar 5, 2014 at 6:46 PM, Sebastian Schelter s...@apache.org wrote:

So both strategies seem to be effectively the same, I don't know what
the implementers had in mind when designing
AllSimilarItemsCandidateItemsStrategy.


It can take a long time to estimate preferences for all items a user doesn't
know. Especially if you have a lot of items. Traditional item-based
recommenders will not recommend any item that is not similar to at least one
of the items the user interacted with, so AllSimilarItemsStrategy already
selects the maximum set of items that could be potentially recommended to
the user.

--sebastian




On 03/05/2014 05:38 PM, Tevfik Aytekin wrote:


If the similarities between item 5 and two of the items user 1 preferred are not
NaN, then it will return it; that is what I'm saying. If the
similarities were all NaN, then
it will not return it.

But surely, you might wonder if all similarities between an item and
user's items are NaN, then
AllUnknownItemsCandidateItemsStrategy probably will not return it.




On Wed, Mar 5, 2014 at 6:06 PM, Juan José Ramos jjar...@gmail.com wrote:


@Tevfik, running this recommender:

GenericItemBasedRecommender itemRecommender = new
GenericItemBasedRecommender(dataModel, itemSimilarity, new
AllSimilarItemsCandidateItemsStrategy(itemSimilarity), new
AllSimilarItemsCandidateItemsStrategy(itemSimilarity));


With this dataModel:
1,1,1.0
1,2,2.0
1,3,1.0
1,4,2.0
2,1,1.0
2,2,4.0


And these similarities
1,2,0.1
1,3,0.2
1,4,0.3
2,3,0.5
3,4,0.5
5,1,0.2
5,2,1.0

Returns item 5 for User 1. So item 5 has not been preferred by user 1,
and
the similarity between item 5 and two of the items user 1 preferred are
not
NaN, but AllSimilarItemsCandidateItemsStrategy is returning that item.
So,
I'm truly sorry to insist on this, but I still really do not get the
difference.


On Wed, Mar 5, 2014 at 2:53 PM, Tevfik Aytekin
tevfik.ayte...@gmail.comwrote:


Juan,
You got me wrong,

AllSimilarItemsCandidateItemsStrategy

returns all items that have not been rated by the user and the
similarity metric returns a non-NaN similarity value with at
least one of the items preferred by the user.

So, it does not simply return all items that have not been rated by
the user. For example, if there is an item X which has not been rated
by the user and the similarity values between X and all of
the items rated (preferred) by the user are NaN, then X will not
be returned by AllSimilarItemsCandidateItemsStrategy, but it will be
returned by AllUnknownItemsCandidateItemsStrategy.



On Wed, Mar 5, 2014 at 4:42 PM, Juan José Ramos jjar...@gmail.com
wrote:


Hi Tefik,

Thanks for the response. I think what you say contradicts what
Sebastian
pointed out before. Also, if AllSimilarItemsCandidateItemsStrategy


returns


all items that have not been rated by the user, what would
AllUnknownItemsCandidateItemsStrategy return?


On Wed, Mar 5, 2014 at 1:40 PM, Tevfik Aytekin
tevfik.ayte...@gmail.com
wrote:


Sorry there was a typo in the previous paragraph.

If I remember correctly, AllSimilarItemsCandidateItemsStrategy

returns all items that have not been rated by the user and the
similarity metric returns a non-NaN similarity value with at
least one of the items preferred by the user.

On Wed, Mar 5, 2014 at 3:38 PM, Tevfik Aytekin 


tevfik.ayte...@gmail.com


wrote:


Hi Juan,

If I remember correctly, AllSimilarItemsCandidateItemsStrategy

returns all items that have not been rated by the user and the
similarity metric returns a non-NaN similarity value that is with at
least one of the items preferred by the user.

Tevfik

On Wed, Mar 5, 2014 at 2:30 PM, Sebastian Schelter s...@apache.org


wrote:


On 03/05/2014 01:23 PM, Juan José Ramos wrote:



Thanks for the reply, Sebastian.

I am not sure if that should be implemented in the Abstract base


class


though because for
instance PreferredItemsNeighborhoodCandidateItemsStrategy, by


definition,


it returns the item not rated by the user and rated by somebody


else.




Good point. So we seem to need special

Re: Rework our website

2014-03-05 Thread Sebastian Schelter
At the moment, only committers can change the website unfortunately. If 
you have a text to add, I'm happy to work it in and add your name to our 
contributors list in the CHANGELOG.


Best,
Sebastian


On 03/05/2014 04:58 PM, Scott C. Cote wrote:

I had recently taken the text tour of mahout, but I couldn't decipher a
way to contribute updates to the tour (some of the file names have
changed, etc).

How would I start?   (this was part of my offer to help with the
documentation of Mahout).

SCott

On 3/5/14 9:47 AM, Pat Ferrel p...@occamsmachete.com wrote:


What no centered text??

;-)

Love either.

BTW users are no longer able to contribute content to the wiki. Most CMSs
have a way to allow input that is moderated. Might this make getting
documentation help easier? Allow anyone to contribute but committers can
filter out the bad, sort of like submitting patches.

On Mar 5, 2014, at 4:11 AM, Sebastian Schelter s...@apache.org wrote:

Hi everyone,

In our latest discussion, I argued that the lack (and errors) of
documentation on our website is one of the main pain points of Mahout
atm. To be honest, I'm also not very happy with the design, especially
fonts and spacing make it super hard to read long articles. This also
prevents me from wanting to add articles and documentation.

I think we should have a beautiful website, where it is fun to add new
stuff.

My design skills are pretty limited, but fortunately my brother is an art
director! I asked him to make our website a bit more beautiful without
changing too much of the structure, so that a redesign wouldn't take too
long.

I really like the results and would volunteer to dig out my CSS skills
and do the redesign, if people agree.

Here are his drafts, I like the second one best:

https://people.apache.org/~ssc/mahout/mahout.jpg
https://people.apache.org/~ssc/mahout/mahout2.jpg

Let me know what you think!

Best,
Sebastian








Re: Mahout-232-0.8.patch using

2014-03-04 Thread Sebastian Schelter
I think you should rather choose a different library that already offers
an SVM than trying to revive a 4-year-old patch.


--sebastian

On 03/04/2014 08:51 AM, Amol Kakade wrote:

Hi,
I am a new user of Mahout and want to run a sample SVM algorithm with Mahout.
Can you please list the steps to use Mahout-232-0.8.patch for SVM in Mahout?
I have been trying for the last 2 days but getting errors.
--
Amol  Kakade.





Re: how to recommend users already consumed items

2014-03-04 Thread Sebastian Schelter
I think we should introduce a new parameter for the recommend() method 
in the Recommender interface that tells whether already known items 
should be recommended or not.


What do you think?
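
As a rough sketch (a proposal only, not existing API), the overload could
look like this, with the two-argument version keeping the current behavior:

  public interface Recommender extends Refreshable {
    List<RecommendedItem> recommend(long userID, int howMany) throws TasteException;
    // proposed: when includeKnownItems is true, items the user has already
    // interacted with may show up in the result
    List<RecommendedItem> recommend(long userID, int howMany, boolean includeKnownItems)
        throws TasteException;
  }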

Best,
Sebastian

On 03/04/2014 05:32 PM, Pat Ferrel wrote:

I’d suggest a command line option if you want to submit a patch. Most people 
will want that line executed so the default should be the current behavior. But 
a large minority will want it your way.

And please do submit a patch with the Jira, it will make your life easier when
new releases come out; you won't have to manage a fork.

On Mar 2, 2014, at 12:38 PM, Mario Levitin mariolevi...@gmail.com wrote:

Juan, I don't understand your solution: if there are no ratings, how can you
blend the recommendations from the system with the user's already-read news?

Anyway, I think, as Pat does, the best way is to remove the mentioned line.
It should be the responsibility of the business logic to remove user's
items if needed.

I will also create a Jira issue as you suggested.

thanks
On Sun, Mar 2, 2014 at 7:12 PM, Ted Dunning ted.dunn...@gmail.com wrote:


On Sun, Mar 2, 2014 at 8:52 AM, Pat Ferrel p...@occamsmachete.com wrote:


You are not the only one to see this so I'd recommend creating an option
for the Job, which will be checked before executing that line of code

then

submit it as a patch to the Jira you need to create in any case.

That way it might get into the mainline and you won't have to maintain a
fork.



Avoiding the cost of a fork over a trivial issue like this is a grand idea.







Re: how to recommend users already consumed items

2014-03-04 Thread Sebastian Schelter

That's fine, I was talking about the non-distributed part only.

This page has instructions on how to create patches:

https://mahout.apache.org/developers/how-to-contribute.html

Let me know if you need more infos!

Best,
Sebastian


On 03/05/2014 12:27 AM, Mario Levitin wrote:

I have created a Jira issue already.
I only use the non-hadoop part of Mahout recommender algorithms.
Maybe I can create a patch for that part. However, I have not done it
before, and don't know how to proceed.


On Wed, Mar 5, 2014 at 1:01 AM, Sebastian Schelter s...@apache.org wrote:


Would you be willing to set up a jira issue and create a patch for this?

--sebastian


On 03/04/2014 11:58 PM, Mario Levitin wrote:




I think we should introduce a new parameter for the recommend() method in
the Recommender interface that tells whether already known items should
be
recommended or not.




I agree (if the parameter is missing then it defaults to the current behavior as
Pat suggested)







On 03/04/2014 05:32 PM, Pat Ferrel wrote:



  I'd suggest a command line option if you want to submit a patch. Most

people will want that line executed so the default should be the current
behavior. But a large minority will want it your way.

And please do submit a patch with the Jira, it will make your life easier
when new releases come out; you won't have to manage a fork.

On Mar 2, 2014, at 12:38 PM, Mario Levitin mariolevi...@gmail.com
wrote:

Juan, I don't understand your solution: if there are no ratings, how can you
blend the recommendations from the system with the user's already-read news?

Anyway, I think, as Pat does, the best way is to remove the mentioned
line.
It should be the responsibility of the business logic to remove user's
items if needed.

I will also create a Jira issue as you suggested.

thanks
On Sun, Mar 2, 2014 at 7:12 PM, Ted Dunning ted.dunn...@gmail.com
wrote:

   On Sun, Mar 2, 2014 at 8:52 AM, Pat Ferrel p...@occamsmachete.com


wrote:

   You are not the only one to see this so I'd recommend creating an
option


for the Job, which will be checked before executing that line of code

  then


  submit it as a patch to the Jira you need to create in any case.


That way it might get into the mainline and you won't have to
maintain a
fork.


  Avoiding the cost of a fork over a trivial issue like this is a grand

idea.

















Re: Issue updating a FileDataModel

2014-03-03 Thread Sebastian Schelter

Hi Juan,

IIRC, FileDataModel has a parameter that determines how much time must have
passed since the last modification of the underlying file before it will be
reloaded. You can also directly append new data to the original file.
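
A minimal sketch (constructor per the FileDataModel javadoc; if I remember
correctly the default reload interval is 60 seconds, which would explain why
several quick updates in a row are ignored):

  // transpose=false keeps the user,item column order; the third argument is
  // minReloadIntervalMS: refresh() only re-reads the file if it was modified
  // at least this much later than the last load
  DataModel model = new FileDataModel(new File("data/dataModel.txt"), false, 500L);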


If you want to have a DataModel that can be concurrently updated, I
suggest moving your data to a database.


--sebastian

On 03/02/2014 11:11 PM, Juan José Ramos wrote:

I am having issues refreshing my recommender, in particular with the
DataModel.

I am using a FileDataModel and a GenericItemBasedRecommender that also has
a CachingItemSimilarity wrapping a FileItemSimilarity. But for the test I
am running I am making things even simpler.

By the time I instantiate the recommender, these two files are in the
FileSystem:
data/datamodel.txt
0,1,0.0

data/datamodel.0.txt
0,2,1.0

And then I run the code you can find below:

---

FileDataModel dataModel = new FileDataModel(new File("data/dataModel.txt"));

FileItemSimilarity itemSimilarity = new FileItemSimilarity(new File("data/similarities"));

GenericItemBasedRecommender itemRecommender =
    new GenericItemBasedRecommender(dataModel, itemSimilarity);

System.out.println("Number of users in the system: " +
    itemRecommender.getDataModel().getNumUsers() + " and " +
    itemRecommender.getDataModel().getNumItems() + " items");

FileWriter writer = new FileWriter(new File("data/dataModel.1.txt"));
writer.write("1,2,1.0\r");
writer.close();

writer = new FileWriter(new File("data/dataModel.2.txt"));
writer.write("2,2,1.0\r");
writer.close();

writer = new FileWriter(new File("data/dataModel.3.txt"));
writer.write("3,2,1.0\r");
writer.close();

writer = new FileWriter(new File("data/dataModel.4.txt"));
writer.write("4,2,1.0\r");
writer.close();

writer = new FileWriter(new File("data/dataModel.5.txt"));
writer.write("5,2,1.0\r");
writer.close();

writer = new FileWriter(new File("data/dataModel.6.txt"));
writer.write("6,2,1.0\r");
writer.close();

itemRecommender.refresh(null);

System.out.println("Number of users in the system: " +
    itemRecommender.getDataModel().getNumUsers() + " and " +
    itemRecommender.getDataModel().getNumItems() + " items");

---

The output is the same in both println: Number of users in the system: 2
and 2items. So, only the information from the files that were on the system
by the time I run this test seem to get loaded on the DataModel.

What can be causing that? Is there a maximum number of updates a
FileDataModel can take up in every refresh?

Could it be that actually by the time I call itemRecommender.refresh(null)
the files have not been written to the FileSystem?

Should I be calling refresh in a different manner?

Thank you for your help.





Re: classification in standalone application in Apache Mahout 0.9

2014-03-03 Thread Sebastian Schelter
If you don't want to call a shell, I assume you don't want to use a 
Hadoop cluster, right? In that case, you should rather try Mahout's 
logistic regression classifier, which is tuned for usage on a single 
machine.


--sebastian
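
A minimal single-machine sketch along those lines (hypothetical data; text would first have to be encoded into vectors, e.g. with Mahout's feature encoders):

import org.apache.mahout.classifier.sgd.L1;
import org.apache.mahout.classifier.sgd.OnlineLogisticRegression;
import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.Vector;

public class Main {
    public static void main(String[] args) {
        // 2 categories, 3 features, L1 prior
        OnlineLogisticRegression learner =
            new OnlineLogisticRegression(2, 3, new L1()).learningRate(0.1);

        // learn: one call per training example (target category, feature vector)
        Vector example = new DenseVector(new double[] {1.0, 0.5, 0.0});
        learner.train(1, example);

        // classify: probability that a new vector belongs to category 1
        Vector unseen = new DenseVector(new double[] {0.9, 0.4, 0.1});
        System.out.println("p(category 1) = " + learner.classifyScalar(unseen));
    }
}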

On 03/03/2014 03:07 PM, Hollow Quincy wrote:

I am looking for a simple example in Java (without any shell call) of how
to use NaiveBayesClassifier in Apache Mahout 0.9.

I have samples of text. I want to train an algorithm based on this data,
and then I want to classify new text.

class Main {
    public static void main(String[] args) {
        //learn algorithm based on some data
        //classify some data
    }
}

There is no example of how to do it in Apache Mahout 0.9!

Thanks for help





Re: classification in standalone application in Apache Mahout 0.9

2014-03-03 Thread Sebastian Schelter
It's certainly possible to run Hadoop on a single machine, but it will 
give you terrible performance. We don't have a single machine 
implementation of naive bayes, so I'd really suggest you use the 
logistic regression code.


--sebastian

On 03/03/2014 03:15 PM, Hollow Quincy wrote:

You are right. I want to call my program on a single machine in a classic
public static void main() standalone application.
In my opinion, Naive Bayes classification would suit my problem well.
Is there a way to call it from my Java code?
I cannot find any example.

Thanks for help

2014-03-03 15:11 GMT+01:00 Sebastian Schelter s...@apache.org:

If you don't want to call a shell, I assume you don't want to use a Hadoop
cluster, right? In that case, you should rather try Mahout's logistic
regression classifier, which is tuned for usage on a single machine.

--sebastian


On 03/03/2014 03:07 PM, Hollow Quincy wrote:


I am looking for a simple example in Java (without any shell call) of how
to use NaiveBayesClassifier in Apache Mahout 0.9.

I have samples of text. I want to train an algorithm based on this data,
and then I want to classify new text.

class Main {
    public static void main(String[] args) {
        //learn algorithm based on some data
        //classify some data
    }
}

There is no example of how to do it in Apache Mahout 0.9!

Thanks for help







Re: Issue updating a FileDataModel

2014-03-03 Thread Sebastian Schelter
I think it depends on the difference between the time of the call to 
refresh() and the last modified time of the file.


--sebastian

On 03/03/2014 04:45 PM, Juan José Ramos wrote:

Thanks for the reply, Sebastian.

I do not have concurrent updates, but they actually may happen very, very
close in time.

Would the fact of adding the new preferences to new files or appending to
the existing one make any difference or does everything depends on the time
elapsed between two calls to recommender.refresh(null)?

Many thanks.


On Mon, Mar 3, 2014 at 1:18 PM, Sebastian Schelter s...@apache.org wrote:


Hi Juan,

IIRC, FileDataModel has a parameter that determines how much time must have
passed since the last modification of the underlying file before a refresh
takes effect. You can also directly append new data to the original file.

If you want to have a DataModel that can be concurrently updated, I
suggest moving your data to a database.

--sebastian


On 03/02/2014 11:11 PM, Juan José Ramos wrote:


I am having issues refreshing my recommender, in particular with the
DataModel.

I am using a FileDataModel and a GenericItemBasedRecommender that also has
a CachingItemSimilarity wrapping a FileItemSimilarity. But for the test I
am running I am making things even simpler.

By the time I instantiate the recommender, these two files are in the
FileSystem:
data/datamodel.txt
0,1,0.0

data/datamodel.0.txt
0,2,1.0

And then I run the code you can find below:


---

FileDataModel dataModel = new FileDataModel(new File("data/dataModel.txt"));

FileItemSimilarity itemSimilarity = new FileItemSimilarity(new File("data/similarities"));

GenericItemBasedRecommender itemRecommender =
    new GenericItemBasedRecommender(dataModel, itemSimilarity);

System.out.println("Number of users in the system: "
    + itemRecommender.getDataModel().getNumUsers() + " and "
    + itemRecommender.getDataModel().getNumItems() + "items");

FileWriter writer = new FileWriter(new File("data/dataModel.1.txt"));
writer.write("1,2,1.0\r");
writer.close();

writer = new FileWriter(new File("data/dataModel.2.txt"));
writer.write("2,2,1.0\r");
writer.close();

writer = new FileWriter(new File("data/dataModel.3.txt"));
writer.write("3,2,1.0\r");
writer.close();

writer = new FileWriter(new File("data/dataModel.4.txt"));
writer.write("4,2,1.0\r");
writer.close();

writer = new FileWriter(new File("data/dataModel.5.txt"));
writer.write("5,2,1.0\r");
writer.close();

writer = new FileWriter(new File("data/dataModel.6.txt"));
writer.write("6,2,1.0\r");
writer.close();

itemRecommender.refresh(null);

System.out.println("Number of users in the system: "
    + itemRecommender.getDataModel().getNumUsers() + " and "
    + itemRecommender.getDataModel().getNumItems() + "items");


---

The output is the same in both println: Number of users in the system: 2
and 2items. So, only the information from the files that were on the
system
by the time I run this test seem to get loaded on the DataModel.

What can be causing that? Is there a maximum number of updates a
FileDataModel can take up in every refresh?

Could it be that actually by the time I call itemRecommender.refresh(null)
the files have not been written to the FileSystem?

Should I be calling refresh in a different manner?

Thank you for your help.










Re: Mahout-232-0.8.patch using

2014-03-03 Thread Sebastian Schelter

Hi Amol,

SVMs are not integrated in Mahout. I'd suggest you try our logistic 
regression classifier instead.


Best,
Sebastian

On 03/04/2014 08:51 AM, Amol Kakade wrote:

Hi,
I am a new user of Mahout and want to run a sample SVM algorithm with Mahout.
Can you please list the steps to use Mahout-232-0.8.patch for SVM in Mahout?
I have been trying for the last 2 days but getting errors.
--
Amol  Kakade.





Re: parallelALS and RMSE TEST

2014-03-01 Thread Sebastian Schelter
The output of parallelALS is two matrices, U and M, whose product is an 
approximation of your input matrix.

The matrices are output as sequence files with an IntWritable as key 
(the index of the row in the matrix) and a VectorWritable as value, which 
holds the contents of the row vector.


--sebastian
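
A minimal sketch of reading those factors back in Java (hypothetical path; assumes Mahout's SequenceFileDirIterable helper):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.mahout.common.Pair;
import org.apache.mahout.common.iterator.sequencefile.PathFilters;
import org.apache.mahout.common.iterator.sequencefile.PathType;
import org.apache.mahout.common.iterator.sequencefile.SequenceFileDirIterable;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;

Configuration conf = new Configuration();
for (Pair<IntWritable, VectorWritable> row :
    new SequenceFileDirIterable<IntWritable, VectorWritable>(
        new Path("out/U"), PathType.LIST, PathFilters.partFilter(), conf)) {
    int rowIndex = row.getFirst().get();       // index of the row in the matrix
    Vector features = row.getSecond().get();   // latent feature vector of that row
    System.out.println(rowIndex + " -> " + features);
}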

On 02/27/2014 06:30 PM, AJ Rader wrote:


Sean Owen srowen at gmail.com writes:



Parallel ALS is exactly an example of where you can use matrix
factorization for 0/1 data.

On Mon, May 6, 2013 at 9:22 PM, Tevfik Aytekin tevfik.aytekin at

gmail.com wrote:

Hi Sean,
Isn't boolean preferences is supported in the context of memory-based
recommendation algorithms in Mahout?
Are there matrix factorization algorithms in Mahout which can work
with this kind of data (that is, the kind of data which consists of
users and the movies they have seen).




On Mon, May 6, 2013 at 10:34 PM, Sean Owen srowen at gmail.com

wrote:

Yes, it goes by the name 'boolean prefs' in the project since target
variables don't have values -- they just exist or don't.
So, yes it's certainly supported but the question here is how to
evaluate the output.

On Mon, May 6, 2013 at 8:29 PM, Tevfik Aytekin tevfik.aytekin at

gmail.com wrote:

This problem is called one-class classification problem. In the domain
of collaborative filtering it is called one-class collaborative
filtering (since what you have are only positive preferences). You may
search the web with these key words to find papers providing
solutions. I'm not sure whether Mahout has algorithms for one-class
collaborative filtering.

On Mon, May 6, 2013 at 1:42 PM, Sean Owen srowen at gmail.com

wrote:

ALS-WR weights the error on each term differently, so the average
error doesn't really have meaning here, even if you are comparing the
difference with 1. I think you will need to fall back to mean
average precision or something.

On Mon, May 6, 2013 at 11:24 AM, William icswilliam2010 at

gmail.com wrote:

Sean Owen srowen at gmail.com writes:



If you have no ratings, how are you using RMSE? this typically
measures error in reconstructing ratings.
I think you are probably measuring something meaningless.




I suppose the rating of seen movies is 1. Is that right?
If I use collaborative filtering with ALS-WR to get some recommendations,
must I have a real rating matrix?





I was wondering what kind of format the output produced by parallelALS is
stored in. More specifically I am looking for a way to decode/read this
information.

I have been able to run the mahout parallelALS command, calculate RMSE using
mahout evaluateFactorization, and generate recommendations via mahout
recommendfactorized.

However I would like to take a closer look at things like the factorized
products for my probeSet (stored in --tempDir from the 'mahout
evaluateFactorization' command) and the actual feature vectors stored in the
/out/U/ and /out/M/ directories.

thanks
AJ






Re: Load output of rowsimilarity to memory

2014-02-25 Thread Sebastian Schelter

Hi Juan,

It would definitely be nice to have that in the API! It would be great 
if you could submit a patch after you implemented this.


Best,
Sebastian

On 02/25/2014 10:52 AM, Juan José Ramos wrote:

Thanks for the answer.

That was the approach I had in mind in the first place; the only difference
would be that I will write the output to a file that can later be used to
create a FileItemSimilarity.

I think that would be a very nice feature to have in the API.

Thanks again.


On Mon, Feb 24, 2014 at 9:27 PM, Sebastian Schelter s...@apache.org wrote:


I overlooked that you're interested in document similarities. Sry again :)

Another way would be to read the output of RowSimilarityJob with a
o.a.m.common.iterator.sequencefile.SequenceFileDirIterable

You create a list of instances of o.a.m.cf.taste.impl.similarity.
GenericItemSimilarity.ItemItemSimilarity

e.g. for the output


Key: 0: Value: {61112:0.21139380179557016,52144:0.23797846026935565,...}

you would do

list.add(new ItemItemSimilarity(0, 61112, 0.21139380179557016));
list.add(new ItemItemSimilarity(0, 52144, 0.23797846026935565));
...

After that you create a GenericItemSimilarity from the list of
ItemItemSimilarities, which is the in-memory item similarity you asked for.

Hope that helps,
Sebastian



On 02/24/2014 10:04 PM, Juan José Ramos wrote:


Correct me if I'm wrong, but isn't the ItemSimilarityJob meant to be for
item-based CF? In particular, in the documentation I can read that:
Preferences in the input file should look like
userID,itemID[,preferencevalue]

And in my case the input I have is just text documents and I want to
pre-compute similarities between them beforehand, even before any user has
expressed any preference value for any item.

In order to use ItemSimilarityJob for this purpose, what should be the
input I need to provide? Would it be the output of seq2sparse?

Thanks again.


On Mon, Feb 24, 2014 at 8:54 PM, Sebastian Schelter s...@apache.org
wrote:

  You're right, my bad. If you don't use RowSimilarityJob directly, but

org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob
(which calls RowSimilarityJob under the covers), your output will be a
textfile that is directly usable with FileItemSimilarity.

--sebastian


On 02/24/2014 09:30 PM, Juan José Ramos wrote:

  Thanks for the prompt reply.


RowSimilarityJob produces an output in the form of:
Key: 0: Value: {61112:0.21139380179557016,
52144:0.23797846026935565,...}

whereas FileItemSimilarity is expecting a comma or tab separated inputs.

I assume that you meant that the output of RowSimilarityJob can be
loaded
by the FileItemSimilarity after doing the appropriate parsing. Is that
correct, or is there actually a way to load the raw output of
RowSimilarityJob into FileItemSimilarity?

Thanks.


On Mon, Feb 24, 2014 at 7:41 PM, Sebastian Schelter s...@apache.org
wrote:

   The output of RowSimilarityJob can be loaded by the
FileItemSimilarity.



--sebastian


On 02/24/2014 08:31 PM, Juan José Ramos wrote:

   Is there a way to reproduce this process:


https://cwiki.apache.org/confluence/display/MAHOUT/Quick+tour+of+text+analysis+using+the+Mahout+command+line

inside Java code and not using the command line tool? I am not
interested
in the clustering part but in 'Calculate several similar docs to each
doc
in the data'. In particular, I am interested in loading the output of
the
rowsimilarity tool into memory to be used as my custom ItemSimilarity
implementation for an ItemBasedRecommender.

What I exactly want is to have a matrix in memory where for every doc
in
my
catalogue I have the similarity with the 100 (that is the threshold I
am
using) most similar items and an undefined similarity for the rest.

Is it possible to do with the Java API? I know it can be done calling
the
commands from inside the Java code and I guess that also using
corresponding SparseVectorsFromSequenceFiles, DistributedRowMatrix
and
RowItemSimilarityJob. But I still cannot see an easy way of
parsing
the
output of RowItemSimilarityJob to the memory representation I intend
to
use.

Thanks a lot.




















Re: Load output of rowsimilarity to memory

2014-02-25 Thread Sebastian Schelter
If you iterate over the vector, you will get Vector.Element objects. 
elem.index() gives you the id of the similar thing, elem.get() gives you 
the similarity value.


--sebastian
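
For example (a sketch assuming Mahout 0.9's Vector API and a VectorWritable value read from the job output):

Vector similarities = vectorWritable.get();
for (Vector.Element elem : similarities.nonZeroes()) {
    long similarDocId = elem.index();   // id of the similar document
    double similarity = elem.get();     // similarity value
}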

On 02/25/2014 11:58 AM, Juan José Ramos wrote:

Regarding the parsing of a VectorWritable object, what is the recommended
approach to access the different 'DocID: similarity' pairs?

I can see that if I get the String representation of the
org.apache.mahout.math.Vector
object it should not be hard to parse using the text representation.

However, is there a way to access the individual elements of the 'DocID:
similarity' pair? I tried iterating through the individual Vector.Element
objects and calling get(), but that does not return what I intend to.

More than happy to contribute to the project once I get this working.

Thanks a lot.

On Tue, Feb 25, 2014 at 9:52 AM, Juan José Ramos jjar...@gmail.com wrote:


Thanks for the answer.

That was the approach I had in mind in the first place; the only difference
would be that I will write the output to a file that can later be used to
create a FileItemSimilarity.

I think that would be a very nice feature to have in the API.

Thanks again.


On Mon, Feb 24, 2014 at 9:27 PM, Sebastian Schelter s...@apache.orgwrote:


I overlooked that you're interested in document similarities. Sry again :)

Another way would be to read the output of RowSimilarityJob with a
o.a.m.common.iterator.sequencefile.SequenceFileDirIterable

You create a list of instances of o.a.m.cf.taste.impl.similarity.
GenericItemSimilarity.ItemItemSimilarity

e.g. for the output


Key: 0: Value: {61112:0.21139380179557016,52144:0.23797846026935565,...}

you would do

list.add(new ItemItemSimilarity(0, 61112, 0.21139380179557016));
list.add(new ItemItemSimilarity(0, 52144, 0.23797846026935565));
...

After that you create a GenericItemSimilarity from the list of
ItemItemSimilarities, which is the in-memory item similarity you asked for.

Hope that helps,
Sebastian



On 02/24/2014 10:04 PM, Juan José Ramos wrote:


Correct me if I'm wrong, but isn't the ItemSimilarityJob meant to be for
item-based CF? In particular, in the documentation I can read that:
Preferences in the input file should look like
userID,itemID[,preferencevalue]

And in my case the input I have is just text documents and I want to
pre-compute similarities between them beforehand, even before any user
has
expressed any preference value for any item.

In order to use ItemSimilarityJob for this purpose, what should be the
input I need to provide? Would it be the output of seq2sparse?

Thanks again.


On Mon, Feb 24, 2014 at 8:54 PM, Sebastian Schelter s...@apache.org
wrote:

  You're right, my bad. If you don't use RowSimilarityJob directly, but

org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob
(which calls RowSimilarityJob under the covers), your output will be a
textfile that is directly usable with FileItemSimilarity.

--sebastian


On 02/24/2014 09:30 PM, Juan José Ramos wrote:

  Thanks for the prompt reply.


RowSimilarityJob produces an output in the form of:
Key: 0: Value: {61112:0.21139380179557016,
52144:0.23797846026935565,...}

whereas FileItemSimilarity is expecting a comma or tab separated
inputs.

I assume that you meant that the output of RowSimilarityJob can be
loaded
by the FileItemSimilarity after doing the appropriate parsing. Is that
correct, or is there actually a way to load the raw output of
RowSimilarityJob into FileItemSimilarity?

Thanks.


On Mon, Feb 24, 2014 at 7:41 PM, Sebastian Schelter s...@apache.org
wrote:

   The output of RowSimilarityJob can be loaded by the
FileItemSimilarity.



--sebastian


On 02/24/2014 08:31 PM, Juan José Ramos wrote:

   Is there a way to reproduce this process:


https://cwiki.apache.org/confluence/display/MAHOUT/Quick+tour+of+text+analysis+using+the+Mahout+command+line

inside Java code and not using the command line tool? I am not
interested
in the clustering part but in 'Calculate several similar docs to each
doc
in the data'. In particular, I am interested in loading the output of
the
rowsimilarity tool into memory to be used as my custom ItemSimilarity
implementation for an ItemBasedRecommender.

What I exactly want is to have a matrix in memory where for every
doc in
my
catalogue I have the similarity with the 100 (that is the threshold
I am
using) most similar items and an undefined similarity for the rest.

Is it possible to do with the Java API? I know it can be done calling
the
commands from inside the Java code and I guess that also using
corresponding SparseVectorsFromSequenceFiles, DistributedRowMatrix
and
RowItemSimilarityJob. But I still cannot see an easy way of
parsing
the
output of RowItemSimilarityJob to the memory representation I intend
to
use.

Thanks a lot.






















Re: Use Naïve Bayes on a large CSV

2014-02-24 Thread Sebastian Schelter
NaiveBayes expects a SequenceFile as input. The key is the class label 
as Text; the value holds the features as a VectorWritable.


--sebastian
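
A minimal sketch of producing that format from a CSV (hypothetical paths; assumes the label is in the first column; note that, depending on the version, trainnb may additionally expect the label embedded in the key, e.g. as /label/, so check against your Mahout version):

import java.io.BufferedReader;
import java.io.FileReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.VectorWritable;

public class CsvToNaiveBayesInput {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        SequenceFile.Writer writer = SequenceFile.createWriter(fs, conf,
            new Path("bayes-input/part-m-00000"), Text.class, VectorWritable.class);
        BufferedReader reader = new BufferedReader(new FileReader("input.csv"));
        String line;
        while ((line = reader.readLine()) != null) {
            String[] cols = line.split(",");
            double[] features = new double[cols.length - 1]; // e.g. 1628 feature columns
            for (int i = 1; i < cols.length; i++) {
                features[i - 1] = Double.parseDouble(cols[i]);
            }
            // key: class label (first column), value: the feature vector
            writer.append(new Text("/" + cols[0] + "/"),
                new VectorWritable(new DenseVector(features)));
        }
        reader.close();
        writer.close();
    }
}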

On 02/24/2014 11:51 AM, Kevin Moulart wrote:

Hi again,
I finally set my mind on going through Java to make a sequence file for
naive Bayes, but I still can't manage to find any place stating exactly
what should be in the sequence file for Mahout to process it with Naive Bayes.

I tried virtually every piece of code i found related to this subject, with
no luck.

My CSV file is like this :
Label that I want to predict, feature 1, feature 2, ..., feature 1628

Could someone tell me exactly what Naive Bayes training procedure expects ?


2014-02-20 13:56 GMT+01:00 Jay Vyas jayunit...@gmail.com:


This relates to a previous question I have: does Mahout have a concept of
adapters which allow us to read CSV-style data with filters to create the
exact format for its various inputs (i.e. the recommender's three-column
format)? If not, is it worth a Jira?



On Feb 20, 2014, at 7:50 AM, Kevin Moulart kevinmoul...@gmail.com

wrote:


Hi and thanks !

What about the command line, is there a way to do that using the existing
command line ?




2014-02-20 12:02 GMT+01:00 Suneel Marthi suneel_mar...@yahoo.com:


To convert input CSV to vectors, you can either:

a) use CSVIterator, or
b) use InputDriver

Either of the above should generate vectors from input CSV that could

then

be fed into Mahout classifier/clustering jobs.





On Thursday, February 20, 2014 5:57 AM, Kevin Moulart 
kevinmoul...@gmail.com wrote:

Hi, I'm trying to apply a Naive Bayes classifier to a large CSV file from
the command line.

I know I have to feed the classifier with a seq file, so I tried to put my
CSV into one using the seqdirectory command, but even when I try with a
really small CSV (less than 100 MB) I instantly get an OutOfMemoryError
from Java heap space:

mahout seqdirectory -i /user/cacf/Echant/testSeq -o /user/cacf/resSeq -ow
MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
Running on hadoop, using /opt/cloudera/parcels/CDH/lib/hadoop/bin/hadoop and HADOOP_CONF_DIR=/etc/hadoop/conf
MAHOUT-JOB: /usr/lib/mahout/mahout-examples-0.7-cdh4.5.0-job.jar
14/02/20 11:47:22 INFO common.AbstractJob: Command line arguments: {--charset=[UTF-8], --chunkSize=[64], --endPhase=[2147483647], --fileFilterClass=[org.apache.mahout.text.PrefixAdditionFilter], --input=[/user/cacf/Echant/testSeq], --keyPrefix=[], --output=[/user/cacf/resSeq], --overwrite=null, --startPhase=[0], --tempDir=[temp]}
14/02/20 11:47:22 INFO common.HadoopUtil: Deleting /user/cacf/resSeq
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Arrays.java:2367)
at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:130)
at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:114)
at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:415)
at java.lang.StringBuilder.append(StringBuilder.java:132)
at org.apache.mahout.text.PrefixAdditionFilter.process(PrefixAdditionFilter.java:62)
at org.apache.mahout.text.SequenceFilesFromDirectoryFilter.accept(SequenceFilesFromDirectoryFilter.java:90)
at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1468)
at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1502)
at org.apache.mahout.text.SequenceFilesFromDirectory.run(SequenceFilesFromDirectory.java:98)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
at org.apache.mahout.text.SequenceFilesFromDirectory.main(SequenceFilesFromDirectory.java:53)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:72)
at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:144)
at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:196)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.main(RunJar.java:208)

Do you have an idea or a simple way to use Naive Bayes against my large CSV?

Thanks in advance!
--
Kévin Moulart
GSM France : +33 7 81 06 10 10
GSM Belgique : +32 473 85 23 85
Téléphone fixe : +32 2 771 88 45




--
Kévin Moulart
GSM France : +33 7 81 06 10 10
GSM Belgique : +32 473 85 23 85
Téléphone fixe : +32 2 771 88 45










Re: Load output of rowsimilarity to memory

2014-02-24 Thread Sebastian Schelter

The output of RowSimilarityJob can be loaded by the FileItemSimilarity.

--sebastian

On 02/24/2014 08:31 PM, Juan José Ramos wrote:

Is there a way to reproduce this process:
https://cwiki.apache.org/confluence/display/MAHOUT/Quick+tour+of+text+analysis+using+the+Mahout+command+line

inside Java code and not using the command line tool? I am not interested
in the clustering part but in 'Calculate several similar docs to each doc
in the data'. In particular, I am interested in loading the output of the
rowsimilarity tool into memory to be used as my custom ItemSimilarity
implementation for an ItemBasedRecommender.

What I exactly want is to have a matrix in memory where for every doc in my
catalogue I have the similarity with the 100 (that is the threshold I am
using) most similar items and an undefined similarity for the rest.

Is it possible to do with the Java API? I know it can be done calling the
commands from inside the Java code and I guess that also using
corresponding SparseVectorsFromSequenceFiles, DistributedRowMatrix and
RowItemSimilarityJob. But I still cannot see an easy way of parsing the
output of RowItemSimilarityJob to the memory representation I intend to
use.

Thanks a lot.





Re: Load output of rowsimilarity to memory

2014-02-24 Thread Sebastian Schelter
You're right, my bad. If you don't use RowSimilarityJob directly, but 
org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob 
(which calls RowSimilarityJob under the covers), your output will be a 
textfile that is directly usable with FileItemSimilarity.


--sebastian

On 02/24/2014 09:30 PM, Juan José Ramos wrote:

Thanks for the prompt reply.

RowSimilarityJob produces an output in the form of:
Key: 0: Value: {61112:0.21139380179557016,52144:0.23797846026935565,...}

whereas FileItemSimilarity is expecting a comma or tab separated inputs.

I assume that you meant that the output of RowSimilarityJob can be loaded
by the FileItemSimilarity after doing the appropriate parsing. Is that
correct, or is there actually a way to load the raw output of
RowSimilarityJob into FileItemSimilarity?

Thanks.


On Mon, Feb 24, 2014 at 7:41 PM, Sebastian Schelter s...@apache.org wrote:


The output of RowSimilarityJob can be loaded by the FileItemSimilarity.

--sebastian


On 02/24/2014 08:31 PM, Juan José Ramos wrote:


Is there a way to reproduce this process:
https://cwiki.apache.org/confluence/display/MAHOUT/Quick+tour+of+text+analysis+using+the+Mahout+command+line

inside Java code and not using the command line tool? I am not interested
in the clustering part but in 'Calculate several similar docs to each doc
in the data'. In particular, I am interested in loading the output of the
rowsimilarity tool into memory to be used as my custom ItemSimilarity
implementation for an ItemBasedRecommender.

What I exactly want is to have a matrix in memory where for every doc in
my
catalogue I have the similarity with the 100 (that is the threshold I am
using) most similar items and an undefined similarity for the rest.

Is it possible to do with the Java API? I know it can be done calling the
commands from inside the Java code and I guess that also using
corresponding SparseVectorsFromSequenceFiles, DistributedRowMatrix and
RowItemSimilarityJob. But I still cannot see an easy way of parsing
the
output of RowItemSimilarityJob to the memory representation I intend to
use.

Thanks a lot.










Re: Load output of rowsimilarity to memory

2014-02-24 Thread Sebastian Schelter

I overlooked that you're interested in document similarities. Sry again :)

Another way would be to read the output of RowSimilarityJob with a 
o.a.m.common.iterator.sequencefile.SequenceFileDirIterable


You create a list of instances of 
o.a.m.cf.taste.impl.similarity.GenericItemSimilarity.ItemItemSimilarity


e.g. for the output

Key: 0: Value: {61112:0.21139380179557016,52144:0.23797846026935565,...}

you would do

list.add(new ItemItemSimilarity(0, 61112, 0.21139380179557016));
list.add(new ItemItemSimilarity(0, 52144, 0.23797846026935565));
...

After that you create a GenericItemSimilarity from the list of 
ItemItemSimilarities, which is the in-memory item similarity you asked for.


Hope that helps,
Sebastian
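
Putting the pieces together, a minimal sketch (hypothetical output path; assumes Mahout 0.9's APIs) of loading the RowSimilarityJob output into an in-memory GenericItemSimilarity:

import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.mahout.cf.taste.impl.similarity.GenericItemSimilarity;
import org.apache.mahout.cf.taste.similarity.ItemSimilarity;
import org.apache.mahout.common.Pair;
import org.apache.mahout.common.iterator.sequencefile.PathFilters;
import org.apache.mahout.common.iterator.sequencefile.PathType;
import org.apache.mahout.common.iterator.sequencefile.SequenceFileDirIterable;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;

Configuration conf = new Configuration();
List<GenericItemSimilarity.ItemItemSimilarity> similarities =
    new ArrayList<GenericItemSimilarity.ItemItemSimilarity>();
for (Pair<IntWritable, VectorWritable> row :
    new SequenceFileDirIterable<IntWritable, VectorWritable>(
        new Path("out/similarityMatrix"), PathType.LIST, PathFilters.partFilter(), conf)) {
    long docId = row.getFirst().get();
    for (Vector.Element elem : row.getSecond().get().nonZeroes()) {
        // one ItemItemSimilarity per (doc, similar doc, similarity) triple
        similarities.add(new GenericItemSimilarity.ItemItemSimilarity(
            docId, elem.index(), elem.get()));
    }
}
ItemSimilarity inMemory = new GenericItemSimilarity(similarities);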


On 02/24/2014 10:04 PM, Juan José Ramos wrote:

Correct me if I'm wrong, but isn't the ItemSimilarityJob meant to be for
item-based CF? In particular, in the documentation I can read that:
Preferences in the input file should look like
userID,itemID[,preferencevalue]

And in my case the input I have is just text documents and I want to
pre-compute similarities between them beforehand, even before any user has
expressed any preference value for any item.

In order to use ItemSimilarityJob for this purpose, what should be the
input I need to provide? Would it be the output of seq2sparse?

Thanks again.


On Mon, Feb 24, 2014 at 8:54 PM, Sebastian Schelter s...@apache.org wrote:


You're right, my bad. If you don't use RowSimilarityJob directly, but
org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob
(which calls RowSimilarityJob under the covers), your output will be a
textfile that is directly usable with FileItemSimilarity.

--sebastian


On 02/24/2014 09:30 PM, Juan José Ramos wrote:


Thanks for the prompt reply.

RowSimilarityJob produces an output in the form of:
Key: 0: Value: {61112:0.21139380179557016,52144:0.23797846026935565,...}

whereas FileItemSimilarity is expecting a comma or tab separated inputs.

I assume that you meant that the output of RowSimilarityJob can be loaded
by the FileItemSimilarity after doing the appropriate parsing. Is that
correct, or is there actually a way to load the raw output of
RowSimilarityJob into FileItemSimilarity?

Thanks.


On Mon, Feb 24, 2014 at 7:41 PM, Sebastian Schelter s...@apache.org
wrote:

  The output of RowSimilarityJob can be loaded by the FileItemSimilarity.


--sebastian


On 02/24/2014 08:31 PM, Juan José Ramos wrote:

  Is there a way to reproduce this process:

https://cwiki.apache.org/confluence/display/MAHOUT/Quick+tour+of+text+analysis+using+the+Mahout+command+line

inside Java code and not using the command line tool? I am not
interested
in the clustering part but in 'Calculate several similar docs to each
doc
in the data'. In particular, I am interested in loading the output of
the
rowsimilarity tool into memory to be used as my custom ItemSimilarity
implementation for an ItemBasedRecommender.

What I exactly want is to have a matrix in memory where for every doc in
my
catalogue I have the similarity with the 100 (that is the threshold I am
using) most similar items and an undefined similarity for the rest.

Is it possible to do with the Java API? I know it can be done calling
the
commands from inside the Java code and I guess that also using
corresponding SparseVectorsFromSequenceFiles, DistributedRowMatrix and
RowItemSimilarityJob. But I still cannot see an easy way of parsing
the
output of RowItemSimilarityJob to the memory representation I intend to
use.

Thanks a lot.















Re: Mahout on Spark?

2014-02-19 Thread Sebastian Schelter

Completely agree with Sean's statement.

On 02/19/2014 01:52 PM, Sean Owen wrote:

To set expectations appropriately, I think it's important to point out
this is completely infeasible short of a total rewrite, and I can't
imagine that will happen. It may not be obvious if you haven't looked
at the code how completely dependent on M/R it is.

You can swap out M/R and Spark if you write in terms of something like
Crunch, but that is not at all the case here.

On Wed, Feb 19, 2014 at 12:43 PM, Jay Vyas jayunit...@gmail.com wrote:

+100 for this, different execution engines, like the direction  pig and crunch 
take

Sent from my iPhone


On Feb 19, 2014, at 5:19 AM, Gokhan Capan gkhn...@gmail.com wrote:

I imagine in Mahout offering an option to the users to select from
different execution engines (just like we currently do by giving M/R or
sequential options), and starting from Spark. I am not sure what changes
needed in the codebase, though. Maybe following MLI (or alike) and
implementing some more stuff, such as common interfaces for iterating over
data (the M/R way and the Spark way).

IMO, another effort might be porting pre-online machine learning (such
transforming text into vector based on the dictionary generated by
seq2sparse before), machine learning based on mini-batches, and streaming
summarization stuff in Mahout to Spark-Streaming.

Best,
Gokhan

On Wed, Feb 19, 2014 at 10:45 AM, Dmitriy Lyubimov dlie...@gmail.comwrote:


PS I am moving along a cost optimizer for Spark-backed DRMs on some
multiplicative pipelines that is capable of figuring out different cost-based
rewrites, and an R-like DSL that mixes in-core and distributed matrix
representations and blocks, but it is painfully slow; I am really only doing it
a couple of nights a month. It does not look like I will be doing it on
company time any time soon (and even if I did, the company doesn't seem to
be inclined to contribute anything new I do on their time). It is
all painfully slow; there's no direct funding for it anywhere with no
strings attached. That will probably be the primary reason why Mahout would not
be able to get much traction compared to university-based contributions.


On Wed, Feb 19, 2014 at 12:27 AM, Dmitriy Lyubimov dlie...@gmail.com

wrote:



Unfortunately, methinks the prospects of something like a Mahout/MLlib merge
seem very unlikely due to vastly diverged approaches to the basics of linear
algebra (and other things). Just like one cannot grow a single tree out of
two trunks -- not easily, anyway.

It is fairly easy to port (and subsequently beat) MLlib at this point from
a collection-of-algorithms point of view. But IMO the goal should be more
MLI-like first, and a port second. And be very careful with concepts.
Something that I so far don't see happening with MLlib. MLlib seems to be
an old-style, Mahout-like rush to become a collection of basic algorithms
rather than a coherent foundation. Admittedly, I haven't looked very
closely.



On Tue, Feb 18, 2014 at 11:41 PM, Sebastian Schelter s...@apache.org
wrote:


I'm also convinced that Spark is a superior platform for executing
distributed ML algorithms. We've had a discussion about a change from
Hadoop to another platform some time ago, but at that point in time it

was

not clear which of the upcoming dataflow processing systems (Spark,
Hyracks, Stratosphere) would establish itself amongst the users. To me

it

seems pretty obvious that Spark made the race.

I concur with Ted, it would be great to have the communities work
together. I know that at least 4 mahout committers (including me) are
already following Spark's mailinglist and actively participating in the
discussions.

What are the ideas for what a fruitful cooperation could look like?

Best,
Sebastian

PS:

I ported LLR-based cooccurrence analysis (aka item-based recommendation)
to Spark some time ago, but I haven't had time to test my code on a

large

dataset yet. I'd be happy to see someone help with that.







On 02/19/2014 08:04 AM, Nick Pentreath wrote:

I know the Spark/Mllib devs can occasionally be quite set in ways of
doing certain things, but we'd welcome as many Mahout devs as possible

to

work together.


It may be too late, but perhaps a GSoC project to look at a port of

some

stuff like co occurrence recommender and streaming k-means?




N
--
Sent from Mailbox for iPhone

On Wed, Feb 19, 2014 at 3:02 AM, Ted Dunning ted.dunn...@gmail.com
wrote:

On Tue, Feb 18, 2014 at 1:58 PM, Nick Pentreath 

nick.pentre...@gmail.comwrote:


My (admittedly heavily biased) view is Spark is a superior platform
overall
for ML. If the two communities can work together to leverage the
strengths
of Spark, and the large amount of good stuff in Mahout (as well as

the

fantastic depth of experience of Mahout devs) I think a lot can be
achieved!

It makes a lot of sense that Spark would be better than Hadoop for

ML

purposes given that Hadoop was intended to do web-crawl kinds of

things

and
Spark was intentionally built to support machine

Re: Mahout on Spark?

2014-02-18 Thread Sebastian Schelter
I'm also convinced that Spark is a superior platform for executing 
distributed ML algorithms. We've had a discussion about a change from 
Hadoop to another platform some time ago, but at that point in time it 
was not clear which of the upcoming dataflow processing systems (Spark, 
Hyracks, Stratosphere) would establish itself amongst the users. To me 
it seems pretty obvious that Spark made the race.


I concur with Ted, it would be great to have the communities work 
together. I know that at least 4 mahout committers (including me) are 
already following Spark's mailinglist and actively participating in the 
discussions.


What are the ideas for what a fruitful cooperation could look like?

Best,
Sebastian

PS:

I ported LLR-based cooccurrence analysis (aka item-based recommendation) 
to Spark some time ago, but I haven't had time to test my code on a 
large dataset yet. I'd be happy to see someone help with that.






On 02/19/2014 08:04 AM, Nick Pentreath wrote:

I know the Spark/Mllib devs can occasionally be quite set in ways of doing 
certain things, but we'd welcome as many Mahout devs as possible to work 
together.


It may be too late, but perhaps a GSoC project to look at a port of some stuff 
like co occurrence recommender and streaming k-means?




N
—
Sent from Mailbox for iPhone

On Wed, Feb 19, 2014 at 3:02 AM, Ted Dunning ted.dunn...@gmail.com
wrote:


On Tue, Feb 18, 2014 at 1:58 PM, Nick Pentreath nick.pentre...@gmail.comwrote:

My (admittedly heavily biased) view is Spark is a superior platform overall
for ML. If the two communities can work together to leverage the strengths
of Spark, and the large amount of good stuff in Mahout (as well as the
fantastic depth of experience of Mahout devs) I think a lot can be
achieved!


It makes a lot of sense that Spark would be better than Hadoop for ML
purposes given that Hadoop was intended to do web-crawl kinds of things and
Spark was intentionally built to support machine learning.
Given that Spark has been announced by a majority of the Hadoop-based
distribution vendors, it makes sense that maybe Mahout should jump in.
I really would prefer it if the two communities (MLib/MLI and Mahout) could
work more closely together.  There is a lot of good to be had on both sides.




Re: get similar items

2014-02-12 Thread Sebastian Schelter

Hi,

Mahout's recommenders are based on analyzing interactions between users 
and items/movies, e.g. ratings or counts of how often the movie was watched.
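
For completeness, once such interaction data exists, a minimal sketch (hypothetical file) of asking an item-based recommender for the most similar movies:

import java.io.File;
import java.util.List;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.recommender.GenericItemBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.LogLikelihoodSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;

// userID,movieID[,rating] triples collected from watch/rating events
DataModel model = new FileDataModel(new File("interactions.csv"));
GenericItemBasedRecommender recommender =
    new GenericItemBasedRecommender(model, new LogLikelihoodSimilarity(model));
List<RecommendedItem> mostSimilar = recommender.mostSimilarItems(42L, 10);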



On 02/12/2014 11:34 AM, N! wrote:

Hi all:
  Does anyone have any suggestions for the questions below?


  thanks a lot.


-- Original --
Sender: N!12481...@qq.com;
Send time: Wednesday, Feb 12, 2014 6:17 PM
To: useruser@mahout.apache.org;

Subject: Re: get similar items



Hi Sean:
 Thanks for the reply.
Assume I have only one table named 'movie' with 1000+ records; this table
has three columns: 'id', 'movieName', 'movieDescription'.
Can Mahout calculate the most similar movies for a movie (based on only
the 'movie' table)?
Code like: List mostSimilarMovieList = recommender.mostSimilar(int movieId).
If not, do you have any suggestions for this scenario?





Re: Mahout algorithms

2014-02-05 Thread Sebastian Schelter
That is outdated unfortunately. I will send a list of current algorithms 
shortly.


--sebastian


On 02/05/2014 11:13 AM, Chameera Wijebandara wrote:

Hi Sergey,

This will help.

https://cwiki.apache.org/confluence/display/MAHOUT/Algorithms

Thanks,
 Chameera


On Wed, Feb 5, 2014 at 3:30 PM, Sergey Svinarchuk 
ssvinarc...@hortonworks.com wrote:


Hi,

Where can I see all algorithms which include mahout 0.9 and documentation
for this algorithm?

Thanks,
Sergey!








Re: Mahout algorithms

2014-02-05 Thread Sebastian Schelter

Hi Sergey,

here is the list of algorithms. We're currently in the process of 
reworking our wiki; that's why the documentation is unfortunately 
incorrect at the moment. I've added a ticket for this:

https://issues.apache.org/jira/browse/MAHOUT-1413

Here's the current list of algorithms in Mahout 0.9

Recommenders (non-distributed):

 - user-based collaborative filtering
 - item-based collaborative filtering
 - latent-factor models (SGD, SVD++, ALS)

Recommender (distributed):

 - item-based collaborative filtering
 - latent-factor models (ALS)

Classification (non-distributed):

 - logistic regression solved with SGD
 - Multilayer Perceptron
 - Hidden Markov Models

Classification (distributed):

 - Naive Bayes
 - Random Forests

Clustering (distributed)

 - Canopy
 - k-Means
 - streaming k-Means
 - fuzzy k-Means
 - spectral k-Means

Topic Models (distributed)

 - Latent Dirichlet Allocation

Frequent Pattern Mining (distributed)

Math (distributed)

 - SVD using the Lanczos algorithm
 - Stochastic SVD

Hope that helps.

Best,
Sebastian



On 02/05/2014 11:00 AM, Sergey Svinarchuk wrote:

Hi,

Where can I see all algorithms which include mahout 0.9 and documentation
for this algorithm?

Thanks,
Sergey!





Re: SGD classifier demo app

2014-02-04 Thread Sebastian Schelter

Would be great to add this as an example to Mahout's codebase.

On 02/04/2014 10:27 AM, Ted Dunning wrote:

Frank,

I just munched on your code and sent a pull request.

In doing this, I made a bunch of changes.  Hope you liked them.

These include a massive simplification of the reading and vectorization.
This wasn't strictly necessary, but it seemed like a good idea.

More important was the way that I changed the vectorization.  For the
continuous values, I added log transforms.  For the categorical values, I
encoded as they are.  I also increased the feature vector size to 100 to
avoid excessive collisions.

In the learning code itself, I got rid of the use of index arrays in favor
of shuffling the training data itself.  I also tuned the learning
parameters a lot.

The result is that the AUC is just a tiny bit less than 0.9,
which is pretty close to what I got in R.

For everybody else, see
https://github.com/tdunning/mahout-sgd-bank-marketing for my version and
https://github.com/tdunning/mahout-sgd-bank-marketing/compare/frankscholten:master...master
for my pull request.



On Mon, Feb 3, 2014 at 3:57 PM, Ted Dunning ted.dunn...@gmail.com wrote:



Johannes,

Very good comments.

Frank,

As a benchmark, I just spent a few minutes building a logistic regression
model using R.  For this model AUC on 10% held-out data is about 0.9.

Here is a gist summarizing the results:

https://gist.github.com/tdunning/8794734




On Mon, Feb 3, 2014 at 2:41 PM, Johannes Schulte 
johannes.schu...@gmail.com wrote:


Hi Frank,

you are using the feature vector encoders, which hash a combination of
feature name and feature value to 2 (default) locations in the vector. The
vector size you configured is 11, and this is IMO very small relative to the
possible combinations of values you have for your data (education, marital,
campaign). You can do no harm by using a much bigger cardinality (try 1000).

Second, you are using a continuous value encoder, passing in the weight you
are using as a string (e.g. variable pDays). I am not quite sure about the
reasons in the Mahout code right now, but the way it is implemented now,
every unique value should end up in a different location because the
continuous value is part of the hashing. Try adding the weight directly
using a static word value encoder: addToVector("pDays", pDays, v)

Last, you are also putting in the variable campaign as a continuous
variable, which should probably be a categorical variable, so just add it
with a StaticWordValueEncoder.

And finally, and probably most important after looking at your target
variable: you are using a Dictionary for mapping either yes or no to 0 or 1.
This is bad. Depending on what comes first in the data set, either a
positive or negative example might be 0 or 1, totally at random. Make a hard
mapping from the possible values (y/n?) to zero and one, with yes as 1
and no as 0.
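
A minimal sketch of that advice (hypothetical values; assumes the org.apache.mahout.vectorizer.encoders API):

import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.vectorizer.encoders.StaticWordValueEncoder;

Vector v = new RandomAccessSparseVector(1000);   // much larger cardinality than 11

// categorical variable: hash the value itself
StaticWordValueEncoder campaignEncoder = new StaticWordValueEncoder("campaign");
campaignEncoder.addToVector("2", v);

// continuous variable: fixed location via the feature name, value passed as the weight
StaticWordValueEncoder pdaysEncoder = new StaticWordValueEncoder("pdays");
pdaysEncoder.addToVector("pdays", 42.0, v);

// hard target mapping instead of a Dictionary
String answer = "yes";                           // hypothetical raw target value
int target = "yes".equals(answer) ? 1 : 0;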





On Mon, Feb 3, 2014 at 9:33 PM, Frank Scholten fr...@frankscholten.nl

wrote:



Hi all,

I am exploring Mahout's SGD classifier and like some feedback because I
think I didn't properly configure things.

I created an example app that trains an SGD classifier on the 'bank
marketing' dataset from UCI:
http://archive.ics.uci.edu/ml/datasets/Bank+Marketing

My app is at:

https://github.com/frankscholten/mahout-sgd-bank-marketing


The app reads a CSV file of telephone calls, encodes the features into a
vector and tries to predict whether a customer answers yes to a business
proposal.

I do a few runs and measure accuracy, but I don't trust the results.
When I only use an intercept term as a feature I get around 88% accuracy,
and when I add all features it drops to around 85%. Is this perhaps because
the dataset is highly unbalanced? Most customers answer no. Or is the
classifier biased to predict 0 as the target code when it doesn't have any
data to go with?

Any other comments about my code or improvements I can make in the app

are

welcome! :)

Cheers,

Frank












Re: Mahout 0.9 Release

2014-02-02 Thread Sebastian Schelter

Hi Suneel,

Thats great news, thank you for driving this release!

On 02/02/2014 10:22 PM, Suneel Marthi wrote:

Mahout 0.9 has been pushed to the mirrors and is available for download at 
http://www.apache.org/dyn/closer.cgi/mahout/




On Friday, January 31, 2014 11:21 PM, Suneel Marthi suneel_mar...@yahoo.com 
wrote:

The release has passed with the required votes from PMC, will be pushing 0.9 to 
the mirrors and updating the release notes over the next day or two.




On Thursday, January 30, 2014 2:16 AM, Stevo Slavić ssla...@gmail.com wrote:

+1



On Wed, Jan 29, 2014 at 10:56 PM, Shannon Quinn squ...@gatech.edu wrote:


LGTM


On 1/29/14, 4:27 PM, peng wrote:


+1, can't see a bad side.

On Wed 29 Jan 2014 11:33:02 AM EST, Suneel Marthi wrote:


+1 from me





On Wednesday, January 29, 2014 8:58 AM, Sebastian Schelter 
s...@apache.org wrote:

+1


On 01/29/2014 05:25 AM, Andrew Musselman wrote:


Looks good.

+1


On Tue, Jan 28, 2014 at 8:07 PM, Andrew Palumbo ap@outlook.com
wrote:

   a), b), c), d) all passed here.


CosineDistance of clustered points from cluster-reuters.sh -1 kmeans
were
within the range [0,1].

   Date: Tue, 28 Jan 2014 16:45:42 -0800

From: suneel_mar...@yahoo.com
Subject: Mahout 0.9 Release
To: user@mahout.apache.org; d...@mahout.apache.org

Fixed the issues that were reported with Clustering code this past
week,


upgraded codebase to Lucene 4.6.1 that was released today.



Here's the URL for the 0.9 release in staging:-

   https://repository.apache.org/content/repositories/

orgapachemahout-1004/org/apache/mahout/mahout-distribution/0.9/



The artifacts have been signed with the following key:
https://people.apache.org/keys/committer/smarthi.asc

Please:-
a) Verify that you can unpack the release (tar or zip)
b) Verify you are able to compile the distro
c) Run through the unit tests: mvn clean test
d) Run the example scripts under $MAHOUT_HOME/examples/bin. Please run


through all the different options in each script.



Need a minimum of 3 '+1' votes from PMC for the release to be
finalized.













Re: generic latent variable recommender question

2014-01-24 Thread Sebastian Schelter

Case 1 is fine as is.

For Case 2 I would suggest to simply experiment, try different 
similarity measures like euclidean distance or cosine and see what gives 
the best results.


--sebastian
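
A minimal sketch of both cases with Mahout's math API (hypothetical factor vectors):

import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.Vector;

// hypothetical latent feature vectors from ALS/SVD
Vector user  = new DenseVector(new double[] {0.3, -1.2, 0.8});
Vector itemA = new DenseVector(new double[] {0.1,  0.4, 1.5});
Vector itemB = new DenseVector(new double[] {0.9, -0.7, 0.2});

// Case 1: rank unseen items by the plain dot product with the user vector
double score = user.dot(itemA);

// Case 2: cosine similarity between two items (the angle, i.e. the
// dot product of the normalized vectors)
double cosine = itemA.dot(itemB) / (itemA.norm(2) * itemB.norm(2));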

On 01/25/2014 04:08 AM, Koobas wrote:

A generic latent variable recommender question.
I passed the user-item matrix through a low rank approximation,
with either something like ALS or SVD, and now I have the feature
vectors for all users and all items.

Case 1:
I want to recommend items to a user.
I compute a dot product of the user’s feature vector with all feature
vectors of all the items.
I eliminate the ones that the user already has, and find the largest value
among the others, right?

Case 2:
I want to find similar items for an item.
Should I compute dot product of the item’s feature vector against feature
vectors of all the other items?
OR
Should I compute the ANGLE between each par of feature vectors?
I.e., compute the cosine similarity?
I.e., normalize the vectors before computing the dot products?

If “yes” for case 2, is that something I should also do for case 1?





Re: Pig local mode issue

2014-01-22 Thread Sebastian Schelter
I think this question is better suited for the mailinglist of the pig 
project.


On 01/23/2014 01:24 AM, Sameer Tilak wrote:

Hi All,

My script runs fine in map reduce mode, but I get the following error when I run it in local mode. I have made sure that the input file exists. I am not sure why map reduce is coming into the picture when it is local mode.

pig -x local myscript.pig

2014-01-22 16:14:02,771 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler - File concatenation threshold: 100 optimistic? false
2014-01-22 16:14:02,805 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size before optimization: 3
2014-01-22 16:14:02,806 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - Merged 2 map-only splittees.
2014-01-22 16:14:02,806 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - Merged 2 out of total 3 MR operators.
2014-01-22 16:14:02,806 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size after optimization: 1
2014-01-22 16:14:02,845 [main] INFO  org.apache.pig.tools.pigstats.ScriptState - Pig script settings are added to the job
2014-01-22 16:14:02,865 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - mapred.job.reduce.markreset.buffer.percent is not set, set to default 0.3
2014-01-22 16:14:02,876 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Using reducer estimator: org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.InputSizeReducerEstimator
2014-01-22 16:14:02,878 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.InputSizeReducerEstimator - BytesPerReducer=10 maxReducers=999 totalInputFileSize=9940865
2014-01-22 16:14:02,878 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Setting Parallelism to 1
2014-01-22 16:14:02,909 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Setting up multi store job
2014-01-22 16:14:02,918 [main] INFO  org.apache.pig.data.SchemaTupleFrontend - Key [pig.schematuple] is false, will not generate code.
2014-01-22 16:14:02,918 [main] INFO  org.apache.pig.data.SchemaTupleFrontend - Starting process to move generated code to distributed cache
2014-01-22 16:14:02,918 [main] INFO  org.apache.pig.data.SchemaTupleFrontend - Distributed cache not supported or needed in local mode. Setting key [pig.schematuple.local.dir] with code temp directory: /tmp/1390436042918-0
2014-01-22 16:14:02,978 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 1 map-reduce job(s) waiting for submission.
2014-01-22 16:14:02,991 [JobControl] INFO  org.apache.hadoop.util.NativeCodeLoader - Loaded the native-hadoop library
2014-01-22 16:14:02,994 [JobControl] ERROR org.apache.hadoop.security.UserGroupInformation - PriviledgedActionException as:userid cause:ENOENT: No such file or directory
2014-01-22 16:14:03,479 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 0% complete
2014-01-22 16:14:03,489 [main] WARN  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Ooops! Some job has failed! Specify -stop_on_failure if you want Pig to stop immediately on failure.
2014-01-22 16:14:03,489 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - job null has failed! Stop running all dependent jobs
2014-01-22 16:14:03,490 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete
2014-01-22 16:14:03,492 [main] ERROR org.apache.pig.tools.pigstats.SimplePigStats - ERROR 2997: Unable to recreate exception from backend error: ENOENT: No such file or directory
at org.apache.hadoop.io.nativeio.NativeIO.chmod(Native Method)
at org.apache.hadoop.fs.FileUtil.execSetPermission(FileUtil.java:699)
at org.apache.hadoop.fs.FileUtil.setPermission(FileUtil.java:654)
at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:509)
at org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:344)
at org.apache.hadoop.fs.FilterFileSystem.mkdirs(FilterFileSystem.java:189)
at org.apache.hadoop.mapreduce.JobSubmissionFiles.getStagingDir(JobSubmissionFiles.java:116)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:856)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:850)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
at ...

Re: Problem with ItemSimilarityJob, empty part-r-00000

2014-01-21 Thread Sebastian Schelter

Hi Quentin,

Have you checked the log to ensure that you don't get any exceptions 
during the computation?


Could you test the job with a tiny example where you can calculate the 
result by hand?


Can you share an input file on which this job fails?

--sebastian
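
For instance, a hand-checkable toy input (hypothetical values, same feature_id,track_id,feature_value layout, no header row) could be:

1,100,0.5
1,101,0.5
2,100,1.0
2,101,2.0

run through the same command as above:

mahout itemsimilarity --input toy.txt --output toy-out --similarityClassname SIMILARITY_EUCLIDEAN_DISTANCE --booleanData false --maxSimilaritiesPerItem 1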

On 01/21/2014 11:22 AM, Quentin-Gabriel Thurier wrote:

I have run into a few troubles with Mahout that I can't sort out.

The context is that I'm trying to calculate pairwise euclidean distances
between music tracks based on 6 audio features per track. My input for the
mahout job is a text file which looks like this:

feature_id,track_id,feature_value
integer, integer,double

This command works locally for less than 600 tracks (based on
mahout-core-0.7-cdh4.5.0-job.jar):

mahout itemsimilarity --input input/msd_sample/mahout --output
output/mahout --similarityClassname
SIMILARITY_EUCLIDEAN_DISTANCE --booleanData false --maxSimilaritiesPerItem 1

But for more tracks I get an empty file part-r-00000. I tried to decrease
the --threshold parameter but I still don't have any result.

I also tried to launch the job on aws EMR with the equivalent input for
3000 tracks (based on mahout-core-0.8-job.jar):

org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob --input
s3n://hadoop-filrouge/input/msd-sample/mahout --output
s3n://hadoop-filrouge/output/mahout/01202014-itemsimilarity
--similarityClassname SIMILARITY_EUCLIDEAN_DISTANCE --booleanData false
--maxSimilaritiesPerItem 1

The job runs successfully, but I get 17 empty part-r-000xx files.

I'm totally stuck right now and I'm running out of ideas to fix this issue.
So if anybody has even a little idea of what is going on, that could
really help.

Many thanks,




