Re: Is Mahout obsolete now?

2015-10-19 Thread Sean Owen
No, this is pretty wrong. Spark is not, in general, a real-time
anything. Spark Streaming is a near-real-time streaming framework, but
it is not something you can build models with. Spark MLlib / ML are
offline / batch. Not sure what you mean by Hadoop engine, but Spark
does not build on MapReduce, if that's what you mean.

The "classic" Mahout code (<= 0.9) is definitely deprecated. The "new"
Mahout is not. It has a fairly different new recommender system called
Samsara. It has Scala APIs. In fact, it uses Spark. I think you're
somehow talking about the "classic" Mahout code here only.

On Mon, Oct 19, 2015 at 2:38 PM, Fei Shan  wrote:
> Spark is an in-memory, near-real-time machine learning framework with
> Scala and Java interfaces.
> Mahout is an offline machine learning framework with no Scala APIs.
>
> They are both built on HDFS and the Hadoop engine.
>
> Spark has an ecosystem like Hadoop's.
> Mahout is part of the Hadoop ecosystem.
>
> Spark can beat Mahout on processing speed
> and conciseness of programming APIs.
>
> For online data analysis, Spark is the better choice.
> For offline data analysis, both fit well.
>
>
>
> On Mon, Oct 19, 2015 at 9:14 PM, Prasad Priyadarshana Fernando <
> bpp...@gmail.com> wrote:
>
>> Hi,
>>
>> If I have used Mahout for my recommendation application, should I migrate
>> to Spark MLlib? Is Mahout still supported and maintained?
>>
>> Thanks
>>
>> *Prasad Priyadarshana Fernando*
>> Mobile: +1 330 283 5827
>>


Re: Negative preferences

2014-08-15 Thread Sean Owen
I have used thumbs-down-like interactions as a kind of anti-click that
subtracts from the interaction between the user and item. The negative
scores can be naturally applied in a matrix-factorization-like model
like ALS, but that's not the situation here.

Others probably have better first-hand experience here, but yes, I have
heard of building recommendations from the negative actions as well and
ranking results by the difference between the positive and negative
predicted ratings. That is, subtract out the scores from the negative recs.
Filtering is a cruder but more efficient version of this.
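
To make "subtract out the scores from the negative recs" concrete, here is a
minimal Java sketch against the Taste types. It assumes you have already
produced two candidate lists from separately trained positive and negative
models; positiveRecs and negativeRecs are hypothetical names, not Mahout API.

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;
    import org.apache.mahout.cf.taste.recommender.RecommendedItem;

    public final class NetScores {
      // Net score per item: positive predicted score minus negative predicted score.
      public static Map<Long, Double> combine(List<RecommendedItem> positiveRecs,
                                              List<RecommendedItem> negativeRecs) {
        Map<Long, Double> net = new HashMap<Long, Double>();
        for (RecommendedItem rec : positiveRecs) {
          net.put(rec.getItemID(), (double) rec.getValue());
        }
        for (RecommendedItem rec : negativeRecs) {
          Double current = net.get(rec.getItemID());
          double base = (current == null) ? 0.0 : current;
          net.put(rec.getItemID(), base - rec.getValue());
        }
        return net; // rank items by descending net score
      }
    }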

On Thu, Aug 14, 2014 at 6:22 PM, Pat Ferrel  wrote:
> Now that we have multi-action/cross-cooccurrences in ItemSimilarity we can 
> start playing with taking in multiple actions to recommend one. On the demo 
> site I have data for thumbs up and down but have only been using thumbs up as 
> the primary action. I then filter recs by a user’s thumbs down interactions. 
> However there are now some new options.
>
> 1) Might it be better to use the thumbs down as a second action type? 
> Basically this would imply that a user’s dislike of certain items may be an 
> indicator of their liking others? Since we are using Solr to return recs we’d 
> just use a two field query so no need to combine recs.
>
> 2) Get completely independent thumbs-down recs and filter by those instead of 
> only the thumbs-down interactions? Probably a pretty tight threshold or 
> number of items recommended would be good here to protect against false 
> negatives.
>
> The data is there and the demo site is pretty easy to experiment with. I’m 
> integrating spark-itemsimilarity now so if anyone has a good idea of how to 
> better use the data, speak up. It seems like 1 and 2 could be used together 
> so I’ll probably create some setting that allows a user to experiment on 
> their own recs.


Re: ALS, weighed vs. non-weighed regularization paper

2014-06-16 Thread Sean Owen
Yeah I've turned that over in my head. I am not sure I have a great
answer. But I interpret the net effect to be that the model prefers
simple explanations for active users, at the cost of more error in the
approximation. One would rather pick a basis that more naturally
explains the data observed in active users. I think I can see that
this could be a useful assumption -- these users are less extremely
sparse.
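
For reference, the weighted-lambda regularization from Zhou et al. being
discussed below is, as I read the paper,

  \min_{U,V} \sum_{(i,j) \in O} (r_{ij} - u_i^\top v_j)^2
    + \lambda \Big( \sum_i n_{u_i} \lVert u_i \rVert^2
    + \sum_j n_{v_j} \lVert v_j \rVert^2 \Big)

where O is the set of observed interactions, n_{u_i} is the number of
observations for user i, and n_{v_j} the number for item j. Dropping the
n terms (setting them to 1) recovers the plain regularization used in the
other paper.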


On Mon, Jun 16, 2014 at 8:50 PM, Dmitriy Lyubimov  wrote:
> Probably a question for Sebastian.
>
> As we know, the two papers (Hu-Koren-Volinsky and Zhou et al.) use slightly
> different loss functions.
>
> Zhou et al. are fairly unique in that they additionally multiply the norms
> of the U, V vectors by the number of observed interactions.
>
> The paper doesn't explain why it works, except saying something along the
> lines of "we tried several regularization matrices, and this one worked
> better in our case".
>
> I tried to figure out why that is, and am still not sure why it would be
> better. So basically we say that, by giving smaller observation sets smaller
> regularization values, it is OK for smaller observation sets to overfit
> slightly more than larger observation sets.
>
> This seems counterintuitive. Intuition tells us that smaller sets would
> actually tend to overfit more, not less, and therefore might need a larger
> regularization rate, not a smaller one. Sebastian, what's your take on
> weighting regularization in ALS-WR?
>
> thanks.
> -d


Re: Does Mahout handle missing values in train and test data, for Decision Forest?

2014-04-22 Thread Sean Owen
From looking at the code recently, no, it is not handled.

On Tue, Apr 22, 2014 at 1:27 PM, Himanshu  wrote:
> In Weka it is possible to mark a field with a question mark "?" for unknown
> values, and these are handled. Is there a similar way to mark
> "unknown"/"missing" field values in Mahout training and test data as well?
>
> Appreciate any suggestions/pointers. Breiman talks about two ways to handle
> missing values.
>
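
Since the forest code won't handle missing values itself, one workaround is
to impute them before writing the training and test files. A rough sketch,
not a Mahout API -- the file name "data.csv" and the all-numeric column
layout are assumptions -- replacing Weka-style "?" markers with the column
mean:

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    public final class ImputeMean {
      public static void main(String[] args) throws IOException {
        List<String[]> rows = new ArrayList<String[]>();
        BufferedReader in = new BufferedReader(new FileReader("data.csv"));
        for (String line; (line = in.readLine()) != null; ) {
          rows.add(line.split(","));
        }
        in.close();
        int cols = rows.get(0).length;
        double[] sum = new double[cols];
        int[] count = new int[cols];
        // First pass: per-column mean over the non-missing values.
        for (String[] row : rows) {
          for (int c = 0; c < cols; c++) {
            if (!"?".equals(row[c])) {
              sum[c] += Double.parseDouble(row[c]);
              count[c]++;
            }
          }
        }
        // Second pass: substitute the column mean wherever "?" appears.
        for (String[] row : rows) {
          for (int c = 0; c < cols; c++) {
            if ("?".equals(row[c])) {
              row[c] = String.valueOf(sum[c] / count[c]);
            }
          }
        }
        // Write rows back out before handing the data to the trainer.
      }
    }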


RE: Mahout v0.9 is not working with 2.2.0-cdh5.0.0-beta-1

2014-04-01 Thread Sean Owen
> oracle@bpdevdmsdbs01:
> /ora/db002/stg001/BDMSL1D/hadoop/nem-dms/devices/mahout/mahout-distribution-0.9
> -
>
>
> Thanks and Regards,
> Truong Phan
>
>
> P+ 61 2 8576 5771
> M   + 61 4 1463 7424
> Etroung.p...@team.telstra.com
> W  www.telstra.com
>
>
>
> -Original Message-
> From: Sean Owen [mailto:sro...@gmail.com]
> Sent: Wednesday, 2 April 2014 4:05 PM
> To: Mahout User List
> Subject: Re: Mahout v0.9 is not working with 2.2.0-cdh5.0.0-beta-1
>
> Hm, OK something sounds wrong with your directory structure, given the
> warnings. I assumed this was changed. It could be that the .tar.gz
> distribution isn't quite correctly set up for building from source.
>
> The compilation here is nothing to do with Hadoop. You show a successful
> build; what's the part that fails?
>
> On Wed, Apr 2, 2014 at 6:59 AM, Phan, Truong Q <
> troung.p...@team.telstra.com> wrote:
> > Where did I modify the build?
> > Here are my steps of the build.
> > I got the source from one of the official mirror websites and built it.
> > The only exception here is that I am using Cloudera CDH 5.0.
> > This latest CDH v5.0 might not work with Mahout v0.9.
> >
> > ++
> > Install Mahout
> >
> > $  javac -version
> > javac 1.6.0_32
> >
> > $ cd /ora/db002/stg001/BDMSL1D/hadoop/nem-dms/devices/mahout
> >
> > $ wget
> > http://mirror.mel.bkb.net.au/pub/apache/mahout/0.9/mahout-distribution
> > -0.9-src.tar.gz
> >
> > $ mv mahout-distribution-0.9 mahout-distribution-0.9.old
> >
> > $ tar xvf mahout-distribution-0.9-src.tar.gz
> >
> > $ cd mahout-distribution-0.9
> >
> > $ ls
> > bin  buildtools  core  distribution  examples  integration
> > LICENSE.txt  math  math-scala  NOTICE.txt  pom.xml  README.txt  src
> >
> > $ mvn clean install -Dhadoop2.version=2.2.0-cdh5.0.0-beta-1 -DskipTests=true
> > [INFO] Scanning for projects...
> >
> > [INFO] ------------------------------------------------------------------------
> > [INFO] Reactor Summary:
> > [INFO]
> > [INFO] Mahout Build Tools  SUCCESS [  7.235 s]
> > [INFO] Apache Mahout . SUCCESS [  1.017 s]
> > [INFO] Mahout Math ... SUCCESS [15:46 min]
> > [INFO] Mahout Core ... SUCCESS [24:29 min]
> > [INFO] Mahout Integration  SUCCESS [03:38 min]
> > [INFO] Mahout Examples ... SUCCESS [02:40 min]
> > [INFO] Mahout Release Package  SUCCESS [  0.075 s]
> > [INFO] Mahout Math/Scala wrappers  SUCCESS [01:12 min]
> > [INFO] ------------------------------------------------------------------------
> > [INFO] BUILD SUCCESS
> > [INFO] ------------------------------------------------------------------------
> > [INFO] Total time: 47:57 min
> > [INFO] Finished at: 2014-04-02T15:07:50+10:00
> > [INFO] Final Memory: 49M/288M
> > [INFO] ------------------------------------------------------------------------
> >
> >
> > Thanks and Regards,
> > Truong Phan
> >
> >
> > P+ 61 2 8576 5771
> > M   + 61 4 1463 7424
> > Etroung.p...@team.telstra.com
> > W  www.telstra.com
> >
> >
> >
> > -Original Message-
> > From: Sean Owen [mailto:sro...@gmail.com]
> > Sent: Wednesday, 2 April 2014 3:33 PM
> > To: Mahout User List
> > Subject: Re: Mahout v0.9 is not working with 2.2.0-cdh5.0.0-beta-1
> >
> > This may be getting to you're-on-your-own-territory since you're
> modifying the build. This error means your directory structure doesn't
> match up with declarations. You said somewhere that the parent of module X
> was Y, but the location given points to the pom of a module that isn't Y.
> >
> > On Wed, Apr 2, 2014 at 5:28 AM, Phan, Truong Q <
> troung.p...@team.telstra.com> wrote:
> >> Hi Sean,
> >>
> >> I am trying to build the Mahout again and got some WARNINGs so far.
> >> Can you give me some hints what I have done wrong here?
> >>
> >> Thanks for your help so far.
>


Re: Mahout v0.9 is not working with 2.2.0-cdh5.0.0-beta-1

2014-04-01 Thread Sean Owen
Hm, OK something sounds wrong with your directory structure, given the
warnings. I assumed this was changed. It could be that the .tar.gz
distribution isn't quite correctly set up for building from source.

The compilation here is nothing to do with Hadoop. You show a
successful build; what's the part that fails?

On Wed, Apr 2, 2014 at 6:59 AM, Phan, Truong Q
 wrote:
> Where did I modify the build?
> Here are my steps of the build.
> I got the source from one of the official mirror websites and built it.
> The only exception here is that I am using Cloudera CDH 5.0.
> This latest CDH v5.0 might not work with Mahout v0.9.
>
> ++
> Install Mahout
>
> $  javac -version
> javac 1.6.0_32
>
> $ cd /ora/db002/stg001/BDMSL1D/hadoop/nem-dms/devices/mahout
>
> $ wget 
> http://mirror.mel.bkb.net.au/pub/apache/mahout/0.9/mahout-distribution-0.9-src.tar.gz
>
> $ mv mahout-distribution-0.9 mahout-distribution-0.9.old
>
> $ tar xvf mahout-distribution-0.9-src.tar.gz
>
> $ cd mahout-distribution-0.9
>
> $ ls
> bin  buildtools  core  distribution  examples  integration  LICENSE.txt  math 
>  math-scala  NOTICE.txt  pom.xml  README.txt  src
>
> $ mvn clean install -Dhadoop2.version=2.2.0-cdh5.0.0-beta-1 -DskipTests=true 
> [INFO] Scanning for projects...
> 
>
> [INFO] 
> 
> [INFO] Reactor Summary:
> [INFO]
> [INFO] Mahout Build Tools  SUCCESS [  7.235 s]
> [INFO] Apache Mahout . SUCCESS [  1.017 s]
> [INFO] Mahout Math ... SUCCESS [15:46 min]
> [INFO] Mahout Core ... SUCCESS [24:29 min]
> [INFO] Mahout Integration  SUCCESS [03:38 min]
> [INFO] Mahout Examples ... SUCCESS [02:40 min]
> [INFO] Mahout Release Package  SUCCESS [  0.075 s]
> [INFO] Mahout Math/Scala wrappers  SUCCESS [01:12 min]
> [INFO] 
> 
> [INFO] BUILD SUCCESS
> [INFO] 
> 
> [INFO] Total time: 47:57 min
> [INFO] Finished at: 2014-04-02T15:07:50+10:00
> [INFO] Final Memory: 49M/288M
> [INFO] 
> 
>
>
> Thanks and Regards,
> Truong Phan
>
>
> P+ 61 2 8576 5771
> M   + 61 4 1463 7424
> Etroung.p...@team.telstra.com
> W  www.telstra.com
>
>
>
> -Original Message-
> From: Sean Owen [mailto:sro...@gmail.com]
> Sent: Wednesday, 2 April 2014 3:33 PM
> To: Mahout User List
> Subject: Re: Mahout v0.9 is not working with 2.2.0-cdh5.0.0-beta-1
>
> This may be getting to you're-on-your-own-territory since you're modifying 
> the build. This error means your directory structure doesn't match up with 
> declarations. You said somewhere that the parent of module X was Y, but the 
> location given points to the pom of a module that isn't Y.
>
> On Wed, Apr 2, 2014 at 5:28 AM, Phan, Truong Q  
> wrote:
>> Hi Sean,
>>
>> I am trying to build the Mahout again and got some WARNINGs so far.
>> Can you give me some hints what I have done wrong here?
>>
>> Thanks for your help so far.


Re: Mahout v0.9 is not working with 2.2.0-cdh5.0.0-beta-1

2014-04-01 Thread Sean Owen
This may be getting to you're-on-your-own-territory since you're
modifying the build. This error means your directory structure doesn't
match up with declarations. You said somewhere that the parent of
module X was Y, but the location given points to the pom of a module
that isn't Y.

On Wed, Apr 2, 2014 at 5:28 AM, Phan, Truong Q
 wrote:
> Hi Sean,
>
> I am trying to build the Mahout again and got some WARNINGs so far.
> Can you give me some hints what I have done wrong here?
>
> Thanks for your help so far.


Re: Mahout v0.9 is not working with 2.2.0-cdh5.0.0-beta-1

2014-04-01 Thread Sean Owen
reduce-client-app.jar 
> -> hadoop-mapreduce-client-app-2.2.0-cdh5.0.0-beta-1.jar
> -rw-r--r--  1 root root  656300 Oct 28 11:30 
> hadoop-mapreduce-client-common-2.2.0-cdh5.0.0-beta-1.jar
> lrwxrwxrwx  1 root root  56 Feb  5 07:05 
> hadoop-mapreduce-client-common.jar -> 
> hadoop-mapreduce-client-common-2.2.0-cdh5.0.0-beta-1.jar
> -rw-r--r--  1 root root 1455612 Oct 28 11:30 
> hadoop-mapreduce-client-core-2.2.0-cdh5.0.0-beta-1.jar
> lrwxrwxrwx  1 root root  54 Feb  5 07:05 hadoop-mapreduce-client-core.jar 
> -> hadoop-mapreduce-client-core-2.2.0-cdh5.0.0-beta-1.jar
> -rw-r--r--  1 root root  117249 Oct 28 11:30 
> hadoop-mapreduce-client-hs-2.2.0-cdh5.0.0-beta-1.jar
> lrwxrwxrwx  1 root root  52 Feb  5 07:05 hadoop-mapreduce-client-hs.jar 
> -> hadoop-mapreduce-client-hs-2.2.0-cdh5.0.0-beta-1.jar
> -rw-r--r--  1 root root4086 Oct 28 11:30 
> hadoop-mapreduce-client-hs-plugins-2.2.0-cdh5.0.0-beta-1.jar
> lrwxrwxrwx  1 root root  60 Feb  5 07:05 
> hadoop-mapreduce-client-hs-plugins.jar -> 
> hadoop-mapreduce-client-hs-plugins-2.2.0-cdh5.0.0-beta-1.jar
> -rw-r--r--  1 root root   35237 Oct 28 11:30 
> hadoop-mapreduce-client-jobclient-2.2.0-cdh5.0.0-beta-1.jar
> -rw-r--r--  1 root root 1434812 Oct 28 11:30 
> hadoop-mapreduce-client-jobclient-2.2.0-cdh5.0.0-beta-1-tests.jar
> lrwxrwxrwx  1 root root  59 Feb  5 07:05 
> hadoop-mapreduce-client-jobclient.jar -> 
> hadoop-mapreduce-client-jobclient-2.2.0-cdh5.0.0-beta-1.jar
> -rw-r--r--  1 root root   21566 Oct 28 11:30 
> hadoop-mapreduce-client-shuffle-2.2.0-cdh5.0.0-beta-1.jar
> lrwxrwxrwx  1 root root  57 Feb  5 07:05 
> hadoop-mapreduce-client-shuffle.jar -> 
> hadoop-mapreduce-client-shuffle-2.2.0-cdh5.0.0-beta-1.jar
> -rw-r--r--  1 root root  270285 Oct 28 11:30 
> hadoop-mapreduce-examples-2.2.0-cdh5.0.0-beta-1.jar
> lrwxrwxrwx  1 root root  51 Feb  5 07:05 hadoop-mapreduce-examples.jar -> 
> hadoop-mapreduce-examples-2.2.0-cdh5.0.0-beta-1.jar
> -rw-r--r--  1 root root  277597 Oct 28 11:30 
> hadoop-rumen-2.2.0-cdh5.0.0-beta-1.jar
> lrwxrwxrwx  1 root root  38 Feb  5 07:05 hadoop-rumen.jar -> 
> hadoop-rumen-2.2.0-cdh5.0.0-beta-1.jar
> -rw-r--r--  1 root root  102813 Oct 28 11:30 
> hadoop-streaming-2.2.0-cdh5.0.0-beta-1.jar
> lrwxrwxrwx  1 root root  42 Feb  5 07:05 hadoop-streaming.jar -> 
> hadoop-streaming-2.2.0-cdh5.0.0-beta-1.jar
> drwxr-xr-x  2 root root4096 Feb  5 07:05 lib
> drwxr-xr-x  2 root root4096 Feb  5 07:05 sbin
>
> +
>
>
> Thanks and Regards,
> Truong Phan
>
>
> P+ 61 2 8576 5771
> M   + 61 4 1463 7424
> Etroung.p...@team.telstra.com
> W  www.telstra.com
>
>
>
> -Original Message-
> From: Sean Owen [mailto:sro...@gmail.com]
> Sent: Monday, 31 March 2014 7:05 PM
> To: Mahout User List
> Subject: RE: Mahout v0.9 is not working with 2.2.0-cdh5.0.0-beta-1
>
> But you have a bunch of Hadoop 0.20 jars on your classpath! Definitely a 
> problem. Those should not be there.
> On Mar 31, 2014 7:09 AM, "Phan, Truong Q" 
> wrote:
>
>> Yes, I did rebuild it.
>>
>> oracle@bpdevdmsdbs01: 
>> /ora/db002/stg001/BDMSL1D/hadoop/nem-dms/devices/mahout/mahout-distrib
>> ution-0.9
>> -
>> $ mvn clean install -Dhadoop2.version=2.2.0-cdh5.0.0-beta-1 -DskipTests=true
>> [INFO] Scanning for projects...
>>
>> [INFO] ------------------------------------------------------------------------
>> [INFO] Reactor Summary:
>> [INFO]
>> [INFO] Mahout Build Tools  SUCCESS [  8.215 s]
>> [INFO] Apache Mahout . SUCCESS [  1.158 s]
>> [INFO] Mahout Math ... SUCCESS [16:21 min]
>> [INFO] Mahout Core ... SUCCESS [26:21 min]
>> [INFO] Mahout Integration  SUCCESS [03:55 min]
>> [INFO] Mahout Examples ... SUCCESS [02:54 min]
>> [INFO] Mahout Release Package  SUCCESS [  0.084 s]
>> [INFO] Mahout Math/Scala wrappers  SUCCESS [01:16 min]
>> [INFO] ------------------------------------------------------------------------
>> [INFO] BUILD SUCCESS
>> [INFO] ------------------------------------------------------------------------
>> [INFO] Total time: 50:59 min
>> [INFO] Finished at: 2014-03-31T14:25:27+10:00
>> [INFO] Final Memory: 47M/250M
>> [INFO] ------------------------------------------------------------------------

RE: Mahout v0.9 is not working with 2.2.0-cdh5.0.0-beta-1

2014-03-31 Thread Sean Owen
But you have a bunch of Hadoop 0.20 jars on your classpath! Definitely a
problem. Those should not be there.
On Mar 31, 2014 7:09 AM, "Phan, Truong Q" 
wrote:

> Yes, I did rebuild it.
>
> oracle@bpdevdmsdbs01: 
> /ora/db002/stg001/BDMSL1D/hadoop/nem-dms/devices/mahout/mahout-distribution-0.9
> -
> $ mvn clean install -Dhadoop2.version=2.2.0-cdh5.0.0-beta-1
> -DskipTests=true
> [INFO] Scanning for projects...
> 
> [INFO]
> 
> [INFO] Reactor Summary:
> [INFO]
> [INFO] Mahout Build Tools  SUCCESS [
>  8.215 s]
> [INFO] Apache Mahout . SUCCESS [
>  1.158 s]
> [INFO] Mahout Math ... SUCCESS [16:21
> min]
> [INFO] Mahout Core ... SUCCESS [26:21
> min]
> [INFO] Mahout Integration  SUCCESS [03:55
> min]
> [INFO] Mahout Examples ... SUCCESS [02:54
> min]
> [INFO] Mahout Release Package  SUCCESS [
>  0.084 s]
> [INFO] Mahout Math/Scala wrappers  SUCCESS [01:16
> min]
> [INFO]
> 
> [INFO] BUILD SUCCESS
> [INFO]
> 
> [INFO] Total time: 50:59 min
> [INFO] Finished at: 2014-03-31T14:25:27+10:00
> [INFO] Final Memory: 47M/250M
> [INFO]
> 
>
>
> Thanks and Regards,
> Truong Phan
>
>
> P+ 61 2 8576 5771
> M   + 61 4 1463 7424
> Etroung.p...@team.telstra.com
> W  www.telstra.com
>
>
> -Original Message-
> From: Andrew Musselman [mailto:andrew.mussel...@gmail.com]
> Sent: Monday, 31 March 2014 2:44 PM
> To: user@mahout.apache.org
> Subject: Re: Mahout v0.9 is not working with 2.2.0-cdh5.0.0-beta-1
>
> Have you rebuilt Mahout for your version?  We're not supporting Hadoop
> version two yet.
>
> See here for some direction:
> http://mail-archives.us.apache.org/mod_mbox/mahout-user/201403.mbox/%3CCANg8BGD8Cm_=ESecQQ5mDL+6ybbNrR1Ce7i=pkuimxmcktw...@mail.gmail.com%3E
>
> > On Mar 30, 2014, at 7:28 PM, "Phan, Truong Q" <
> troung.p...@team.telstra.com> wrote:
> >
> > Hi
> >
> > Does Mahout v0.9 support Cloudera Hadoop v5 (2.2.0-cdh5.0.0-beta-1)?
> > I have managed to install and run all test cases under Mahout v0.9
> > without any issue.
> > Please see below for the evidence of the test cases.
> > However I have had no success running the example from
> > http://girlincomputerscience.blogspot.com.au/2010/11/apache-mahout.html
> > and got the following errors.
> > Note: I have set the CLASSPATH to point to all of Mahout’s jar files.
> >
> > 
> > $ env | grep CLASS
> > CLASSPATH=:/usr/lib/hadoop-0.20-mapreduce/lib:/usr/lib/hadoop-0.20-map
> > reduce/lib:/ora/db002/stg001/BDMSL1D/hadoop/nem-dms/devices/mahout/mah
> > out-distribution-0.9/core/target/mahout-core-0.9.jar:/ora/db002/stg001
> > /BDMSL1D/hadoop/nem-dms/devices/mahout/mahout-distribution-0.9/core/ta
> > rget/mahout-core-0.9-job.jar:/ora/db002/stg001/BDMSL1D/hadoop/nem-dms/
> > devices/mahout/mahout-distribution-0.9/core/target/mahout-core-0.9-sou
> > rces.jar:/ora/db002/stg001/BDMSL1D/hadoop/nem-dms/devices/mahout/mahou
> > t-distribution-0.9/core/target/mahout-core-0.9-tests.jar:/ora/db002/st
> > g001/BDMSL1D/hadoop/nem-dms/devices/mahout/mahout-distribution-0.9/mat
> > h/target/mahout-math-0.9.jar:/ora/db002/stg001/BDMSL1D/hadoop/nem-dms/
> > devices/mahout/mahout-distribution-0.9/math/target/mahout-math-0.9-sou
> > rces.jar:/ora/db002/stg001/BDMSL1D/hadoop/nem-dms/devices/mahout/mahou
> > t-distribution-0.9/math/target/mahout-math-0.9-tests.jar:/ora/db002/st
> > g001/BDMSL1D/hadoop/nem-dms/devices/mahout/mahout-distribution-0.9/int
> > egration/target/mahout-integration-0.9.jar:/ora/db002/stg001/BDMSL1D/h
> > adoop/nem-dms/devices/mahout/mahout-distribution-0.9/integration/targe
> > t/mahout-integration-0.9-sources.jar
> >
> > $ export
> > MAHOUT_HOME=/ora/db002/stg001/BDMSL1D/hadoop/nem-dms/devices/mahout/ma
> > hout-distribution-0.9
> > $ export PATH=$MAHOUT_HOME/bin:$PATH
> >
> > oracle@bpdevdmsdbs01:BDMSSI1D1 
> > /ora/db002/stg001/BDMSL1D/hadoop/nem-dms/devices/mahout/mahout-distrib
> > ution-0.9/nem-dms - $ mahout recommenditembased --input mydata.dat
> > --usersFile user.dat --numRecommendations 2 --output output/
> > --similarityClassname SIMILARITY_PEARSON_CORRELATION Running on
> > hadoop, using /usr/lib/hadoop-0.20-mapreduce/bin/hadoop and
> > HADOOP_CONF_DIR=
> > MAHOUT-JOB:
> > /ora/db002/stg001/BDMSL1D/hadoop/nem-dms/devices/mahout/mahout-distrib
> > ution-0.9/examples/target/mahout-examples-0.9-job.jar
> > Exception in thread "main" java.lang.NoClassDefFoundError:
> > org/apache/hadoop/util/PlatformName
> > Caused by: java.lang.ClassNotFoundException:
> org.apache.hadoop.util.PlatformName
>

Re: Profiling with visualvm

2014-03-30 Thread Sean Owen
Profiled what exactly, a Hadoop job? If you profile a client, you aren't
learning anything about the work, but just that the client process is
blocked waiting for Hadoop jobs to complete.
On Mar 30, 2014 10:08 AM, "Mahmood Naderan"  wrote:

> Hi,
> I profiled the Mahout command with visualvm and saw many threads. Some of
> them are related to the profiler and some others are communication threads.
> The interesting thing is that the main thread is always in the sleep state!
>
> From the thread dump (which has been attached), the owner is Mahout. Isn't
> that strange?
>
>
> Regards,
> Mahmood
>


Re: Fwd: PCA with ssvd leads to StackOverFlowError

2014-03-06 Thread Sean Owen
If I'm right, then it will cause compile errors, but then, you just
fix those by replacing some Guava constructs with equivalent Java or
older Guava code. IIRC it is fairly trivial.

And in fact Mahout probably should not use Guava 12+ methods for this
reason even if compiling against 12+. In fact, I thought someone had
cleaned that up...
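
As a concrete sketch of the kind of change meant here: the stack trace
quoted below fails in GroupTree.java on Queues.newArrayDeque(), a Guava
convenience method that only exists in Guava 12+. The plain-JDK equivalent
runs against any Guava version; "Group" is just a stand-in for whatever
element type the real code uses.

    import java.util.ArrayDeque;
    import java.util.Deque;

    // Guava 12+ shortcut that breaks under Hadoop's bundled Guava 11.0.2:
    //   Deque<Group> pending = Queues.newArrayDeque();
    // Equivalent plain Java, no Guava required:
    Deque<Group> pending = new ArrayDeque<Group>();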

On Thu, Mar 6, 2014 at 3:34 PM, Kevin Moulart  wrote:
> Ok so should I try and recompile and change the guava version to 11.0.2 in
> the pom ?
>
> Kévin Moulart
>
>
> 2014-03-06 16:26 GMT+01:00 Sean Owen :
>
>> That's gonna be a Guava version problem. I have seen variants of this
>> for a while. Hadoop still uses 11.0.2 even in HEAD and you can often
>> get away with using a later version in a project like this, even
>> though code that executes on Hadoop will use an older Guava than you
>> compiled against. This is an example of that gotcha. I think it may be
>> necessary to force Mahout to use 11.0.2 and change this code.
>>
>> I am having deja vu like this has come up before too.
>>
>>
>>
>>
>>
>> On Thu, Mar 6, 2014 at 3:23 PM, Kevin Moulart 
>> wrote:
>> > Hi, thanks very much, it seems to have worked!
>> > Compiling with "mvn clean package -Dhadoop2.version=2.0.0-cdh4.6.0" works
>> > and I no longer have the error, but when running tests that used to
>> > work with the previous install, like trainAdaptiveLogistic and then
>> > validateAdaptiveLogistic, the first works but the second yields an
>> > error:
>> >
>> > bin/mahout validateAdaptiveLogistic --input
>> > /mnt/hdfs/user/myCompany/Echant/echant300k_wh.csv --model
>> > /mnt/hdfs/user/myCompany/Echant/Models/echnat.model --auc --scores
>> > --confusion.
>> > 14/03/06 15:53:42 WARN driver.MahoutDriver: No
>> > validateAdaptiveLogistic.props found on classpath, will use command-line
>> > arguments only
>> > Exception in thread "main" java.lang.NoSuchMethodError:
>> > com.google.common.collect.Queues.newArrayDeque()Ljava/util/ArrayDeque;
>> > at org.apache.mahout.math.stats.GroupTree$1.<init>(GroupTree.java:171)
>> >  at org.apache.mahout.math.stats.GroupTree.iterator(GroupTree.java:169)
>> > at org.apache.mahout.math.stats.GroupTree.access$300(GroupTree.java:14)
>> >  at org.apache.mahout.math.stats.GroupTree$2.iterator(GroupTree.java:317)
>> > at org.apache.mahout.math.stats.TDigest.add(TDigest.java:105)
>> >  at org.apache.mahout.math.stats.TDigest.add(TDigest.java:88)
>> > at org.apache.mahout.math.stats.TDigest.add(TDigest.java:76)
>> >  at
>> >
>> org.apache.mahout.math.stats.OnlineSummarizer.add(OnlineSummarizer.java:57)
>> > at
>> >
>> org.apache.mahout.classifier.sgd.ValidateAdaptiveLogistic.mainToOutput(ValidateAdaptiveLogistic.java:107)
>> >  at
>> >
>> org.apache.mahout.classifier.sgd.ValidateAdaptiveLogistic.main(ValidateAdaptiveLogistic.java:63)
>> > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>> >  at
>> >
>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>> > at
>> >
>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>> >  at java.lang.reflect.Method.invoke(Method.java:606)
>> > at
>> >
>> org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:72)
>> >  at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:144)
>> > at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:195)
>> >  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>> > at
>> >
>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>> >  at
>> >
>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>> > at java.lang.reflect.Method.invoke(Method.java:606)
>> >  at org.apache.hadoop.util.RunJar.main(RunJar.java:208)
>> >
>> > I'll try some other tests to see what's working and what's not.
>> >
>> >
>> >
>> > 2014-03-06 15:58 GMT+01:00 Gokhan Capan :
>> >
>> >> Kevin,
>> >>
>> >>
>> >> From trunk, can you build mahout for hadoop2 using this command:
>> >>
>> >> mvn clean package -DskipTests=true
>> -Dhadoop2.version=
>> >>
>> >>
>> >> Then can you verify that you have the right hadoop jars with the
>> following
>> 

Re: Fwd: PCA with ssvd leads to StackOverFlowError

2014-03-06 Thread Sean Owen
/opt/cloudera/parcels/CDH/lib/hadoop/bin/hadoop
>> > and HADOOP_CONF_DIR=/etc/hadoop/conf
>> > MAHOUT-JOB:
>> >
>> >
>> /home/myCompany/Downloads/mahout9/examples/target/mahout-examples-1.0-SNAPSHOT-job.jar
>> > Exception in thread "main" java.lang.NoSuchMethodError:
>> > org.apache.hadoop.util.ProgramDriver.driver([Ljava/lang/String;)V
>> > at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:122)
>> >  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>> > at
>> >
>> >
>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>> >  at
>> >
>> >
>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>> > at java.lang.reflect.Method.invoke(Method.java:606)
>> >  at org.apache.hadoop.util.RunJar.main(RunJar.java:208)
>> >
>> > I even tried with :
>> > export HADOOP_HOME=/.../hadoop,
>> > export HADOOP_HOME=/.../hadoop-0.20-mapreduce
>> > export HADOOP_HOME=/.../hadoop-mapreduce
>> >
>> > And it still gives me the same result.
>> >
>> > And recompiling with  2.0.0 or
>> >  2.0.0-mr1-cdh4.6.0 didn't work.
>> >
>> > Any idea ?
>> >
>> >
>> >
>> > 2014-03-05 22:42 GMT+01:00 Andrew Musselman > >:
>> >
>> > > I mean "balance the risk aversion against the value of new features"
>> duh.
>> > >
>> > >
>> > > On Wed, Mar 5, 2014 at 1:39 PM, Andrew Musselman <
>> > > andrew.mussel...@gmail.com
>> > > > wrote:
>> > >
>> > > > Yeah, for sure; balancing clients' risk aversion to technical
>> features
>> > is
>> > > > why we often recommend vendor solutions.
>> > > >
>> > > > Having a little button to choose a newer version of a component in
>> the
>> > > > Manager UI (even with a confirmation dialog that said "Are you sure?
>> > Are
>> > > > you crazy?") would be more palatable to some teams than installing
>> > > > tarballs, is what I'm getting at.
>> > > >
>> > > >
>> > > > On Wed, Mar 5, 2014 at 1:30 PM, Sean Owen  wrote:
>> > > >
>> > > >> You can always install whatever version of anything on your cluster
>> > > >> that you want. It may or may not work, but often happens to, at
>> least
>> > > >> for whatever you need it to do.
>> > > >>
>> > > >> It's just the same as it is without a packaged distribution -- dump
>> > > >> new tarballs and cross your fingers. Nothing is weird or different
>> > > >> about the setup or layout. That is the "here be dragons" solution,
>> > > >> already
>> > > >>
>> > > >> You go with support from a packaged distribution when you want a
>> "here
>> > > >> be no dragons" solution. Everything else is by definition already
>> > > >> something you can and should do yourself outside of a packaged
>> > > >> distribution. And really -- you freely can, and it's not hard, if
>> you
>> > > >> know what you are doing.
>> > > >>
>> > > >> On Wed, Mar 5, 2014 at 9:15 PM, Andrew Musselman
>> > > >>  wrote:
>> > > >> > Feels like just yesterday :)
>> > > >> >
>> > > >> > Consider this a feature request to have more flexible component
>> > > >> versioning,
>> > > >> > even with a caveat/"here be dragons" warning.  I know that
>> > complicates
>> > > >> > things but people do use your releases a long time.  I personally
>> > > >> wished I
>> > > >> > could upgrade Pig on CDH 4 for new features but there was no
>> simple
>> > > way
>> > > >> on
>> > > >> > a managed cluster.
>> > > >> >
>> > > >> >
>> > > >> > On Wed, Mar 5, 2014 at 12:12 PM, Sean Owen 
>> > wrote:
>> > > >> >
>> > > >> >> I don't understand this -- CDH always bundles the latest release.
>> > > >> >>
>> > > >> >> You know that CDH4 was released in July 2012, right? So it
>> included
>> > > >> >> 0.7 + patches. CDH5 includes 0.8 because 0.9 was released about a
>> > > >> >> month after it began beta 2.
>> > > >> >>
>> > > >> >> CDH follows semantic versioning and won't introduce changes that
>> > are
>> > > >> >> not backwards-compatible in a minor version update. 0.x releases
>> of
>> > > >> >> Mahout act like major version changes -- not backwards
>> compatible.
>> > So
>> > > >> >> 4.x will always be 0.7 and 5.x will always be 0.8.
>> > > >> >>
>> > > >> >> On Wed, Mar 5, 2014 at 5:34 PM, Dmitriy Lyubimov <
>> > dlie...@gmail.com>
>> > > >> >> wrote:
>> > > >> >> > On Wed, Mar 5, 2014 at 9:08 AM, Sean Owen 
>> > > wrote:
>> > > >> >> >
>> > > >> >> >> I don't follow what here makes you say they are "cut down"
>> > > releases?
>> > > >> >> >>
>> > > >> >> >
>> > > >> >> > meaning it seems to be pretty much 2 releases behind the
>> > official.
>> > > >> But i
>> > > >> >> > definitely don't follow CDH developments in this department,
>> you
>> > > >> seem in
>> > > >> >> a
>> > > >> >> > better position to explain the existing patchlevel there so I
>> > defer
>> > > >> to
>> > > >> >> you
>> > > >> >> > to explain why this patchlevel is not there.
>> > > >> >> >
>> > > >> >> > -d
>> > > >> >>
>> > > >>
>> > > >
>> > > >
>> > >
>> >
>> >
>> >
>> > --
>> > Kévin Moulart
>> > GSM France : +33 7 81 06 10 10
>> > GSM Belgique : +32 473 85 23 85
>> > Téléphone fixe : +32 2 771 88 45
>> >
>>
>
>
>
> --
> Kévin Moulart
> GSM France : +33 7 81 06 10 10
> GSM Belgique : +32 473 85 23 85
> Téléphone fixe : +32 2 771 88 45


Re: Fwd: PCA with ssvd leads to StackOverFlowError

2014-03-05 Thread Sean Owen
You can always install whatever version of anything on your cluster
that you want. It may or may not work, but often happens to, at least
for whatever you need it to do.

It's just the same as it is without a packaged distribution -- dump
new tarballs and cross your fingers. Nothing is weird or different
about the setup or layout. That is the "here be dragons" solution,
already.

You go with support from a packaged distribution when you want a "here
be no dragons" solution. Everything else is by definition already
something you can and should do yourself outside of a packaged
distribution. And really -- you freely can, and it's not hard, if you
know what you are doing.

On Wed, Mar 5, 2014 at 9:15 PM, Andrew Musselman
 wrote:
> Feels like just yesterday :)
>
> Consider this a feature request to have more flexible component versioning,
> even with a caveat/"here be dragons" warning.  I know that complicates
> things but people do use your releases a long time.  I personally wished I
> could upgrade Pig on CDH 4 for new features but there was no simple way on
> a managed cluster.
>
>
> On Wed, Mar 5, 2014 at 12:12 PM, Sean Owen  wrote:
>
>> I don't understand this -- CDH always bundles the latest release.
>>
>> You know that CDH4 was released in July 2012, right? So it included
>> 0.7 + patches. CDH5 includes 0.8 because 0.9 was released about a
>> month after it began beta 2.
>>
>> CDH follows semantic versioning and won't introduce changes that are
>> not backwards-compatible in a minor version update. 0.x releases of
>> Mahout act like major version changes -- not backwards compatible. So
>> 4.x will always be 0.7 and 5.x will always be 0.8.
>>
>> On Wed, Mar 5, 2014 at 5:34 PM, Dmitriy Lyubimov 
>> wrote:
>> > On Wed, Mar 5, 2014 at 9:08 AM, Sean Owen  wrote:
>> >
>> >> I don't follow what here makes you say they are "cut down" releases?
>> >>
>> >
>> > meaning it seems to be pretty much 2 releases behind the official. But i
>> > definitely don't follow CDH developments in this department, you seem in
>> a
>> > better position to explain the existing patchlevel there so I defer to
>> you
>> > to explain why this patchlevel is not there.
>> >
>> > -d
>>


Re: Fwd: PCA with ssvd leads to StackOverFlowError

2014-03-05 Thread Sean Owen
I don't understand this -- CDH always bundles the latest release.

You know that CDH4 was released in July 2012, right? So it included
0.7 + patches. CDH5 includes 0.8 because 0.9 was released about a
month after it began beta 2.

CDH follows semantic versioning and won't introduce changes that are
not backwards-compatible in a minor version update. 0.x releases of
Mahout act like major version changes -- not backwards compatible. So
4.x will always be 0.7 and 5.x will always be 0.8.

On Wed, Mar 5, 2014 at 5:34 PM, Dmitriy Lyubimov  wrote:
> On Wed, Mar 5, 2014 at 9:08 AM, Sean Owen  wrote:
>
>> I don't follow what here makes you say they are "cut down" releases?
>>
>
> meaning it seems to be pretty much 2 releases behind the official. But i
> definitely don't follow CDH developments in this department, you seem in a
> better position to explain the existing patchlevel there so I defer to you
> to explain why this patchlevel is not there.
>
> -d


Re: Fwd: PCA with ssvd leads to StackOverFlowError

2014-03-05 Thread Sean Owen
I don't follow what here makes you say they are "cut down" releases?
They are release plus patches not release minus patches.

The question is not about how to use 0.7, but how to use 1.0-SNAPSHOT.
Why would switching to the "official" 0.7 release help?

I think the answer is "you build Mahout for Hadoop 2", right? This has
always been the case. Mahout has always been Hadoop 1, with 2 support
"on the side".

On Wed, Mar 5, 2014 at 5:04 PM, Dmitriy Lyubimov  wrote:
> Yeah. It would seem CDH releases of Mahout produce some sort of cut-down
> version. I suggest switching to the official release tarball (or writing
> to Cloudera support about it).
>


Re: Fwd: PCA with ssvd leads to StackOverFlowError

2014-03-05 Thread Sean Owen
CDH 4.5 and 4.6 are both 0.7 + patches. Neither contains 0.8, since it
has (tiny) breaking changes vs 0.7 and this is a minor version update.
CDH5 contains 0.8 + patches. I did not say CDH4 has 0.8 -- re-read the
message of mine that was quoted.

http://archive.cloudera.com/cdh4/cdh/4/mahout-0.7-cdh4.5.0.CHANGES.txt
http://archive.cloudera.com/cdh4/cdh/4/mahout-0.7-cdh4.6.0.CHANGES.txt

Those two patches are not in CDH 4.x, no.

The non-upstream changes are basically all internal packaging stuff,
and that can include modifying dependency versions to harmonize with
the rest of the platform. That's the sense in which it works with
Hadoop 2.

I don't think the change you cite is sufficient to work with Hadoop 2.
You also, for example, must build against the Hadoop 2 profile in
Mahout in Maven. For that you do not need the CDH repo even, just
point to the Hadoop 2.x release if you like.

I know there has been a patch in even just the past few weeks that
makes it work even better with 2.x. So I suppose I would build from
HEAD if possible to take advantage.

On Wed, Mar 5, 2014 at 4:30 PM, Suneel Marthi  wrote:
> Not sure if the CDH4 patches on top of 0.7 have fixes for MAHOUT-1067 and
> MAHOUT-1098, which address the issues you are seeing.
>
>
>
> The second part of the issue you are seeing with the Mahout 0.9 distro seems
> to be related to how you set it up on CDH4. I apologize for not being helpful
> here as I am not a CDH4 user or expert.
>
> Sean?
>


Re: Mahout on Spark?

2014-02-19 Thread Sean Owen
To set expectations appropriately, I think it's important to point out
this is completely infeasible short of a total rewrite, and I can't
imagine that will happen. It may not be obvious if you haven't looked
at the code how completely dependent on M/R it is.

You can swap out M/R and Spark if you write in terms of something like
Crunch, but that is not at all the case here.

On Wed, Feb 19, 2014 at 12:43 PM, Jay Vyas  wrote:
> +100 for this, different execution engines, like the direction Pig and
> Crunch take
>
> Sent from my iPhone
>
>> On Feb 19, 2014, at 5:19 AM, Gokhan Capan  wrote:
>>
>> I imagine in Mahout offering an option to the users to select from
>> different execution engines (just like we currently do by giving M/R or
>> sequential options), and starting from Spark. I am not sure what changes
>> needed in the codebase, though. Maybe following MLI (or alike) and
>> implementing some more stuff, such as common interfaces for iterating over
>> data (the M/R way and the Spark way).
>>
>> IMO, another effort might be porting pre-online machine learning (such
>> transforming text into vector based on the dictionary generated by
>> seq2sparse before), machine learning based on mini-batches, and streaming
>> summarization stuff in Mahout to Spark-Streaming.
>>
>> Best,
>> Gokhan
>>
>> On Wed, Feb 19, 2014 at 10:45 AM, Dmitriy Lyubimov wrote:
>>
>>> PS I am moving along a cost optimizer for Spark-backed DRMs on some
>>> multiplicative pipelines that is capable of figuring out different cost-based
>>> rewrites, and an R-like DSL that mixes in-core and distributed matrix
>>> representations and blocks, but it is painfully slow; I am really only doing
>>> it a couple of nights a month. It does not look like I will be doing it on
>>> company time any time soon (and even if I did, the company doesn't seem to
>>> be inclined to contribute anything new I do on their time). It is
>>> all painfully slow; there's no direct funding for it anywhere with no
>>> strings attached. That is probably the primary reason why Mahout would not
>>> be able to get much traction compared to university-based contributions.
>>>
>>>
>>> On Wed, Feb 19, 2014 at 12:27 AM, Dmitriy Lyubimov >>> wrote:
>>>
 Unfortunately methinks the prospects of something like a Mahout/MLlib merge
 seem very unlikely due to vastly diverged approaches to the basics of
>>> linear
 algebra (and other things). Just like one cannot grow a single tree out of
 two trunks -- not easily, anyway.

 It is fairly easy to port (and subsequently beat) MLlib at this point from
 a collection-of-algorithms point of view. But IMO the goal should be more
 MLI-like first, and a port second. And be very careful with concepts.
 Something that I so far don't see happening with MLlib. MLlib seems to be
 an old-style Mahout-like rush to become a collection of basic algorithms
 rather than a coherent foundation. Admittedly, I haven't looked very
>>> closely.


 On Tue, Feb 18, 2014 at 11:41 PM, Sebastian Schelter >>> wrote:

> I'm also convinced that Spark is a superior platform for executing
> distributed ML algorithms. We've had a discussion about a change from
> Hadoop to another platform some time ago, but at that point in time it
>>> was
> not clear which of the upcoming dataflow processing systems (Spark,
> Hyracks, Stratosphere) would establish itself amongst the users. To me
>>> it
> seems pretty obvious that Spark made the race.
>
> I concur with Ted, it would be great to have the communities work
> together. I know that at least 4 mahout committers (including me) are
> already following Spark's mailinglist and actively participating in the
> discussions.
>
> What are the ideas how a fruitful cooperation look like?
>
> Best,
> Sebastian
>
> PS:
>
> I ported LLR-based cooccurrence analysis (aka item-based recommendation)
> to Spark some time ago, but I haven't had time to test my code on a
>>> large
> dataset yet. I'd be happy to see someone help with that.
>
>
>
>
>
>
>> On 02/19/2014 08:04 AM, Nick Pentreath wrote:
>>
>> I know the Spark/Mllib devs can occasionally be quite set in ways of
>> doing certain things, but we'd welcome as many Mahout devs as possible
>>> to
>> work together.
>>
>>
>> It may be too late, but perhaps a GSoC project to look at a port of
>>> some
>> stuff like co occurrence recommender and streaming k-means?
>>
>>
>>
>>
>> N
>> --
>> Sent from Mailbox for iPhone
>>
>> On Wed, Feb 19, 2014 at 3:02 AM, Ted Dunning 
>> wrote:
>>
>> On Tue, Feb 18, 2014 at 1:58 PM, Nick Pentreath <
>>> nick.pentre...@gmail.com>wrote:
>>>
 My (admittedly heavily biased) view is Spark is a superior platform
 overall
 for ML. If the two communities can work together to leverage the
 strengths
>>

Re: Mahout on Spark?

2014-02-19 Thread Sean Owen
Agree that 'merging' is so infeasible as to not make sense. Mahout has
been ML on M/R and that's its thing, which seems fine. IMHO this
project has been hurt by an active unwillingness to define scope, and
by pretending it's helpful to have little bits of lots of ideas and
technologies.

I also don't see a point in trying to duplicate mllib. Just add to
mllib. It's Apache, etc. I also agree that being a bag of algorithms
is a bad idea and we have told the mllib folks as much FWIW.

The Spark / Databricks guys are the few qualified to manage
contributions to mllib, and are doing a heroic job of handling the
flood of PRs. (Does Matei sleep anymore?) But they're getting overrun,
and focused on getting the machinery of Spark really production-ready,
esp. on Hadoop. My concern about mllib in the short term is there
aren't enough expert brain cells to spare to manage the load of
production-izing work that mllib could use, because it's secondary to
core Spark. All the more reason I can't see, in practice, any spare
cycles available to do some kind of Mahout-integration anything.

(FWIW I have high hopes for mllib and, assuming we can get some basic
stuff fixed, we're going to replace M/R-based implementations with
Spark in the stuff I work on. It still needs a decent RDF implementation.
But then again, so does Mahout :( )


On Wed, Feb 19, 2014 at 8:27 AM, Dmitriy Lyubimov  wrote:
> Unfortunately methinks the prospects of something like a Mahout/MLlib merge
> seem very unlikely due to vastly diverged approaches to the basics of linear
> algebra (and other things). Just like one cannot grow a single tree out of
> two trunks -- not easily, anyway.
>
> It is fairly easy to port (and subsequently beat) MLlib at this point from
> a collection-of-algorithms point of view. But IMO the goal should be more
> MLI-like first, and a port second. And be very careful with concepts.
> Something that I so far don't see happening with MLlib. MLlib seems to be
> an old-style Mahout-like rush to become a collection of basic algorithms
> rather than a coherent foundation. Admittedly, I haven't looked very closely.
>
>
> On Tue, Feb 18, 2014 at 11:41 PM, Sebastian Schelter  wrote:
>
>> I'm also convinced that Spark is a superior platform for executing
>> distributed ML algorithms. We've had a discussion about a change from
>> Hadoop to another platform some time ago, but at that point in time it was
>> not clear which of the upcoming dataflow processing systems (Spark,
>> Hyracks, Stratosphere) would establish itself amongst the users. To me it
>> seems pretty obvious that Spark made the race.
>>
>> I concur with Ted, it would be great to have the communities work
>> together. I know that at least 4 mahout committers (including me) are
>> already following Spark's mailinglist and actively participating in the
>> discussions.
>>
>> What are the ideas how a fruitful cooperation look like?
>>
>> Best,
>> Sebastian
>>
>> PS:
>>
>> I ported LLR-based cooccurrence analysis (aka item-based recommendation)
>> to Spark some time ago, but I haven't had time to test my code on a large
>> dataset yet. I'd be happy to see someone help with that.
>>
>>
>>
>>
>>
>>
>> On 02/19/2014 08:04 AM, Nick Pentreath wrote:
>>
>>> I know the Spark/Mllib devs can occasionally be quite set in ways of
>>> doing certain things, but we'd welcome as many Mahout devs as possible to
>>> work together.
>>>
>>>
>>> It may be too late, but perhaps a GSoC project to look at a port of some
>>> stuff like co occurrence recommender and streaming k-means?
>>>
>>>
>>>
>>>
>>> N
>>> --
>>> Sent from Mailbox for iPhone
>>>
>>> On Wed, Feb 19, 2014 at 3:02 AM, Ted Dunning 
>>> wrote:
>>>
>>>  On Tue, Feb 18, 2014 at 1:58 PM, Nick Pentreath <
 nick.pentre...@gmail.com>wrote:

> My (admittedly heavily biased) view is Spark is a superior platform
> overall
> for ML. If the two communities can work together to leverage the
> strengths
> of Spark, and the large amount of good stuff in Mahout (as well as the
> fantastic depth of experience of Mahout devs) I think a lot can be
> achieved!
>
>  It makes a lot of sense that Spark would be better than Hadoop for ML
 purposes given that Hadoop was intended to do web-crawl kinds of things
 and
 Spark was intentionally built to support machine learning.
 Given that Spark has been announced by a majority of the Hadoop-based
 distribution vendors, it makes sense that maybe Mahout should jump in.
 I really would prefer it if the two communities (MLib/MLI and Mahout)
 could
 work more closely together.  There is a lot of good to be had on both
 sides.

>>>
>>


Re: [Edit] Approach for Clustering Data

2014-02-18 Thread Sean Owen
FYI, CDH5 includes version 0.8 + patches. But 0.9 should work fine
with CDH4. You do have to build with the Hadoop 2.x profile, as usual.

On Tue, Feb 18, 2014 at 2:06 PM, Ted Dunning  wrote:
> Bikash,
>
> Don't use that version.  Use a more recent release.  We can't help that
> Cloudera has an old version.
>
>
>
>
> On Tue, Feb 18, 2014 at 1:26 AM, Bikash Gupta wrote:
>
>> Suneel,
>>
>> Thanks for the information.
>>
>> I am using 0.7 packaged with CDH.
>>
>> On Tue, Feb 18, 2014 at 2:14 PM, Suneel Marthi 
>> wrote:
>> >
>> >
>> >
>> >
>> >
>> >
>> > On Tuesday, February 18, 2014 3:37 AM, Bikash Gupta <
>> bikash.gupt...@gmail.com> wrote:
>> >
>> > Ted/Peter,
>> >
>> > Thanks for the response.
>> >
> > This is exactly what I am trying to achieve. Maybe I was not able to
> > put my questions clearly.
>> >
>> > I am clustering on few variables of Customer/User(except their
>> > customer_id/user_id) and storing customer_id/user_id list in a
>> > separate place.
>> >
> > Question) What is the approach to identify each member in each cluster
> > by its unique ID?
>> > Answer) I have to run a script post-clustering to map
>> > customer_id/user_id for the clustered output to identify the member
>> > uniquely.
>> >
> >>> If you are working off of Mahout 0.9 you don't have to do that. The
> clustered output should display the vectors with the vector ID (user_id in
> your case) that belong to a specific cluster, along with the distance of that
> vector from the cluster center.
>> >
>> > Correct me if I am wrong :)
>> >
>> >
>> > On Tue, Feb 18, 2014 at 10:53 AM, Ted Dunning 
>> wrote:
>> >> Bikash,
>> >>
>> >> Peter is just right.
>> >>
>> >> Yes, you can cluster on these few variables that you have.  Probably you
>> >> should translate location to x,y,z coordinates so that you don't have
>> >> strange geometry problems, but location, gender and age are quite
>> >> reasonable characteristics.  You will get a fairly weak clustering since
>> >> these characteristics actually tell very little about people, but it is
>> a
>> >> start.
>> >>
>> >> You *don't* want to cluster using user ID for exactly the reasons that
>> >> Peter mentioned.  Another way to put it is that the user ID tells you
>> >> absolutely nothing about the person and thus is not useful for the
>> >> clustering.
>> >>
>> >> You *do* have to retain the assignment of users to cluster and that
>> >> assignment is usually stored as a list of user ID's for each cluster.
>>  This
>> >> does not at all imply, however, that the user ID was used to *form* the
>> >> cluster.
>> >>
>> >>
>> >>
>> >>
>> >> On Mon, Feb 17, 2014 at 9:01 PM, Peter Jaumann <
>> peter.jauma...@gmail.com>wrote:
>> >>
>> >>> Bikash,
>> >>> As Ted pointed out already..
>> >>> You can cluster on all variables except your customer_id. That's your
>> >>> identifier.
>> >>> Customers within a cluster are 'similar'; how similar depends on the
>> >>> fidelity of your clustering.
>> >>> The clustering algorithm uses (you'll choose) some kind of distance, or
>> >>> similarity/dissimilarity
>> >>> measure (which one to use depends on the type of data you have). This
>> >>> measure will,
>> >>> eventually, determine how separate/how unique your clusters are. Goal
>> is to
>> >>> have your clusters distinct
>> >>> from each other but have the cluster members, within a cluster, as
>> similar
>> >>> as possible.
>> >>>
> >>> In the output, each member in each cluster is uniquely identified by its
> >>> customer_id, the cluster it belongs to,
>> >>> and a distance measure that shows (usually) how close, or not, the
>> >>> customer_id is from its cluster center.
>> >>>
>> >>> In terms of what you want to do, my assumption is that you'd like to
>> find
>> >>> out a structure, or patterns,
>> >>> within your customer base, based on a set of variables that you have.
>> This
>> >>> is often called a segmentation.
>> >>>
>> >>> Hope this helps! What you want to do is a pretty basic and
>> straight-forward
>> >>> application of clustering.
>> >>> Good luck,
>> >>> -Peter
>> >>>
>> >>>
>> >>>
>> >>> On Mon, Feb 17, 2014 at 9:48 PM, Bikash Gupta <
>> bikash.gupt...@gmail.com
>> >>> >wrote:
>> >>>
>> >>> > Basically I am trying to achieve customer segmentation.
>> >>> >
>> >>> > Now to measure customer similarity within a cluster I need to
> >>> > understand which two customers are similar.
>> >>> >
>> >>> > Assumption: To understand these customer uniquely I need to provide
>> >>> > their CustomerId
>> >>> >
> >>> > Is my assumption correct? If yes, will customerId affect the
> >>> > clustering output?
>> >>> >
> >>> > If no, then how can I identify a customer uniquely?
>> >>> >
>> >>> > On Tue, Feb 18, 2014 at 2:55 AM, Ted Dunning 
>> >>> > wrote:
>> >>> > > That really depends on what you want to do.
>> >>> > >
>> >>> > > What is it that you want?
>> >>> > >
>> >>> > >
>> >>> > > On Mon, Feb 17, 2014 at 12:25 PM, Bikash Gupta <
>> >>> bikash.gupt...@gmail.com
>> >>> > >wrote:
>> >>> > >
>> >>> > 

Re: get similar items

2014-02-12 Thread Sean Owen
Try LogLikelihoodSimilarity.
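
A minimal sketch of wiring that up with the Taste API -- the file name and
item ID are placeholders, and this assumes a standard userID,itemID(,value)
CSV:

    import java.io.File;
    import java.util.List;
    import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
    import org.apache.mahout.cf.taste.impl.recommender.GenericItemBasedRecommender;
    import org.apache.mahout.cf.taste.impl.similarity.LogLikelihoodSimilarity;
    import org.apache.mahout.cf.taste.model.DataModel;
    import org.apache.mahout.cf.taste.recommender.RecommendedItem;

    public final class SimilarItems {
      public static void main(String[] args) throws Exception {
        DataModel model = new FileDataModel(new File("prefs.csv"));
        // Log-likelihood needs only co-occurrence, not preference values.
        LogLikelihoodSimilarity similarity = new LogLikelihoodSimilarity(model);
        GenericItemBasedRecommender recommender =
            new GenericItemBasedRecommender(model, similarity);
        List<RecommendedItem> similar = recommender.mostSimilarItems(123L, 10);
        for (RecommendedItem item : similar) {
          System.out.println(item.getItemID() + "\t" + item.getValue());
        }
      }
    }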

On Wed, Feb 12, 2014 at 9:06 AM, 12481...@qq.com <12481...@qq.com> wrote:
> Hi Sean, you said "It depends what ItemSimilarity you are using. "
> what kind of ItemSimilarity can work correctly without preference?
>
> thanks.
>
>
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/get-similar-items-tp1568765p4116843.html
> Sent from the Mahout User List mailing list archive at Nabble.com.


Re: Popularity of recommender items

2014-02-06 Thread Sean Owen
Agree - I thought by asking for most popular you meant to look for apple
pie.

Agree with you and Ted that the sum of similarity says something
interesting even if it is not popularity exactly.
On Feb 6, 2014 11:16 AM, "Pat Ferrel"  wrote:

> The problem with the usual preference count is that big hit items can be
> overwhelmingly popular. If you want to know which ones the most people saw
> and are likely to have an opinion about then this seems a good measure. But
> these hugely popular items may not differentiate taste.
>
> So we calculate the “important” taste indicators with LLR. The benefit of
> the similarity matrix is that it attempts to model the “important”
> cooccurrences.
>
> There is an effect of hugely popular items where they really say nothing
> about similarity of taste. Everyone likes motherhood and apple pie so it
> doesn't say much about us if we both do too. This is usually accounted for
> with something like TFIDF so I suppose another weighted popularity measure
> would be to run the preference matrix through TFIDF to de-weight
> non-differentiating preferences.
>
> On Feb 6, 2014, at 7:14 AM, Ted Dunning  wrote:
>
> If you look at the indicator matrix (cooccurrence reduced by LLR), you will
> usually have asymmetry due to limitations on the number of indicators per
> row.
>
> This will give you some interesting results when you look at the column
> sums.  I wouldn't call it popularity, but it is an interesting measure.
>
>
>
> On Thu, Feb 6, 2014 at 2:15 PM, Sean Owen  wrote:
>
> > I have always defined popularity as just the number of ratings/prefs,
> > yes. You could rank on some kind of 'net promoter score' -- good
> > ratings minus bad ratings -- though that becomes more like 'most
> > liked'.
> >
> > How do you get popularity from similarity -- similarity to what?
> > Ranking by sum of similarities seems more like a measure of how much
> > the item is the 'centroid' of all items. Not necessarily most popular
> > but 'least eccentric'.
> >
> >
> > On Thu, Feb 6, 2014 at 7:41 AM, Tevfik Aytekin  >
> > wrote:
> >> Well, I think what you are suggesting is to define popularity as being
> >> similar to other items. So in this way most popular items will be
> >> those which are most similar to all other items, like the centroids in
> >> K-means.
> >>
> >> I would first check the correlation between this definition and the
> >> standard one (that is, the definition of popularity as having the
> >> highest number of ratings). But my intuition is that they are
> >> different things. For example, an item might lie at the center in the
> >> similarity space but it might not be a popular item. However, there
> >> might still be some correlation, it would be interesting to check it.
> >>
> >> hope it helps
> >>
> >>
> >>
> >>
> >> On Wed, Feb 5, 2014 at 3:27 AM, Pat Ferrel 
> > wrote:
> >>> Trying to come up with a relative measure of popularity for items in a
> > recommender. Something that could be used to rank items.
> >>>
> >>> The user - item preference matrix would be the obvious thought. Just
> > add the number of preferences per item. Maybe transpose the preference
> > matrix (the temp DRM created by the recommender), then for each row
> vector
> > (now that a row = item) grab the number of non zero preferences. This
> > corresponds to the number of preferences, and would give one measure of
> > popularity. In the case where the items are not boolean you'd sum the
> > weights.
> >>>
> >>> However it might be a better idea to look at the item-item similarity
> > matrix. It doesn't need to be transposed and contains the "important"
> > similarities--as calculated by LLR for example. Here similarity means
> > similarity in which users preferred an item. So summing the non-zero
> > weights would give perhaps an even better relative "popularity" measure.
> > For the same reason clustering the similarity matrix would yield
> > "important" clusters.
> >>>
> >>> Anyone have intuition about this?
> >>>
> >>> I started to think about this because transposing the user-item matrix
> > seems to yield a format that cannot be sent directly into clustering.
> >
>
>


Re: Popularity of recommender items

2014-02-06 Thread Sean Owen
I have always defined popularity as just the number of ratings/prefs,
yes. You could rank on some kind of 'net promoter score' -- good
ratings minus bad ratings -- though that becomes more like 'most
liked'.

How do you get popularity from similarity -- similarity to what?
Ranking by sum of similarities seems more like a measure of how much
the item is the 'centroid' of all items. Not necessarily most popular
but 'least eccentric'.
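
For what it's worth, counting preferences per item is a few lines against
the Taste DataModel -- a fragment, with construction of "model" elided (see
FileDataModel):

    import org.apache.mahout.cf.taste.impl.common.LongPrimitiveIterator;
    import org.apache.mahout.cf.taste.model.DataModel;

    // Popularity as the raw number of users with a preference for each item:
    LongPrimitiveIterator itemIDs = model.getItemIDs();
    while (itemIDs.hasNext()) {
      long itemID = itemIDs.nextLong();
      int popularity = model.getNumUsersWithPreferenceFor(itemID);
      System.out.println(itemID + "\t" + popularity);
    }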


On Thu, Feb 6, 2014 at 7:41 AM, Tevfik Aytekin  wrote:
> Well, I think what you are suggesting is to define popularity as being
> similar to other items. So in this way most popular items will be
> those which are most similar to all other items, like the centroids in
> K-means.
>
> I would first check the correlation between this definition and the
> standard one (that is, the definition of popularity as having the
> highest number of ratings). But my intuition is that they are
> different things. For example, an item might lie at the center in the
> similarity space but it might not be a popular item. However, there
> might still be some correlation, it would be interesting to check it.
>
> hope it helps
>
>
>
>
> On Wed, Feb 5, 2014 at 3:27 AM, Pat Ferrel  wrote:
>> Trying to come up with a relative measure of popularity for items in a 
>> recommender. Something that could be used to rank items.
>>
>> The user - item preference matrix would be the obvious thought. Just add the 
>> number of preferences per item. Maybe transpose the preference matrix (the 
>> temp DRM created by the recommender), then for each row vector (now that a 
>> row = item) grab the number of non zero preferences. This corresponds to the 
>> number of preferences, and would give one measure of popularity. In the case 
>> where the items are not boolean you'd sum the weights.
>>
>> However it might be a better idea to look at the item-item similarity 
>> matrix. It doesn't need to be transposed and contains the "important" 
>> similarities--as calculated by LLR for example. Here similarity means 
>> similarity in which users preferred an item. So summing the non-zero weights 
>> would give perhaps an even better relative "popularity" measure. For the 
>> same reason clustering the similarity matrix would yield "important" 
>> clusters.
>>
>> Anyone have intuition about this?
>>
>> I started to think about this because transposing the user-item matrix seems 
>> to yield a format that cannot be sent directly into clustering.


Re: Mahout 0.9 with cloudera

2014-02-06 Thread Sean Owen
Yeah that's the version that's bundled with 4.x. 5.x has basically 0.8
plus patches to work on MR2.

Mahout is not really something you have to install, even though it
does get packaged and dumped onto the cluster nodes. Just use it
against your cluster -- it can be run from a machine that isn't part of
the cluster.

The issue is just version compatibility. For CDH with classic MR1
('mapreduce' service enabled), I expect it just works. For CDH with
MR2 ('yarn' service enabled), I imagine you would have to compile
Mahout with the Hadoop 2.x profile. But beyond that I don't know of
any reason it would not just work.

On Thu, Feb 6, 2014 at 4:48 AM, Kevin Moulart  wrote:
> Hi everyone,
>
> Is there a simple way to install Mahout 0.9 on a cluster running Cloudera's
> CDH 4.5 ?
>
> When I try what they advise on their doc (yum install mahout on my CentOS 6
> node), it wants to install mahout version 0.7+22-1.cdh4.5.0.p0.14.el6.
>
> Thanks in advance !
>
> --
> Kévin Moulart
> GSM France : +33 7 81 06 10 10
> GSM Belgique : +32 473 85 23 85
> Téléphone fixe : +32 2 771 88 45


Re: Question about Pearson Correlation in non-Taste mode

2013-12-01 Thread Sean Owen
It's not an issue of how to be careful with sparsity and subtracting
means, although that's a valuable point in itself. The question is
what the mean is supposed to be.

You can't think of missing ratings as 0 in general, and the example
here shows why: you're acting as if most movies are hated. Instead
they are excluded from the computation entirely.

m_x should be 4.5 in the example here. That's consistent with
literature and the other implementations earlier in this project.

I don't know the Hadoop implementation well enough, and wasn't sure
from the comments above, whether it does end up behaving as if it's
"4.5" or "3". If it's not 4.5 I would call that a bug. Items that
aren't co-rated can't meaningfully be included in this computation.


On Sun, Dec 1, 2013 at 8:29 AM, Ted Dunning  wrote:
> Good point Amit.
>
> Not sure how much this matters.  It may be that
> PearsonCorrelationSimilarity is a bad name that should be
> PearsonInspiredCorrelationSimilarity.  My guess is that this implementation
> is lifted directly from the very early recommendation literature and is
> reflective of the way that it was used back then.


Re: Mahout 0.8 Random Forest Accuracy

2013-10-18 Thread Sean Owen
Yes I looked at the impl here, and I think it is aging, since I'm not
sure Deneche had time to put in many bells and whistles at the start,
and not sure it's been touched much since.

My limited experience is that it generally does less clever stuff than
R, which in turn is less clever than sklearn et al., hence the gap in
results. There are lots of ways you can do better than the original
Breiman paper, which is what R sticks to mostly.

Weirdly I was just having a long conversation about this exact topic
today, since I'm working on an RDF implementation on Hadoop. (I think
it might be worth a new implementation after this much time, if one
were looking to revamp RDF on Hadoop and inject some new tricks. It
needs some different design choices.)


Anyway, the question was indeed: which splits of an N-valued
categorical (nominal) variable should be considered? Considering all
2^N - 2 of them is not scalable, especially since I don't want any limit
on N.

There are easy, fast ways to figure out what splits to consider for
every other combination of categorical/numeric feature F predicting
categorical/numeric target T, but I couldn't find any magic for one:
categorical F predicting categorical T.

I ended up making up a heuristic that is at least linear in N, and I
wonder if anyone is a) interested in talking about this at all or b)
has the magic answer here.

So -- sort the values of F by the entropy of T considered over
examples for just that value of F. Then consider splits based on
prefixes of that list. So if F = [a, b, c, d] and in order by entropy
of T they are [b, c, a, d] then consider rules like F in {b}, F in
{b,c}, F in {b,c,a}.

This isn't a great heuristic but seems to work well in practice.
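
In code the heuristic is tiny. A sketch (all names here are made up for
illustration; EntropyFn stands in for however you compute the entropy of T
over examples where F = value):

import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

final class CategoricalSplits {

  interface EntropyFn {
    double entropyGiven(String value);
  }

  // Returns N-1 candidate subsets instead of 2^N - 2: the prefixes of the
  // value list ordered by per-value target entropy, e.g. {b}, {b,c}, {b,c,a}
  // for values ordered [b, c, a, d].
  static List<List<String>> candidateSplits(List<String> values, final EntropyFn h) {
    List<String> ordered = new ArrayList<String>(values);
    Collections.sort(ordered, new Comparator<String>() {
      @Override
      public int compare(String a, String b) {
        return Double.compare(h.entropyGiven(a), h.entropyGiven(b));
      }
    });
    List<List<String>> candidates = new ArrayList<List<String>>();
    for (int i = 1; i < ordered.size(); i++) {
      candidates.add(new ArrayList<String>(ordered.subList(0, i)));
    }
    return candidates;
  }
}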


I suppose it's this and a lot of other little tricks like it that
could improve this or any other implementation -- RDF makes speed and
accuracy pretty trade-off-able, so anything that makes things faster
can make it instead more accurate or vice versa.

Definitely an interesting topic I'd be interested to cover with anyone
building RDFs now.


On Fri, Oct 18, 2013 at 7:26 PM, DeBarr, Dave  wrote:
> Another difference...
>
> R's randomForest package (which RRF is based on) evaluates subsets of values 
> when partitioning nominal values.  [This is why it complains if there are 
> more than 32 distinct values for a nominal variable.]
>
> For example, if our nominal variable has values { A, B, C, D }, the package 
> will consider "in { A, C }" versus "not in { A, C }" as a partition candidate.
>
> -Original Message-
> From: Ted Dunning [mailto:ted.dunn...@gmail.com]
> Sent: Friday, October 18, 2013 10:42 AM
> To: user@mahout.apache.org
> Subject: Re: Mahout 0.8 Random Forest Accuracy
>
> On Fri, Oct 18, 2013 at 7:48 AM, Tim Peut  wrote:
>
>> Has anyone found that Mahout's random forest doesn't perform as well as
>> other implementations? If not, is there any reason why it wouldn't perform
>> as well?
>>
>
> This is disappointing, but not entirely surprising.  There has been
> considerably less effort applied to Mahouts random forest package than the
> comparable R packages.
>
> Note, particularly that the Mahout implementation is not regularized.  That
> could well be a big difference.


Re: Tuning parameters for ALS-WR

2013-09-11 Thread Sean Owen
On Wed, Sep 11, 2013 at 12:22 AM, Parimi Rohit wrote:

> 1. Do we have to follow this setting, to compare algorithms? Can't we
> report the parameter combination for which we get highest mean average
> precision for the test data, when trained on the train set, with out any
> validation set.
>

As Ted alludes to, this would overfit. Think of it as two learning
processes. You learn model hyper-parameters like lambda, and you learn
model parameters like your matrix decomposition. So there must be two
levels of hold-out.


> 2. Do we have to tune the "similarityclass" parameter in item-based CF? If
> so, do we compare the mean average precision values based on validation
> data, and then report the same for the test set?
>
>
Yes, you are conceptually searching over the entire hyper-parameter space. If
the similarity metric is one of those, you are trying different metrics.
Grid search, just brute-force trying combinations, works for 1-2
hyper-parameters. Otherwise I'd try randomly choosing parameters, really,
or else it will take way too long to explore. You try to pick
hyper-parameters 'nearer' to those that have yielded better scores.
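
A sketch of that, with the two levels of hold-out and a random search
(evalOnValidation and evalOnTest are placeholders for training a model and
computing mean average precision -- they are not real Mahout APIs):

import java.util.Random;

final class RandomSearch {

  static void search(int trials) {
    Random rng = new Random();
    double bestScore = Double.NEGATIVE_INFINITY;
    double bestLambda = 0.0;
    int bestFeatures = 0;
    for (int i = 0; i < trials; i++) {
      double lambda = Math.pow(10.0, -4.0 + 4.0 * rng.nextDouble()); // log-uniform in [1e-4, 1]
      int features = 10 + rng.nextInt(90);                           // uniform in [10, 99]
      double score = evalOnValidation(lambda, features);
      if (score > bestScore) {
        bestScore = score;
        bestLambda = lambda;
        bestFeatures = features;
      }
    }
    // Touch the test set exactly once, with the chosen hyper-parameters.
    System.out.println(evalOnTest(bestLambda, bestFeatures));
  }

  static double evalOnValidation(double lambda, int features) { return 0.0; } // placeholder
  static double evalOnTest(double lambda, int features) { return 0.0; }       // placeholder
}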


Re: running mahout on Hadoop 2.0.0-cdh4.3.1

2013-09-10 Thread Sean Owen
You are trying to run on Hadoop 2 and Mahout only works with Hadoop 1 and
related branches. This won't work.

However the CDH distributions also come in an 'mr1' flavor that stands a
much better chance of working with something that is built for Hadoop 1.
Use 2.0.0-mr1-4.3.1 instead. (PS 4.3.2 and 4.4.0 are available now)

You will likely still have to compile Mahout again with this different
dependency to get it to work but with any luck that's it.

Sean
On Sep 10, 2013 6:34 PM, "Parimi Rohit"  wrote:

> Hi All,
>
> I am used to running mahout (mahout-core-0.9-SNAPSHOT-job.jar) in the
> Apache Hadoop environment, however, we had to switch to Cloudera
> distribution.
>
> When I try to run the item based collaborative filtering job
> (org.apache.mahout.cf.taste.hadoop.item.RecommenderJob) in the Cloudera
> distribution, I get the following error,
>
> Exception in thread "main" java.lang.IncompatibleClassChangeError: Found
> interface org.apache.hadoop.mapreduce.JobContext, but class was expected
> at
> org.apache.mahout.common.HadoopUtil.getCustomJobName(HadoopUtil.java:174)
> at
> org.apache.mahout.common.AbstractJob.prepareJob(AbstractJob.java:614)
> at
>
> org.apache.mahout.cf.taste.hadoop.preparation.PreparePreferenceMatrixJob.run(PreparePreferenceMatrixJob.java:75)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
> at
>
> org.apache.mahout.cf.taste.hadoop.item.RecommenderJob.run(RecommenderJob.java:158)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
> at
>
> org.apache.mahout.cf.taste.hadoop.item.RecommenderJob.main(RecommenderJob.java:312)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
>
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at
>
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at org.apache.hadoop.util.RunJar.main(RunJar.java:208)
>
> Is there a way to run Mahout in the Cloudera environment? I mean, a
> download specific to Cloudera distribution of Hadoop?
>
> Thanks in advance,
> Rohit
>


Re: ALS and SVD feature vectors

2013-09-04 Thread Sean Owen
The feature vectors? rows of X and Y? no, they definitely should not be
normalized. It will change the approximation you so carefully built quite a
lot.

As you say U and V are orthonormal in the SVD. But you still multiply all
of them together with Sigma when making recs. (Or you embed Sigma in U and
V.)  So yes the singular values are used; they give proper weights to
features.

You can think of X and Y as being like that, with Sigma mixed in in some
arbitrary way. Normalizing it would not be valid.
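
Concretely, scoring a user-item pair is just the raw dot product of the two
un-normalized feature vectors -- a sketch:

// Raw dot product of user vector x and item vector y; higher = better rec.
// Normalizing x or y would rescale scores differently per user/item and
// reorder results, which is why it isn't valid here.
static double score(double[] x, double[] y) {
  double dot = 0.0;
  for (int i = 0; i < x.length; i++) {
    dot += x[i] * y[i];
  }
  return dot;
}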


On Wed, Sep 4, 2013 at 6:07 PM, Koobas  wrote:

> In ALS the coincidence matrix is approximated by XY',
> where X is user-feature, Y is item-feature.
> Now, here is the question:
> are/should the feature vectors be normalized before computing
> recommendations?
>
> Now, what happens in the case of SVD?
> The vectors are normal by definition.
> Are singular values used at all, or just left and right singular vectors?
>


Re: Install mahout 0.8 with hadoop 2.0

2013-08-13 Thread Sean Owen
I think it all minimally works on Hadoop 2.0.x, though I haven't tried
it recently -- it does require a recompile.

This is different from it working on MRv2 versus MRv1. I'm almost
certain it does not work on MRv2 and doubt it will.

The effort is not large, but it's subtle. A few hacks may fail in
mysterious ways, and certainly to properly use MRv2 you have to switch
to use the newer resource configuration scheme -- in terms of
megabytes not reducer slots and all that.

At least this was most of the work that I remember when I was
rebuilding some of this type of stuff on MRv2 + Hadoop 2.0.x.

On Tue, Aug 13, 2013 at 5:58 PM, Ted Dunning  wrote:
> No.  There is very small demand for Mahout on Hadoop 2.0 so far and the
> forward/backward incompatibility of 2.0 has made it difficult to motivate
> moving to 2.0.
>
> The bigtop guys built a maven profile for 0.23 some time ago.  I don't know
> the status of that.
>
> I don't think that the differences are huge ... it is just the standard
> Hadoop forklift-the-world upgrade experience.
>
>
>
> On Tue, Aug 13, 2013 at 6:49 AM, Sergey Svinarchuk <
> ssvinarc...@hortonworks.com> wrote:
>
>> Hi all,
>>
>> Has anybody compiled and installed Mahout with Hadoop 2.0? If yes, what
>> changes did you make in Mahout so that it passes 100% of the unit tests and
>> works successfully with Hadoop 2.0?
>>
>> Thanks
>>


Re: Data distribution guidance for recommendation engines

2013-07-31 Thread Sean Owen
On Thu, Aug 1, 2013 at 3:15 AM, Chloe Guszo  wrote:
> If I split my data into train and test sets, I can show good performance of

Good performance according to what metric? it makes a lot of
difference whether you are talking about precision/recall or RMSE.

> the model on the train set. What might I expect given an uneven
> distribution of ratings? Imagine a situation where 50% of the ratings are
> 1s, and the rest 2-5. Will the model be biased towards rating items a 1? Do

In the general case, recommenders don't rate items at all, they rank
items. So this may not be a question that matters.

> about the rating scale itself. For example, given [1:3] vs [1:10] ranges,
> in with the former, you've got a 1/3 chance of predicting the correct
> rating, say, while in the latter case it is a 1/10.  Or, when is sparse too

Why do you say that... the recommender is not choosing ratings randomly.


> Ultimately, I'm trying to figure out under what conditions one would look
> at a model and say "that is crap", pardon my language. Do any more

You use evaluation metrics, which are imperfect, but about the best
you can do in the lab. If you're already doing that, you're doing
fine. This is true no matter what your input is like.

If your input is things like click count, then they will certainly be
mostly 1 and follow a power-law distribution. This is no problem but
you want to follow the 'implicit feedback' version of ALS, where you
are not trying to reconstruct the input but use the input as weights.


Re: Latent Dirichlet Allocation (cvb)

2013-07-31 Thread Sean Owen
FWIW I know Mahout 0.8 works fine with CDH4 (the "mr1" version of
course) and is what CDH5 will include. Should be no problems there.

On Wed, Jul 31, 2013 at 4:33 PM, Marco  wrote:
> great. at least i know what's wrong :)
>
> will check out if cloudera supports mahout 0.8.
>
> meanwhile we'll drop LDA and retry our first approach (k-means)
>
> thanks everyone!
>


Re: Calculating affinity

2013-07-23 Thread Sean Owen
Here's just one perspective --

Yes this is kind of how things like ALS work. The input values are
viewed as 'weights', not ratings. They're not reconstructed directly
but used as a weight in a loss function. This turns out to make more
sense when paired with a squared-error loss function, as it inevitably
is.

The nice thing is that weight-like data can naturally include many
different types of activities, since weights can be added
meaningfully.

But then how do you pick the weights? For this I don't have a strong,
principled answer, maybe someone else does. You can pick something
based on other information you have: if event A happens N times more
than B does, maybe B is N times more significant and deserves N times
as much weight. You can always test various values and evaluate test
metrics to see what works best.
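
As a sketch of the weighting idea (the weights below are invented for
illustration; in practice you'd set them by testing, as described above):

import java.util.HashMap;
import java.util.Map;

final class Affinity {

  private static final Map<String, Double> WEIGHTS = new HashMap<String, Double>();
  static {
    WEIGHTS.put("view", 1.0);
    WEIGHTS.put("cart", 3.0);
    WEIGHTS.put("purchase", 10.0);
  }

  // events: count of each event type for one (user, item) pair
  static double score(Map<String, Integer> events) {
    double total = 0.0;
    for (Map.Entry<String, Integer> e : events.entrySet()) {
      Double w = WEIGHTS.get(e.getKey());
      if (w != null) {
        total += w * e.getValue();
      }
    }
    return total; // use as the 'preference'/weight value fed to the recommender
  }
}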

On Tue, Jul 23, 2013 at 2:07 PM, Jayesh  wrote:
> Hi,
>
> Consider this as a newbie question.
>
> I have been reading about CF algorithms. Everyone seems to be taking the
> preference value as ratings, or any singular attribute. However, in a
> typical ecommerce scenario the entire clickstream data is important ( with
> varying weights) to determine the affinity of the user vs item.
>
> So, my question is, in production, do we consider many such parameters to
> calculate user vs item affinity or do we just pick any one parameter.
>
> If we pick any one parameter, how do we decide which is the one that will
> reflect the affinity in the best possible way?
>
> If we consider many parameters, do we use any kind of a regression to
> formulate the affinity score (that takes into consideration all the
> features and their respective weights that impact the user's likelihood) and
> run any CF algorithm over these scores?
>
>
> Thanks.
>
> --
> Best Regards,
>
> Jayesh


Re: Issue when running Mahout Recommender Demo

2013-07-19 Thread Sean Owen
I think this is just old, and now you need to run from examples/. I
admit I don't know if this Jetty-based demo is still working or even in the
project though. If it's not working, it should just be deleted.

On Fri, Jul 19, 2013 at 4:21 AM, Jason Lee  wrote:
> Hi, guys,
>
> I was trying to follow the doc
> below: https://cwiki.apache.org/confluence/display/MAHOUT/Recommender+Documentation
>
> When I run jetty:run under *mahout-integration*, I am getting a
> ClassNotFoundException:
> org.apache.mahout.cf.taste.example.grouplens.GroupLensRecommender.
>
> I noticed that GroupLensRecommender belongs to *mahout-examples*, so
> I attempted to add mahout-examples to the dependencies of mahout-integration;
> but, unfortunately, there is already a reverse dependency between
> mahout-examples & mahout-integration, and a circular dependency is not
> allowed in Maven, so I failed.
>
> Btw, I am running Mahout 0.8.


Myrrix is now a part of Cloudera

2013-07-16 Thread Sean Owen
This may be relevant enough to announce here:
http://blog.cloudera.com/blog/2013/07/myrrix-joins-cloudera-to-bring-big-learning-to-hadoop/

(Brief recap: Myrrix is a product / project / tiny company related to
large-scale recommenders, and shares some APIs and background with
Mahout.)

I think this indicates just how much interest there is, now more than
ever, in learning at scale. FWIW I will be Director of Data Science at
Cloudera and hope to be able to do more there to advance open-source
big learning in general and along the way Mahout support in particular
in CDH.

Sean


Re: LZ4 file extensions from Mahout recommender

2013-07-04 Thread Sean Owen
This is nothing to do with Mahout, but how your Hadoop cluster is
configured. I assume you have turned on map / reduce output compression
and are using the LZ4 codec.

On Thu, Jul 4, 2013 at 11:06 AM, Sugato Samanta  wrote:
> Hello,
>
> I was trying to execute the recommendation using movie lens data (
> http://www.grouplens.org/node/73). The mahout code is running fine but the
> output files are being generated in LZ4 format. Does anyone know how to
> uncompress this type of file in Linux?
>
> Cloudera version: cdh4.2
> Linux version: Linux 2.6.18-348.6.1.el5 (red hat)
> Hadoop version: 2.0.0
> Mahout Version: 0.7
>
> Code used:
> /usr/bin/mahout recommenditembased --input mahout_recommender/ratings.csv
> --output mahout_recommender/output_data --tempDir mahout_recommender/tmp
> --usersFile mahout_recommender/users.txt --similarityClassname
> SIMILARITY_COOCCURRENCE
>
> Output files generated:
> [root@INFADDAD19 ~]# hdfs dfs -ls mahout_recommender/output_data
> Found 32 items
> -rw-r--r--   3 root supergroup  0 2013-07-04 05:39
> mahout_recommender/output_data/_SUCCESS
> drwxr-xr-x   - root supergroup  0 2013-07-04 05:38
> mahout_recommender/output_data/_logs
> -rw-r--r--   3 root supergroup   9302 2013-07-04 05:38
> mahout_recommender/output_data/*part-r-0.lz4*
> -rw-r--r--   3 root supergroup   8885 2013-07-04 05:38
> mahout_recommender/output_data/*part-r-1.lz4*
> -rw-r--r--   3 root supergroup  10033 2013-07-04 05:38
> mahout_recommender/output_data/*part-r-2.lz4*
>
> It is generating around 29 LZ4 files but i am specifying only 3 here. Thank
> you.
>
> Regards,
> Sugato


Re: UseConcMarkSweepGC with Mahout

2013-07-02 Thread Sean Owen
This is old-ish advice. I tend to favor UseParallelOldGC even on Java
7, over G1GC, even though it may even be a default now?

The "Old" just means it also uses a parallel collector thread on the
old generation. In general it's good to make use of increasingly
multi-core machines by making GC multi-threaded. I don't know of any
reason PermGen would be handled better or worse by it. Generally
speaking the code I think you are using doesn't create much garbage
anyhow.

G1GC had some issues for me, but it may be weirdness specific to Apple's VM.

You should use whatever you find to be best if you know what you're
doing and have tested.
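
For a Tomcat-hosted servlet, assuming the stock catalina scripts (which
read CATALINA_OPTS from the environment or setenv.sh), that would be
something like:

CATALINA_OPTS="-XX:+UseParallelGC -XX:+UseParallelOldGC"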

On Tue, Jul 2, 2013 at 10:46 AM, Aleksei Udatšnõi  wrote:
> JVM tuning section of "Mahout in Action" book recommends to use the
> following GC instead of the default one:
>
>  -XX:+UseParallelGC -XX:+UseParallelOldGC.
>
> Has anyone tried running servlets based on Mahout with UseConcMarkSweepGC
> instead?
>
> Latter garbage collector is known for handling PermGen in Tomcat better
> (especially during undeployment), but I am not yet sure how it will affect
> the performance of Mahout servlets.
>
> The servlets produce user- and item-based recommendations, so its memory
> usage and garbage collection should be as optimal as possible.
>
> Thank you,
> Aleksei


Re: Consistent repeatable results for distributed ALS-WR recommender

2013-06-24 Thread Sean Owen
On Tue, Jun 25, 2013 at 12:44 AM, Michael Kazekin  wrote:
> But doesn't alternation guarantee convexity?

No, the problem remains non-convex. At each step, where half the
parameters are fixed, yes that constrained problem is convex. But each
of these is not the same as the overall global problem being solved.

> Yeah, but then you start dealing with another problem, how to blend all 
> results together and how doing this affects overall quality of results (in 
> our case recommendations), right?

No you would usually just take the best solution and use it alone. Or
at least, that's a fine thing to do.


Re: Consistent repeatable results for distributed ALS-WR recommender

2013-06-24 Thread Sean Owen
Yeah this has gone well off-road.

ALS is not non-deterministic because of hardware errors or cosmic
rays. It's also nothing to do with floating-point round-off, or
certainly, that is not the primary source of non-determinism to
several orders of magnitude.

ALS starts from a random initial solution, so each run can converge to
a different final solution. The overall problem is non-convex and the
process will not necessarily converge to the same solution.

Randomness is a common feature of machine learning: centroid selection
in k-means, the 'stochastic' in SGD, random forests, etc. I don't
think the question is why randomness is useful right?

For ALS... I don't quite understand the question, what's the
alternative? certainly I have always seen it formulated in terms of a
random initial solution. You don't want to always start from the same
point because of local minima. Ideally you start from many points and
take the best solution.
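
A sketch of multiple restarts (trainAls and loss are placeholders, not
Mahout APIs; note that fixing baseSeed also makes the whole procedure
repeatable run-to-run, which was the original question here):

import java.util.Random;

final class Restarts {

  static Object bestOf(int restarts, long baseSeed) {
    Object best = null;
    double bestLoss = Double.POSITIVE_INFINITY;
    for (int i = 0; i < restarts; i++) {
      Object model = trainAls(new Random(baseSeed + i)); // distinct random init per run
      double l = loss(model);                            // training objective value
      if (l < bestLoss) {
        bestLoss = l;
        best = model;
      }
    }
    return best;
  }

  static Object trainAls(Random rng) { return new Object(); } // placeholder
  static double loss(Object model) { return 0.0; }            // placeholder
}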

On Mon, Jun 24, 2013 at 11:22 PM, Ted Dunning  wrote:
> This is a common chestnut that gets trotted out commonly, but I doubt that
> the effects that the OP was worried about were on the same scale.
>  Non-commutativity of FP arithmetic on doubles rarely has a very large
> effect.
>
>
> On Mon, Jun 24, 2013 at 11:17 PM, Michael Kazekin wrote:
>
>> Any algorithm is non-deterministic because of non-deterministic behavior
>> of underlying hardware, of course :) But that's an offtop. I'm talking
>> about specific implementation of specific algorithm, and in general I'd
>> like to know that at least some very general properties of the algorithm
>> implementation conserve (and why did authors added intentional
>> non-deterministic component to implementation).


Re: Log-likelihood ratio test as a probability

2013-06-20 Thread Sean Owen
Yes that should be all that's needed.
On Jun 20, 2013 10:27 AM, "Dan Filimon"  wrote:

> Right, makes sense. So, by normalize, I need to replace the counts in the
> matrix with probabilities.
> So, I would divide everything by the sum of all the counts in the matrix?
>
>
> On Thu, Jun 20, 2013 at 12:16 PM, Sean Owen  wrote:
>
> > I think the quickest answer is: the formula computes the test
> > statistic as a difference of log values, rather than log of ratio of
> > values. By not normalizing, the entropy is multiplied by a factor (sum
> > of the counts) vs normalized. So you do end up with a statistic N
> > times larger when counts are N times larger.
> >
> > On Thu, Jun 20, 2013 at 9:52 AM, Dan Filimon
> >  wrote:
> > > My understanding:
> > >
> > > Yes, the log-likelihood ratio (-2 log lambda) follows a chi-squared
> > > distribution with 1 degree of freedom in the 2x2 table case.
> > >   A   ~A
> > > B
> > > ~B
> > >
> > > We're testing to see if p(A | B) = p(A | ~B). That's the null
> > hypothesis. I
> > > compute the LLR. The larger that is, the more unlikely the null
> > hypothesis
> > > is to be true.
> > > I can then look at a table with df=1. And I'd get p, the probability of
> > > seeing that result or something worse (the upper tail).
> > > So, the probability of them being similar is 1 - p (which is exactly
> the
> > > CDF for that value of X).
> > >
> > > Now, my question is: in the contingency table case, why would I
> > normalize?
> > > It's a ratio already, isn't it?
> > >
> > >
> > > On Thu, Jun 20, 2013 at 11:03 AM, Sean Owen  wrote:
> > >
> > >> someone can check my facts here, but the log-likelihood ratio follows
> > >> a chi-square distribution. You can figure an actual probability from
> > >> that in the usual way, from its CDF. You would need to tweak the code
> > >> you see in the project to compute an actual LLR by normalizing the
> > >> input.
> > >>
> > >> You could use 1-p then as a similarity metric.
> > >>
> > >> This also isn't how the test statistic is turned into a similarity
> > >> metric in the project now. But 1-p sounds nicer. Maybe the historical
> > >> reason was speed, or, ignorance.
> > >>
> > >> On Thu, Jun 20, 2013 at 8:53 AM, Dan Filimon
> > >>  wrote:
> > >> > When computing item-item similarity using the log-likelihood
> > similarity
> > >> > [1], can I simply apply a sigmoid do the resulting values to get the
> > >> > probability that two items are similar?
> > >> >
> > >> > Is there any other processing I need to do?
> > >> >
> > >> > Thanks!
> > >> >
> > >> > [1]
> http://tdunning.blogspot.ro/2008/03/surprise-and-coincidence.html
> > >>
> >
>


Re: Log-likelihood ratio test as a probability

2013-06-20 Thread Sean Owen
I think the quickest answer is: the formula computes the test
statistic as a difference of log values, rather than log of ratio of
values. By not normalizing, the entropy is multiplied by a factor (sum
of the counts) vs normalized. So you do end up with a statistic N
times larger when counts are N times larger.

On Thu, Jun 20, 2013 at 9:52 AM, Dan Filimon
 wrote:
> My understanding:
>
> Yes, the log-likelihood ratio (-2 log lambda) follows a chi-squared
> distribution with 1 degree of freedom in the 2x2 table case.
>   A   ~A
> B
> ~B
>
> We're testing to see if p(A | B) = p(A | ~B). That's the null hypothesis. I
> compute the LLR. The larger that is, the more unlikely the null hypothesis
> is to be true.
> I can then look at a table with df=1. And I'd get p, the probability of
> seeing that result or something worse (the upper tail).
> So, the probability of them being similar is 1 - p (which is exactly the
> CDF for that value of X).
>
> Now, my question is: in the contingency table case, why would I normalize?
> It's a ratio already, isn't it?
>
>
> On Thu, Jun 20, 2013 at 11:03 AM, Sean Owen  wrote:
>
>> someone can check my facts here, but the log-likelihood ratio follows
>> a chi-square distribution. You can figure an actual probability from
>> that in the usual way, from its CDF. You would need to tweak the code
>> you see in the project to compute an actual LLR by normalizing the
>> input.
>>
>> You could use 1-p then as a similarity metric.
>>
>> This also isn't how the test statistic is turned into a similarity
>> metric in the project now. But 1-p sounds nicer. Maybe the historical
>> reason was speed, or, ignorance.
>>
>> On Thu, Jun 20, 2013 at 8:53 AM, Dan Filimon
>>  wrote:
>> > When computing item-item similarity using the log-likelihood similarity
>> > [1], can I simply apply a sigmoid to the resulting values to get the
>> > probability that two items are similar?
>> >
>> > Is there any other processing I need to do?
>> >
>> > Thanks!
>> >
>> > [1] http://tdunning.blogspot.ro/2008/03/surprise-and-coincidence.html
>>


Re: Log-likelihood ratio test as a probability

2013-06-20 Thread Sean Owen
someone can check my facts here, but the log-likelihood ratio follows
a chi-square distribution. You can figure an actual probability from
that in the usual way, from its CDF. You would need to tweak the code
you see in the project to compute an actual LLR by normalizing the
input.

You could use 1-p then as a similarity metric.

This also isn't how the test statistic is turned into a similarity
metric in the project now. But 1-p sounds nicer. Maybe the historical
reason was speed, or, ignorance.
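
A sketch of the whole chain, using the LogLikelihood class from mahout-math
and Commons Math 2.x for the chi-square CDF (assuming both are on the
classpath):

import org.apache.commons.math.MathException;
import org.apache.commons.math.distribution.ChiSquaredDistributionImpl;
import org.apache.mahout.math.stats.LogLikelihood;

final class LlrSimilarity {
  // k11: users with both items; k12/k21: users with one but not the other;
  // k22: users with neither. The CDF value is 1 - p for the chi-square test
  // with 1 degree of freedom, usable directly as a similarity.
  static double similarity(long k11, long k12, long k21, long k22)
      throws MathException {
    double llr = LogLikelihood.logLikelihoodRatio(k11, k12, k21, k22);
    return new ChiSquaredDistributionImpl(1.0).cumulativeProbability(llr);
  }
}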

On Thu, Jun 20, 2013 at 8:53 AM, Dan Filimon
 wrote:
> When computing item-item similarity using the log-likelihood similarity
> [1], can I simply apply a sigmoid to the resulting values to get the
> probability that two items are similar?
>
> Is there any other processing I need to do?
>
> Thanks!
>
> [1] http://tdunning.blogspot.ro/2008/03/surprise-and-coincidence.html


Re: Negative Preferences in a Recommender

2013-06-18 Thread Sean Owen
I'm suggesting using numbers like -1 for thumbs-down ratings, and then
using these as a positive weight towards 0, just like positive values
are used as positive weighting towards 1.

Most people don't make many negative ratings. For them, what you do
with these doesn't make a lot of difference. It might for the few
expert users, and they might be the ones that care. For me it was
exactly this... user acceptance testers were pointing out that
thumbs-down ratings didn't seem to have the desired effect, because
they saw the result straight away.

Here's an alternative structure that doesn't involve thumbs-down:
choose 4 items, and sample in a way to prefer items that are distant
in feature space. Ask the user to pick 1 that is most interesting.
Repeat a few times.

On Tue, Jun 18, 2013 at 3:55 PM, Pat Ferrel  wrote:
> To your point Ted, I was surprised to find that remove-from-cart actions 
> predicted sales almost as well as purchases did but it also meant filtering 
> from recs. We got the best scores treating them as purchases and not 
> recommending them again. No one pried enough to get bothered.
>
> In this particular case I'm ingesting movie reviews, thumbs up or down. I'm 
> trying to prime the pump for a cold start case of a media guide app with 
> expert reviews but no users yet. Expert reviewers review everything so I 
> don't think there will be much goodness in treating a thumbs down like a 
> thumbs up in this particular case. Sean, are you suggesting that negative 
> reviews might be modeled as a "0" rather than no value? Using the Mahout 
> recommender this will only show up in filtering the negatives out of recs as 
> Ted suggests, right? Since a "0" preference would mean, don't recommend, just 
> as a preference of "1" would. This seems like a good approach but I may have 
> missed something in your suggestion.
>
> In this case I'm not concerned with recommending to experts, I'm trying to 
> make good recs to new users with few thumbs up or down by comparing them to 
> experts with lots of thumbs up and down.The similarity metric will have new 
> users with only a few preferences and will compare them to reviewers with 
> many many more. I wonder if this implies a similarity metric that uses only 
> common values (cooccurrence) rather than the usual log-likelihood? I guess 
> it's easy to try both.
>
> Papers I've read on this subject. The first has an interesting discussion of 
> using experts in CF.
> http://www.slideshare.net/xamat/the-science-and-the-magic-of-user-feedback-for-recommender-systems
> http://www.sis.pitt.edu/~hlee/paper/umap2009_LeeBrusilovsky.pdf


Re: Negative Preferences in a Recommender

2013-06-18 Thread Sean Owen
Yes the model has no room for literally negative input. I think that
conceptually people do want negative input, and in this model,
negative numbers really are the natural thing to express that.

You could give negative input a small positive weight. Or extend the
definition of c so that it is merely small, not negative, when r is
negative. But this was generally unsatisfactory. It has a logic, that
even negative input is really a slightly positive association in the
scheme of things, but the results were viewed as unintuitive.

I ended up extending it to handle negative input more directly, such
that negative input is read as evidence that p=0, instead of evidence
that p=1. This works fine, and tidier than an ensemble (although
that's a sound idea too). The change is quite small.

Agree with the second point that learning weights is manual and
difficult; that's unavoidable I think when you want to start adding
different data types anyway.

I also don't use M/R for searching parameter space since you may try a
thousand combinations and each is a model build from scratch. I use a
sample of data and run in-core.
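
The encoding itself is small -- a sketch (plain encoding, not a Mahout API):

// A negative observation becomes evidence that p = 0, with confidence that
// grows with its magnitude, symmetric with the positive case.
static double[] encode(double r, double alpha) {
  double p = r > 0.0 ? 1.0 : 0.0;        // target: association, or not
  double c = 1.0 + alpha * Math.abs(r);  // confidence from strength of the signal
  return new double[] {p, c};
}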

On Tue, Jun 18, 2013 at 2:30 AM, Dmitriy Lyubimov  wrote:
> (Kinda doing something very close. )
>
> The Koren-Volinsky paper on implicit feedback can be generalized to decompose
> all input into preference (0 or 1) and confidence matrices (which is
> essentially an observation weight matrix).
>
> If you did not get any observations, you encode it as (p=0,c=1) but if you
> know that user did not like item, you can encode that observation with much
> more confidence weight, something like (p=0, c=30) -- actually as high
> confidence as a conversion in your case it seems.
>
> The problem with this is that you end up with quite a bunch of additional
> parameters in your model to figure, i.e. confidence weights for each type
> of action in the system. You can establish that thru extensive
> crossvalidation search, which is initially quite expensive (even for
> distributed machine cluster tech), but could be incrementally bail out much
> sooner after previous good guess is already known.
>
> MR doesn't work well for this though since it requires  A LOT of iterations.
>
>
>
> On Mon, Jun 17, 2013 at 5:51 PM, Pat Ferrel  wrote:
>
>> In the case where you know a user did not like an item, how should the
>> information be treated in a recommender? Normally for retail
>> recommendations you have an implicit 1 for a purchase and no value
>> otherwise. But what if you knew the user did not like an item? Maybe you
>> have records of "I want my money back for this junk" reactions.
>>
>> You could make a scale, 0, 1 where 0 means a bad rating and 1 a good, no
>> value as usual means no preference? Some of the math here won't work though
>> since usually no value implicitly = 0 so maybe -1 = bad, 1 = good, no
>> preference implicitly = 0?
>>
>> Would it be better to treat the bad rating as a 1 and good as 2? This
>> would be more like the old star rating method only we would know where the
>> cutoff should be between a good review and bad (1.5)
>>
>> I suppose this could also be treated as another recommender in an ensemble
>> where r = r_p - r_h, where r_h = predictions from "I hate this product"
>> preferences?
>>
>> Has anyone found a good method?


Re: Mahout compatibility with Hadoop

2013-06-17 Thread Sean Owen
Yes you have to refer to the 'mrv1' artifacts if I recall correctly,
if you use CDH4. You are talking about CDH3, which is different.

On Mon, Jun 17, 2013 at 3:23 PM,   wrote:
> Well, I just set up CDH4 with Mahout for testing a few days ago. It still
> required some fixing of the pom file to build mahout (provided from
> cloudera) and some of the tests failed (the only one I remember had
> something to do with ALS).
> Nevertheless, I just tried it and didn't really test it. CDH4 still comes in
> two flavors, which should be taken into account. One keeps using the "old"
> 0.2x version of Hadoop and the other one uses the alpha 2.xx versions.


Re: Mahout compatibility with Hadoop

2013-06-17 Thread Sean Owen
CDH3 has 0.5 + patches, CDH4 has 0.7 + patches.

I suppose the good news there is that recent versions must work OK
with Hadoop 2.x, since CDH4 = Hadoop 2. (I didn't suspect otherwise.)



On Mon, Jun 17, 2013 at 2:58 PM, Sebastian Schelter
 wrote:
> On 17.06.2013 15:56, Razon, Oren wrote:
>> Thanks Sebastian & Sean. I know Cloudera and other distributions until 
>> lately supported only Mahout 0.5 which made me suspect.
>> I will go and use Mahout 0.8 (assuming it should be officially released any 
>> day now, right?)
>
> I guess the Cloudera folks should do an upgrade, 0.5 is really an
> ancient release.
>
> We have two or three issues left for 0.8, then we'll have a code freeze
> and do testing before we release 0.8.
>
> -sebastian
>
>>
>>
>> -Original Message-
>> From: Sean Owen [mailto:sro...@gmail.com]
>> Sent: Monday, June 17, 2013 4:53 PM
>> To: Mahout User List
>> Subject: Re: Mahout compatibility with Hadoop
>>
>> Is it compatible with any Hadoop release? of course, would it make sense if 
>> not?
>> I'm not sure where you get this idea. 0.5 was, I think, compiled vs 0.20.x. 
>> The last release was vs 1.0.3 or so. The current release is vs 1.1.x. In all 
>> cases these are the latest stable Apache releases, so not sure what you are 
>> referring to.
>>
>> On Mon, Jun 17, 2013 at 2:41 PM, Razon, Oren  wrote:
>>> Hi,
>>> From what I saw so far it seems that only Mahout 0.5 (the Hadoop part) is 
>>> compatible with Hadoop latest releases, and that later Mahout releases are 
>>> not officially supported.
>>> I wonder if anyone knows if Mahout 0.7\0.6 is compatible with any Hadoop 
>>> release?
>>>
>>> Thanks,
>>> Oren


Re: Mahout compatibility with Hadoop

2013-06-17 Thread Sean Owen
Is it compatible with any Hadoop release? of course, would it make sense if not?
I'm not sure where you get this idea. 0.5 was, I think, compiled vs
0.20.x. The last release was vs 1.0.3 or so. The current release is vs
1.1.x. In all cases these are the latest stable Apache releases, so
not sure what you are referring to.

On Mon, Jun 17, 2013 at 2:41 PM, Razon, Oren  wrote:
> Hi,
> From what I saw so far it seems that only Mahout 0.5 (the Hadoop part) is 
> compatible with Hadoop latest releases, and that later Mahout releases are 
> not officially supported.
> I wonder if anyone knows if Mahout 0.7\0.6 is compatible with any Hadoop 
> release?
>
> Thanks,
> Oren


Re: Running Mahout recommendations on a Cassandra data set

2013-06-12 Thread Sean Owen
This is more of a Hadoop question. The input hides behind the
InputFormat implementation. If you have an InputFormat that can read
and produce the same key-value pairs that you'd get from a
SequenceFileInputFormat / TextInputFormat and HDFS, yes the rest just
works automatically. You do have to modify the code to plug it in, though.

On Wed, Jun 12, 2013 at 10:39 AM, duni...@gmail.com  wrote:
> Hi folks,
>
> I have a large user preference data set stored in a Cassandra node.
>
> I need to run a distributed (Hadoop-based) Mahout recommender job on that data
> set and  write the recommendations into MySQL database.
>
> 1. Will it be possible with Mahout? If possible, what are the
> configurations required?
> 2. Is org.apache.mahout.cf.taste.hadoop.item.RecommenderJob class limited
> only to read input dataset from HDFS? or is it pluggable?
>
> Thanks and regards,
> Dunith Dhanushka


Re: [DRAFT] 0.8 Release Announcement + Future Plans Discussion

2013-06-08 Thread Sean Owen
I agree with deprecating all of that FWIW.

On Sat, Jun 8, 2013 at 6:33 PM, Grant Ingersoll  wrote:
>> Collaborative Filtering:
>>
>> - all recommenders in o.a.m.cf.taste.impl.recommender.knn
>>
>> - the TreeClusteringRecommender in o.a.m.cf.taste.impl.recommender
>>
>> - the SlopeOne implementations in o.a.m..cf.taste.hadoop.slopeone and
>> o.a.m.cf.taste.impl.recommender.slopeone
>>
>> - the distributed pseudo recommender in o.a.m.cf.taste.hadoop.pseudo
>
> Pseudo is useful, no?  Don't know about the others.


Re: Social Network Link Prediction in Mahout

2013-06-08 Thread Sean Owen
Use an implementation that doesn't expect a rating. These are
so-called 'boolean' implementations, like GenericBooleanPrefDataModel.
For example you can build an item-based recommender with the boolean
version of the item-based recommender and a log-likelihood similarity.

Or, yes you can calculate some meaningful edge weight to add more info
to your model. Maybe the number of times the two users interacted? the
resulting number can be used as a 'rating' although I don't know if
you will get great results since it doesn't act a lot like a rating.
Instead, use the log of this number.

Or, use an algorithm that is comfortable with count-like input, like
ALS with the "implicit data" option turned on.
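
A sketch of the first option against the 0.x Taste API (edges.csv is a
hypothetical file of "follower,followed" lines):

import java.io.File;
import java.util.List;
import org.apache.mahout.cf.taste.impl.model.GenericBooleanPrefDataModel;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.recommender.GenericBooleanPrefItemBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.LogLikelihoodSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.similarity.ItemSimilarity;

public class LinkRecs {
  public static void main(String[] args) throws Exception {
    // edges.csv: one "follower,followed" line per edge, no preference value
    DataModel model = new GenericBooleanPrefDataModel(
        GenericBooleanPrefDataModel.toDataMap(new FileDataModel(new File("edges.csv"))));
    ItemSimilarity similarity = new LogLikelihoodSimilarity(model);
    GenericBooleanPrefItemBasedRecommender recommender =
        new GenericBooleanPrefItemBasedRecommender(model, similarity);
    List<RecommendedItem> recs = recommender.recommend(3432L, 10); // top 10 for node 3432
    System.out.println(recs);
  }
}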

Sean

On Sat, Jun 8, 2013 at 2:15 PM, Peter Holland  wrote:
> Hi All,
> I am trying to use Mahout for Link Prediction in a Social Network.
>
> The data I have is an edges list with 9.4 million rows. The edge list is a
> csv file where each node is an integer value and a row represents an edge
> between two nodes. For example;
>
> 3432, 5098
> 3423, 6710
> 4490, 5843
> 4490, 2039
> .
>
> This is a directed graph so row 1 means that node 3432 follows node 5098.
>
> I would like to build a recommender to calculate the top 10 nodes a user
> might like to connect to next. The problem I have is that the recommender
> classes needs input in the form (user, item, value).  So, how can I first
> calculate a value to represent the 'weight' of an edge? For example
> EdgeRank?
>
> Any help would be greatly appreciated.
> Thank you,
> Peter


Re: evaluating recommender with boolean prefs

2013-06-07 Thread Sean Owen
I believe the suggestion is just for purposes of evaluation. You would
not return these items in practice, yes.

Although there are cases where you do want to return known items. For
example, maybe you are modeling user interaction with restaurant
categories. This could be useful, because as soon as you see I
interact with "Chinese" and "Indian" you may recommend "Thai"; it
might even be a stronger recommendation than the two known categories.
But I may not want to actually exclude Chinese and Indian from the
list entirely.

On Fri, Jun 7, 2013 at 10:36 PM,   wrote:
> But why would she want the things she has?


Re: evaluating recommender with boolean prefs

2013-06-07 Thread Sean Owen
Yes, it makes sense in the case of, for example, ALS.
With or without this idea, the more general point is that this result
is still problematic. It is somewhat useful in comparing in a relative
sense; I'd rather have a recommender that stacks my input values
somewhere near the top than bottom. But metrics like precision@5 get
hard to interpret -- because they are often near 0 even when things
are working reasonably well. Mean average precision considers the
results in a more complete sense, as would AUC.

On Fri, Jun 7, 2013 at 10:04 PM, Koobas  wrote:
> On Fri, Jun 7, 2013 at 4:50 PM, Sean Owen  wrote:
>
>> It depends on the algorithm I suppose. In some cases, the
>> already-known items would always be top recommendations and the test
>> would tell you nothing. Just like in an RMSE test -- if you already
>> know the right answers your score is always a perfect 0.
>>
>> It's very much to the point.
> ALS works by constructing a low-rank approximation of the original matrix.
> We check how good that approximation is by comparing it against the
> original.
>
> I see an analogy here, in the case of kNN.
> The suggestions are a model of your interests, in a sense can be used to
> reconstruct
> your original set.
>
>
>> But in some cases I agree you could get some of use out of observing
>> where the algorithm ranks known associations, because they won't in
>> some cases all be the very first ones.
>>
>> it raises an interesting question: if the top recommendation wasn't an
>> already known association, how do we know it's "wrong"? We don't. You
>> rate Star Trek, Star Trek V, and Star Trek IV. Say Star Trek II is
>> your top recommendation. That's actually probably right, and should be
>> ranked higher than all your observed associations. (It's a good
>> movie.) But the test would consider it wrong. In fact anything that
>> you haven't interacted with before is "wrong".
>>
>> You can look at it from the other side.
> It's not about the ones that are not in your original set.
> It's about how good the recommender is in putting back the original, if
> they were removed.
> Except we would not actually be removing them.
> It's the same approach, simply without splitting the input into the
> training set and the validation set.
> In a sense the whole set is the training set and the validation set.
> Again, I am not coming from the ML background.
> Am I making sense here?
>
>
>> This sort of explains why precision/recall can be really low in these
>> tests. I would not be surprised if you get 0 in some cases, on maybe
>> small input. Is it a bad predictor? maybe, but it's not clear.
>>
>>
>>
>> On Fri, Jun 7, 2013 at 8:06 PM, Koobas  wrote:
>> > Since I am primarily an HPC person, probably a naive question from the ML
>> > perspective.
>> > What if, when computing recommendations, we don't exclude what the user
>> > already has,
>> > and then see if the items he has end up being recommended to him (compute
>> > some appropriate metric / ratio)?
>> > Wouldn't that be the ultimate evaluator?
>> >
>> >
>> > On Fri, Jun 7, 2013 at 2:58 PM, Sean Owen  wrote:
>> >
>> >> In point 1, I don't think I'd say it that way. It's not true that
>> >> test/training is divided by user, because every user would either be
>> >> 100% in the training or 100% in the test data. Instead you hold out
>> >> part of the data for each user, or at least, for some subset of users.
>> >> Then you can see whether recs for those users match the held out data.
>> >>
>> >> Yes then you see how the held-out set matches the predictions by
>> >> computing ratios that give you precision/recall.
>> >>
>> >> The key question is really how you choose the test data. It's implicit
>> >> data; one is as good as the next. In the framework I think it just
>> >> randomly picks a subset of the data. You could also split by time;
>> >> that's a defensible way to do it. Training data up to time t and test
>> >> data after time t.
>> >>
>> >> On Fri, Jun 7, 2013 at 7:51 PM, Michael Sokolov
>> >>  wrote:
>> >> > I'm trying to evaluate a few different recommenders based on boolean
> >> > preferences.  The in action book suggests using a precision/recall
>> >> metric,
>> >> > but I'm not sure I understand what that does, and in particular how
>> it is
>>

Re: evaluating recommender with boolean prefs

2013-06-07 Thread Sean Owen
It depends on the algorithm I suppose. In some cases, the
already-known items would always be top recommendations and the test
would tell you nothing. Just like in an RMSE test -- if you already
know the right answers your score is always a perfect 0.

But in some cases I agree you could get some of use out of observing
where the algorithm ranks known associations, because they won't in
some cases all be the very first ones.

it raises an interesting question: if the top recommendation wasn't an
already known association, how do we know it's "wrong"? We don't. You
rate Star Trek, Star Trek V, and Star Trek IV. Say Star Trek II is
your top recommendation. That's actually probably right, and should be
ranked higher than all your observed associations. (It's a good
movie.) But the test would consider it wrong. In fact anything that
you haven't interacted with before is "wrong".

This sort of explains why precision/recall can be really low in these
tests. I would not be surprised if you get 0 in some cases, on maybe
small input. Is it a bad predictor? maybe, but it's not clear.



On Fri, Jun 7, 2013 at 8:06 PM, Koobas  wrote:
> Since I am primarily an HPC person, probably a naive question from the ML
> perspective.
> What if, when computing recommendations, we don't exclude what the user
> already has,
> and then see if the items he has end up being recommended to him (compute
> some appropriate metric / ratio)?
> Wouldn't that be the ultimate evaluator?
>
>
> On Fri, Jun 7, 2013 at 2:58 PM, Sean Owen  wrote:
>
>> In point 1, I don't think I'd say it that way. It's not true that
>> test/training is divided by user, because every user would either be
>> 100% in the training or 100% in the test data. Instead you hold out
>> part of the data for each user, or at least, for some subset of users.
>> Then you can see whether recs for those users match the held out data.
>>
>> Yes then you see how the held-out set matches the predictions by
>> computing ratios that give you precision/recall.
>>
>> The key question is really how you choose the test data. It's implicit
>> data; one is as good as the next. In the framework I think it just
>> randomly picks a subset of the data. You could also split by time;
>> that's a defensible way to do it. Training data up to time t and test
>> data after time t.
>>
>> On Fri, Jun 7, 2013 at 7:51 PM, Michael Sokolov
>>  wrote:
>> > I'm trying to evaluate a few different recommenders based on boolean
>> > preferences.  The in action book suggests using a precision/recall
>> metric,
>> > but I'm not sure I understand what that does, and in particular how it is
>> > dividing my data into test/train sets.
>> >
>> > What I think I'd like to do is:
>> >
>> > 1. Divide the test data by user: identify a set of training data with
>> data
>> > from 80% of the users, and test using the remaining 20% (say).
>> >
>> > 2. Build a similarity model from the training data
>> >
>> > 3. For the test users, divide their data in half; a "training" set and an
>> > evaluation set.  Then for each test user, use their training data as
>> input
>> > to the recommender, and see if it recommends the data in the evaluation
>> set
>> > or not.
>> >
>> > Is this what the precision/recall test is actually doing?
>> >
>> > --
>> > Michael Sokolov
>> > Senior Architect
>> > Safari Books Online
>> >
>>


Re: evaluating recommender with boolean prefs

2013-06-07 Thread Sean Owen
In point 1, I don't think I'd say it that way. It's not true that
test/training is divided by user, because every user would either be
100% in the training or 100% in the test data. Instead you hold out
part of the data for each user, or at least, for some subset of users.
Then you can see whether recs for those users match the held out data.

Yes then you see how the held-out set matches the predictions by
computing ratios that give you precision/recall.

The key question is really how you choose the test data. It's implicit
data; one is as good as the next. In the framework I think it just
randomly picks a subset of the data. You could also split by time;
that's a defensible way to do it. Training data up to time t and test
data after time t.
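
For the random-holdout case, a sketch against the Taste evaluator API:

import org.apache.mahout.cf.taste.common.TasteException;
import org.apache.mahout.cf.taste.eval.IRStatistics;
import org.apache.mahout.cf.taste.eval.RecommenderBuilder;
import org.apache.mahout.cf.taste.eval.RecommenderIRStatsEvaluator;
import org.apache.mahout.cf.taste.impl.eval.GenericRecommenderIRStatsEvaluator;
import org.apache.mahout.cf.taste.model.DataModel;

final class EvalSketch {
  // precision/recall "at 10", per-user relevant items chosen by the default
  // threshold, evaluating on 20% of users.
  static void run(RecommenderBuilder builder, DataModel model) throws TasteException {
    RecommenderIRStatsEvaluator evaluator = new GenericRecommenderIRStatsEvaluator();
    IRStatistics stats = evaluator.evaluate(
        builder, null, model, null, 10,
        GenericRecommenderIRStatsEvaluator.CHOOSE_THRESHOLD,
        0.2);
    System.out.println(stats.getPrecision() + " / " + stats.getRecall());
  }
}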

On Fri, Jun 7, 2013 at 7:51 PM, Michael Sokolov
 wrote:
> I'm trying to evaluate a few different recommenders based on boolean
> preferences.  The in action book suggests using a precision/recall metric,
> but I'm not sure I understand what that does, and in particular how it is
> dividing my data into test/train sets.
>
> What I think I'd like to do is:
>
> 1. Divide the test data by user: identify a set of training data with data
> from 80% of the users, and test using the remaining 20% (say).
>
> 2. Build a similarity model from the training data
>
> 3. For the test users, divide their data in half; a "training" set and an
> evaluation set.  Then for each test user, use their training data as input
> to the recommender, and see if it recommends the data in the evaluation set
> or not.
>
> Is this what the precision/recall test is actually doing?
>
> --
> Michael Sokolov
> Senior Architect
> Safari Books Online
>


Re: Database connection pooling for a recommendation engine

2013-06-05 Thread Sean Owen
Not sure, is this really related to Mahout?

I don't know of an equivalent of J2EE / Tomcat for C++, but there must
be something.

As a general principle, you will have to load your data into memory if
you want to perform the computations on the fly in real time. So how
you access the data isn't so important, just because you will be
reading it all at once.

On Wed, Jun 5, 2013 at 12:44 PM, Mike W.  wrote:
> Hello,
>
> I am considering to implement a recommendation engine for a small size
> website. The website will employ LAMP stack, and for some reasons the
> recommendation engine must be written in C++. It consists of an On-line
> Component and Off-line Component, both need to connect to MySQL. The
> difference is that On-line Component will need a connection pool, whereas
> several persistent connections or even connect as required would be
> sufficient for the Off-line Component, since it does not require real time
> performance in a concurrent requests scenario as in On-line Component.
>
> On-line Component is to be wrapped as a web service via Apache AXIS2. The
> PHP frontend app on Apache http server retrieves recommendation data from
> this web service module.
>
> There are two DB connection options for On-line Component I can think of:
> 1. Use ODBC connection pool, I think unixODBC might be a candidate. 2. Use
> connection pool APIs that come as a part of Apache HTTP server. mod_dbd
> would be a choice: http://httpd.apache.org/docs/2.2/mod/mod_dbd.html
>
> As for Off-line Component, a simple DB connection option is direct
> connection using ODBC.
>
> Due to lack of web app design experience, I have the following questions:
>
> Option 1 for On-line Component is a tightly coupled design without taking
> advantage of pooling APIs in Apache HTTP server. But if I choose Option 2
> (3-tiered architecture), as a standalone component apart from Apache HTTP
> server, how to use its connection pool APIs?
>
> A Java application can be deployed as a WAR file and contained in a servlet
> container such as Tomcat (see Mahout in Action, section 5.5), or it can
> use org.apache.mahout.cf.taste.impl.model.jdbc.ConnectionPoolDataSource
> (
> https://cwiki.apache.org/confluence/display/MAHOUT/Recommender+Documentation).
> Is there any similar approach for my C++ recommendation engine?
>
> I am not sure if I made a proper prototype. Any suggestions will be
> appreciated:)
>
> Thanks,
>
> Mike


Re: IRStats Evaluation for Recommender Systems

2013-05-30 Thread Sean Owen
There's nothing direct, but you can probably save yourself time by copying
the code that computes these stats and apply them to your pre-computed
values. It's not terribly complex, just counting the intersection and union
size and deriving some stats from it.

The split is actually based on value -- higher-valued items are held out.
This has problems, but different ones from holding out newest data (i.e.
what if new ratings are all very negative? you don't want to have
recommended them). But both are sensible. There is a hook
called RelevantItemsDataSplitter that lets you define how the split is
computed though.
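
The copied logic is small. A sketch, assuming both files have already been
parsed into userID -> item-set maps:

import java.util.Map;
import java.util.Set;

final class FileIRStats {
  // recs: userID -> recommended itemIDs; relevant: userID -> held-out test itemIDs.
  // Precision = hits / #recommended, recall = hits / #relevant, averaged per user.
  static void printStats(Map<Long, Set<Long>> recs, Map<Long, Set<Long>> relevant) {
    double precisionSum = 0.0;
    double recallSum = 0.0;
    int users = 0;
    for (Map.Entry<Long, Set<Long>> e : recs.entrySet()) {
      Set<Long> rel = relevant.get(e.getKey());
      if (rel == null || rel.isEmpty() || e.getValue().isEmpty()) {
        continue; // nothing to evaluate against for this user
      }
      int hits = 0;
      for (Long itemID : e.getValue()) {
        if (rel.contains(itemID)) {
          hits++;
        }
      }
      precisionSum += (double) hits / e.getValue().size();
      recallSum += (double) hits / rel.size();
      users++;
    }
    if (users > 0) {
      System.out.println("precision: " + (precisionSum / users)
          + ", recall: " + (recallSum / users));
    }
  }
}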

Sean


On Thu, May 30, 2013 at 1:14 PM, Parimi Rohit wrote:

> Hi All,
>
> Is there a way to compute precision and recall values given a file of
> recommendations and a test file of user preferences.
>
> I know there is "GenericRecommenderIRStatsEvaluator" in Mahout to compute
> the IR Stats but it takes a "RecommenderBuilder" object among others as
> parameters to build a recommender and compute these metrics. However, if I
> already have a file of recommendations and a test file of preferences, I
> will not be able to use this class.
>
> Another use case is when my data is temporal, i.e. I use past data for about
> a month to train my model and test the recommendations using 1 week future
> data (backtesting framework). I will not be able to use the above class as
> it splits the data randomly (I may be wrong in this case).
>
> To Summarize, I would like to compute the IR stats for a file of
> recommendations and a test file of user preferences and would like to know
> if this can be done using some class in Mahout.
>
> Any help is much appreciated.
>
> Thanks,
> Rohit
>


Re: mahout ssvd tuning problem

2013-05-22 Thread Sean Owen
I looked, and this job already uses a combiner called OuterProductCombiner.
In fact it was right there in the stack trace, oops.  At least, it shows
this is happening in the mapper and the combiner is trying to do its job.

I am still pretty sure both io.sort.* parameters are relevant here.

Anyway I found what I was thinking of, yes this appears to be a known bug
which is about to be fixed:

https://issues.apache.org/jira/browse/MAPREDUCE-5028
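
For reference, this is what tuning those two knobs looks like if you launch
the job from Java rather than the CLI (the values are only examples, not
recommendations):

import org.apache.hadoop.conf.Configuration;

public class SortTuning {
  public static Configuration tunedConf() {
    Configuration conf = new Configuration();
    conf.setInt("io.sort.mb", 256);     // bigger map-side sort buffer -> fewer spills
    conf.setInt("io.sort.factor", 100); // merge more spill segments per pass
    return conf;
  }
}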


On Wed, May 22, 2013 at 6:35 PM, Dmitriy Lyubimov  wrote:

> I am actually not sure how to manipulate the use of combiners in Hadoop. All
> I can say is that the code does make extensive use of combiners, but they
> were always "on" for me. I had no idea one might turn their use off.
>
>
> On Wed, May 22, 2013 at 6:17 AM, Jakub Pawłowski
> wrote:
>
> > Yes, I was manipulating io.sort.factor too, it speeds up the reducer; values
> > around 30 give good results for me.
> > But my problem is not the reducer, my problem is the Bt-job map task that
> > spills to disk.
> >
> > You mentioned the Combiner, how can I turn it on? I'm running my job from
> > the console like this:
> >
> > mahout ssvd --rank 400 --computeU true --computeV true --reduceTasks 3
> >  --input ${INPUT} --output ${OUTPUT} -ow --tempDir /tmp/ssvdtmp/
> >
> > document at
> > https://cwiki.apache.org/MAHOUT/stochastic-singular-value-decomposition.data/SSVD-CLI.pdf
> > doesn't mention anything about combiner.
> >
> > Thanks for your answer.
> >
> >
> >
> > W dniu 22.05.2013 14:59, Sean Owen pisze:
> >
> >  I feel like I've seen this too and it's just a bug. You're not running
> >> out of memory.
> >>
> >> Are you also setting io.sort.factor? that can help too. You might try
> >> as high as 100.
> >>
> >> Also have you tried a Combiner? if you can apply it it should help too
> >> as it is designed to reduce the amount of stuff spilled.
> >>
> >>
> >
>


Re: mahout ssvd tuning problem

2013-05-22 Thread Sean Owen
I mean you would have to write one and modify the code to use it. I
don't know this job well enough to know whether it's possible or not
though. At least, this is getting directly at reducing the amount of
data spilled, rather than reducing the intermediate I/O needed to sort
it.

Doesn't io.sort.* also affect the mapper? I was sure it did. Maybe it
only matters when a combiner is in play on the mapper side.

On Wed, May 22, 2013 at 2:17 PM, Jakub Pawłowski
 wrote:
> Yes, I was manipulating io.sort.factor too, it speeds up the reducer; values
> around 30 give good results for me.
> But my problem is not the reducer, my problem is the Bt-job map task that
> spills to disk.
>
> You mentioned the Combiner, how can I turn it on? I'm running my job from
> the console like this:
>
> mahout ssvd --rank 400 --computeU true --computeV true --reduceTasks 3
> --input ${INPUT} --output ${OUTPUT} -ow --tempDir /tmp/ssvdtmp/
>
> document at
> https://cwiki.apache.org/MAHOUT/stochastic-singular-value-decomposition.data/SSVD-CLI.pdf
> doesn't mention anything about combiner.
>
> Thanks for your answer.


Re: mahout ssvd tuning problem

2013-05-22 Thread Sean Owen
I feel like I've seen this too and it's just a bug. You're not running
out of memory.

Are you also setting io.sort.factor? that can help too. You might try
as high as 100.

Also have you tried a Combiner? if you can apply it it should help too
as it is designed to reduce the amount of stuff spilled.


Re: Which database should I use with Mahout

2013-05-19 Thread Sean Owen
(I had in mind the non-distributed parts of Mahout, but the principle is
similar, yes.)
On May 19, 2013 6:27 PM, "Pat Ferrel"  wrote:

> Using a Hadoop version of a Mahout recommender will create some number of
> recs for all users as its output. Sean is talking about Myrrix I think
> which uses factorization to get much smaller models and so can calculate
> the recs at runtime for fairly large user sets.
>
> However if you are using Mahout and Hadoop the question is how to store
> and lookup recommendations in the quickest scalable way. You will have a
> user ID and perhaps an item ID as a key to the list of recommendations. The
> fastest thing to do is have a hashmap in memory, perhaps read in from HDFS.
> Remember that Mahout will output the recommendations with internal Mahout
> IDs so you will have to replace these in the data with your actual user and
> item ids.
>
> I use a NoSQL DB, either MongoDB or Cassandra but others are fine too,
> even MySQL if you can scale it to meet your needs. I end up with two
> tables, one has my user ID as a key and recommendations with my item IDs
> either ordered or with strengths. The second table has my item ID as the
> key with a list of similar items (again sorted or with strengths). At
> runtime I may have both a user ID and an item ID context so I get a list
> from both tables and combine them at runtime.
>
> I use a DB for many reasons and let it handle the caching. I never need to
> worry about memory management. If you have scaled your DB properly the
> lookups will actually be executed like an in-memory hashmap with indexed
> keys for ids. Scaling the DB can be done as your user base grows when
> needed without affecting the rest of the calculation pipeline. Yes there
> will be overhead due to network traffic in a cluster but the flexibility is
> worth it for me. If high availability is important you can spread out your
> db cluster over multiple data centers without affecting the API for serving
> recommendations. I set up the recommendation calculation to run
> continuously in the background, replacing values in the two tables as fast
> as I can. This allows you to scale update speed (how many machines in the
> mahout/hadoop cluster) independently from lookup performance scaling (how
> many machines in your db cluster, how much memory the db machines have).
>
> On May 19, 2013, at 11:45 AM, Manuel Blechschmidt <
> manuel.blechschm...@gmx.de> wrote:
>
> Hi Tevfik,
> I am working with mysql but I would guess that HDFS like Sean suggested
> would be a good idea as well.
>
> There is also a project called sqoop which can be used to transfer data
> from relation databases to Hadoop.
>
> http://sqoop.apache.org/
>
> Scribe might be also an option for transferring a lot of data:
> https://github.com/facebook/scribe#readme
>
> I would suggest that you just start with the technology that you know best
> and then solve the problems as soon as you get them.
>
> /Manuel
>
> Am 19.05.2013 um 20:26 schrieb Sean Owen:
>
> > I think everyone is agreeing that it is essential to only access
> > information in memory at run-time, yes, whatever that info may be.
> > I don't think the original question was about Hadoop, but, the answer
> > is the same: Hadoop mappers are just reading the input serially. There
> > is no advantage to a relational database or NoSQL database; they're
> > just overkill. HDFS is sufficient, and probably even best of these at
> > allowing fast serial access to the data.
> >
> > On Sun, May 19, 2013 at 11:19 AM, Tevfik Aytekin
> >  wrote:
> >> Hi Manuel,
> >> But if one uses matrix factorization and stores the user and item
> >> factors in memory then there will be no database access during
> >> recommendation.
> >> I thought that the original question was where to store the data and
> >> how to give it to hadoop.
> >>
> >> On Sun, May 19, 2013 at 9:01 PM, Manuel Blechschmidt
> >>  wrote:
> >>> Hi Tevfik,
>>> one request to the recommender could become more than 1000 queries to
>>> the database depending on which recommender you use and the amount of
>>> preferences for the given user.
>>>
>>> The problem is not whether you are using SQL, NoSQL, or any other query
>>> language. The problem is the latency of the answers.
>>>
>>> An average TCP packet in the same data center takes 500 µs. A main
>>> memory reference 0.1 µs. This means that the main memory of your Java
>>> process can be accessed 5000 times faster than any other process like a
>>> database connected via TCP/IP.
>>>
>>> http://www.eecs.berkeley.edu/~rcs/research/interactive_latency.html

Re: Which database should I use with Mahout

2013-05-19 Thread Sean Owen
(Oh, by the way, I realize the original question was about Hadoop. I
can't read carefully.)

No, HDFS is not good for anything like random access. For input,
that's OK, because you don't need random access. So HDFS is just fine.
For output, if you are going to then serve these precomputed results
at run-time, they need to be in a container appropriate for quick
random access. There, a NoSQL store like HBase or something does sound
appropriate. You can create an output format that writes directly into
it, with a little work.

The drawbacks to this approach -- computing results in Hadoop -- is
that they are inevitably a bit stale, not real-time, and you have to
compute results for everyone, even though very few of those results
will be used. Of course, serving is easy and fast. There are hybrid
solutions that I can talk to you about offline that get a bit of the
best of both worlds.
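
To give a flavor of that "little work", a rough sketch with HBase's stock
MapReduce integration (assuming HBase; the table and job names are made up):

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.mapreduce.TableOutputFormat;
import org.apache.hadoop.mapreduce.Job;

public class RecsToHBase {
  public static Job configure() throws Exception {
    Job job = Job.getInstance(HBaseConfiguration.create(), "recs-to-hbase");
    // Reducers then emit (ImmutableBytesWritable, Put) pairs, which land
    // directly in this table, keyed for fast random reads at serving time.
    job.getConfiguration().set(TableOutputFormat.OUTPUT_TABLE, "recommendations");
    job.setOutputFormatClass(TableOutputFormat.class);
    return job;
  }
}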


On Sun, May 19, 2013 at 11:37 AM, Ahmet Ylmaz
 wrote:
> Hi Sean,
> If I understood you correctly you are saying that I will not need mysql. But 
> if I store my data on HDFS will I be make fast queries such as
> "Return all the ratings of a specific user"
> which will be needed for showing the past ratings of a user.
>
> Ahmet
>
>
> 
>  From: Sean Owen 
> To: Mahout User List 
> Sent: Sunday, May 19, 2013 9:26 PM
> Subject: Re: Which database should I use with Mahout
>
>
> I think everyone is agreeing that it is essential to only access
> information in memory at run-time, yes, whatever that info may be.
> I don't think the original question was about Hadoop, but, the answer
> is the same: Hadoop mappers are just reading the input serially. There
> is no advantage to a relational database or NoSQL database; they're
> just overkill. HDFS is sufficient, and probably even best of these at
> allowing fast serial access to the data.
>
> On Sun, May 19, 2013 at 11:19 AM, Tevfik Aytekin
>  wrote:
>> Hi Manuel,
>> But if one uses matrix factorization and stores the user and item
>> factors in memory then there will be no database access during
>> recommendation.
>> I thought that the original question was where to store the data and
>> how to give it to hadoop.
>>
>> On Sun, May 19, 2013 at 9:01 PM, Manuel Blechschmidt
>>  wrote:
>>> Hi Tevfik,
>>> one request to the recommender could become more than 1000 queries to the
>>> database depending on which recommender you use and the amount of
>>> preferences for the given user.
>>>
>>> The problem is not whether you are using SQL, NoSQL, or any other query
>>> language. The problem is the latency of the answers.
>>>
>>> An average TCP packet in the same data center takes 500 µs. A main memory
>>> reference 0.1 µs. This means that the main memory of your Java process can
>>> be accessed 5000 times faster than any other process like a database
>>> connected via TCP/IP.
>>>
>>> http://www.eecs.berkeley.edu/~rcs/research/interactive_latency.html
>>>
>>> Here you can see a screenshot that shows that database communication is by 
>>> far (99%) the slowest component of a recommender request:
>>>
>>> https://source.apaxo.de/MahoutDatabaseLowPerformance.png
>>>
>>> If you do not want to cache your data in your Java process you can use a 
>>> complete in memory database technology like SAP HANA 
>>> http://www.saphana.com/welcome or EXASOL http://www.exasol.com/
>>>
>>> Nevertheless if you are using these you do not need Mahout anymore.
>>>
>>> An architecture of a Mahout system can be seen here:
>>> https://github.com/ManuelB/facebook-recommender-demo/blob/master/docs/RecommenderArchitecture.png
>>>
>>> Hope that helps
>>> Manuel
>>>
>>> Am 19.05.2013 um 19:20 schrieb Sean Owen:
>>>
>>>> I'm first saying that you really don't want to use the database as a
>>>> data model directly. It is far too slow.
>>>> Instead you want to use a data model implementation that reads all of
>>>> the data, once, serially, into memory. And in that case, it makes no
>>>> difference where the data is being read from, because it is read just
>>>> once, serially. A file is just as fine as a fancy database. In fact
>>>> it's probably easier and faster.
>>>>
>>>> On Sun, May 19, 2013 at 10:14 AM, Tevfik Aytekin
>>>>  wrote:
>>>>> Thanks Sean, but I could not get your answer. Can you please explain it 
>>>>> again?
>>>>>

Re: Which database should I use with Mahout

2013-05-19 Thread Sean Owen
I think everyone is agreeing that it is essential to only access
information in memory at run-time, yes, whatever that info may be.
I don't think the original question was about Hadoop, but, the answer
is the same: Hadoop mappers are just reading the input serially. There
is no advantage to a relational database or NoSQL database; they're
just overkill. HDFS is sufficient, and probably even best of these at
allowing fast serial access to the data.

On Sun, May 19, 2013 at 11:19 AM, Tevfik Aytekin
 wrote:
> Hi Manuel,
> But if one uses matrix factorization and stores the user and item
> factors in memory then there will be no database access during
> recommendation.
> I thought that the original question was where to store the data and
> how to give it to hadoop.
>
> On Sun, May 19, 2013 at 9:01 PM, Manuel Blechschmidt
>  wrote:
>> Hi Tevfik,
>> one request to the recommender could become more than 1000 queries to the
>> database depending on which recommender you use and the amount of
>> preferences for the given user.
>>
>> The problem is not whether you are using SQL, NoSQL, or any other query
>> language. The problem is the latency of the answers.
>>
>> An average TCP packet in the same data center takes 500 µs. A main memory
>> reference 0.1 µs. This means that the main memory of your Java process can
>> be accessed 5000 times faster than any other process like a database
>> connected via TCP/IP.
>>
>> http://www.eecs.berkeley.edu/~rcs/research/interactive_latency.html
>>
>> Here you can see a screenshot that shows that database communication is by 
>> far (99%) the slowest component of a recommender request:
>>
>> https://source.apaxo.de/MahoutDatabaseLowPerformance.png
>>
>> If you do not want to cache your data in your Java process you can use a 
>> complete in memory database technology like SAP HANA 
>> http://www.saphana.com/welcome or EXASOL http://www.exasol.com/
>>
>> Nevertheless if you are using these you do not need Mahout anymore.
>>
>> An architecture of a Mahout system can be seen here:
>> https://github.com/ManuelB/facebook-recommender-demo/blob/master/docs/RecommenderArchitecture.png
>>
>> Hope that helps
>> Manuel
>>
>> Am 19.05.2013 um 19:20 schrieb Sean Owen:
>>
>>> I'm first saying that you really don't want to use the database as a
>>> data model directly. It is far too slow.
>>> Instead you want to use a data model implementation that reads all of
>>> the data, once, serially, into memory. And in that case, it makes no
>>> difference where the data is being read from, because it is read just
>>> once, serially. A file is just as fine as a fancy database. In fact
>>> it's probably easier and faster.
>>>
>>> On Sun, May 19, 2013 at 10:14 AM, Tevfik Aytekin
>>>  wrote:
>>>> Thanks Sean, but I could not get your answer. Can you please explain it 
>>>> again?
>>>>
>>>>
>>>> On Sun, May 19, 2013 at 8:00 PM, Sean Owen  wrote:
>>>>> It doesn't matter, in the sense that it is never going to be fast
>>>>> enough for real-time at any reasonable scale if actually run off a
>>>>> database directly. One operation results in thousands of queries. It's
>>>>> going to read data into memory anyway and cache it there. So, whatever
>>>>> is easiest for you. The simplest solution is a file.
>>>>>
>>>>> On Sun, May 19, 2013 at 9:52 AM, Ahmet Ylmaz
>>>>>  wrote:
>>>>>> Hi,
>>>>>> I would like to use Mahout to make recommendations on my web site. Since 
>>>>>> the data is going to be big, hopefully, I plan to use hadoop 
>>>>>> implementations of the recommender algorithms.
>>>>>>
>>>>>> I'm currently storing the data in mysql. Should I continue with it or 
>>>>>> should I switch to a nosql database such as mongodb or something else?
>>>>>>
>>>>>> Thanks
>>>>>> Ahmet
>>
>> --
>> Manuel Blechschmidt
>> M.Sc. IT Systems Engineering
>> Dortustr. 57
>> 14467 Potsdam
>> Mobil: 0173/6322621
>> Twitter: http://twitter.com/Manuel_B
>>


Re: Which database should I use with Mahout

2013-05-19 Thread Sean Owen
I'm first saying that you really don't want to use the database as a
data model directly. It is far too slow.
Instead you want to use a data model implementation that reads all of
the data, once, serially, into memory. And in that case, it makes no
difference where the data is being read from, because it is read just
once, serially. A file is just as fine as a fancy database. In fact
it's probably easier and faster.

On Sun, May 19, 2013 at 10:14 AM, Tevfik Aytekin
 wrote:
> Thanks Sean, but I could not get your answer. Can you please explain it again?
>
>
> On Sun, May 19, 2013 at 8:00 PM, Sean Owen  wrote:
>> It doesn't matter, in the sense that it is never going to be fast
>> enough for real-time at any reasonable scale if actually run off a
>> database directly. One operation results in thousands of queries. It's
>> going to read data into memory anyway and cache it there. So, whatever
>> is easiest for you. The simplest solution is a file.
>>
>> On Sun, May 19, 2013 at 9:52 AM, Ahmet Ylmaz
>>  wrote:
>>> Hi,
>>> I would like to use Mahout to make recommendations on my web site. Since 
>>> the data is going to be big, hopefully, I plan to use hadoop 
>>> implementations of the recommender algorithms.
>>>
>>> I'm currently storing the data in mysql. Should I continue with it or 
>>> should I switch to a nosql database such as mongodb or something else?
>>>
>>> Thanks
>>> Ahmet


Re: Which database should I use with Mahout

2013-05-19 Thread Sean Owen
It doesn't matter, in the sense that it is never going to be fast
enough for real-time at any reasonable scale if actually run off a
database directly. One operation results in thousands of queries. It's
going to read data into memory anyway and cache it there. So, whatever
is easiest for you. The simplest solution is a file.

On Sun, May 19, 2013 at 9:52 AM, Ahmet Ylmaz
 wrote:
> Hi,
> I would like to use Mahout to make recommendations on my web site. Since the 
> data is going to be big, hopefully, I plan to use hadoop implementations of 
> the recommender algorithms.
>
> I'm currently storing the data in mysql. Should I continue with it or should 
> I switch to a nosql database such as mongodb or something else?
>
> Thanks
> Ahmet


Re: How to extend FileDataModel

2013-05-15 Thread Sean Owen
Why not? It's just the object reference that is local to the function.
The Map itself is not; it lives on the heap like everything else in the
JVM.
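
A minimal illustration of the point (generic Java, nothing Mahout-specific):

import java.util.HashMap;
import java.util.Map;

public class HeapReferenceDemo {
  // 'data' is a copy of the caller's reference, but it points at the
  // same Map object on the heap, so puts here are visible to the caller.
  static void processFile(Map<String, Integer> data) {
    data.put("seen", 1);
  }

  public static void main(String[] args) {
    Map<String, Integer> data = new HashMap<>();
    processFile(data);
    System.out.println(data); // prints {seen=1}
  }
}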

On Thu, May 16, 2013 at 2:19 AM, huangjia  wrote:
> Hi,
>
> I want to build a recommendation model based on Mahout. My dataset
> is in the format of
>
> userID, itemID, rating timestamp tag1 tag2 tag3. Thus, I think I need to
> extend the FileDataModel.
>
> I looked into *JesterDataModel* as an example. However, I have a problem
> with the logic flow. In its *buildModel()* method, an empty map "data" is
> first constructed. It is then thrown into processFile. I assume that "data"
> is modified in this method, since later it is used to construct the
> GenericDataModel. However, "data" is a local variable instead of a class
> variable, so how is it modified?
>
> processFile(iterator, data, timestamps, false);
> return new GenericDataModel(GenericDataModel.toDataMap(data, true));
>
>
> --
> Jia


Re: How to execute RecommenderJob without preference value

2013-05-11 Thread Sean Owen
You can't have a blank line, if that's what you mean, yes. That's not
a valid record. A terminal newline is fine.
But the error seems to be something else:

java.io.FileNotFoundException: File does not exist:
/user/hadoop/temp/preparePreferenceMatrix/numUsers.bin
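
For reference, with --booleanData the input is just one "userID,itemID"
pair per line and nothing else -- no preference column, no blank lines.
For example (the IDs are made up):

123,101
123,102
456,101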


Re: ALSWR MovieLens 100k

2013-05-09 Thread Sean Owen
Yes, you overfit the training data set, so you "under-fit" the test
set. I'm trying to suggest why more degrees of freedom (features)
makes for a "worse" fit. It doesn't, on the training set, but those
same parameters may fit the test set increasingly badly.

It doesn't make sense to evaluate on a training set.

On Thu, May 9, 2013 at 3:21 PM, Gabor Bernat  wrote:
> Yes, but overfitting is for the train dataset, isn't it? However, now I'm
> evaluating on a test dataset (which is sampled from the whole dataset, but
> that still makes it test), so I don't really understand how overfitting can
> become an issue. :-?
>
> Is there any class/function to make the evaluation on the train dataset
> instead?


Re: ALSWR MovieLens 100k

2013-05-09 Thread Sean Owen
OK I keep thinking ALS-WR = weighted terms / implicit feedback but
that's not the case here it seems.
Well scratch that part, but I think the answer is still overfitting.

On Thu, May 9, 2013 at 2:45 PM, Gabor Bernat  wrote:
> I've used the constructor without that argument (or alpha). So I suppose
> those take the default value, which I suppose is an explicit model, am I
> right?
>
> Thanks,
>
> Bernát GÁBOR
>
>
> On Thu, May 9, 2013 at 3:40 PM, Sebastian Schelter
> wrote:
>
>> Our ALSWRFactorizer can do both flavors of ALS (the one used for
>> explicit and the one used for implicit data). @Gabor, what do you
>> specify for the constructor argument "usesImplicitFeedback" ?
>>
>>
>> On 09.05.2013 15:33, Sean Owen wrote:
>> > RMSE would have the same potential issue. ALS-WR is going to prefer to
>> > minimize one error at the expense of letting another get much larger,
>> > whereas RMSE penalizes them all the same.  It's maybe an indirect
>> > issue here at best -- there's a moderate mismatch between the metric
>> > and the nature of the algorithm.
>> >
>> > I think most of the explanation is simply overfitting then, as this is
>> > test set error. I still think it is weird that the lowest MAE occurs
>> > at f=1; maybe there's a good simple reason for that I'm missing off
>> > the top of my head.
>> >
>> > FWIW When I tune for best parameters on this data set, according to a
>> > mean average precision metric, I end up with an optimum more like 15
>> > features and lambda=0.05 (although, note, I'm using a different
>> > default alpha, 1, and a somewhat different definition of lambda).
>> >
>> >
>> >
>> > On Thu, May 9, 2013 at 2:11 PM, Gabor Bernat 
>> wrote:
>> >> I know, but the same is true for the RMSE.
>> >>
>> >> This is based on the MovieLens 100k dataset, and by using the framework's
>> >> (random) sampling to split that into a training and an evaluation set.
>> (the
>> >> RMSRecommenderEvaluator or
>> AverageAbsoluteDifferenceRecommenderEvaluators
>> >> paramters - evaluation 1.0, training 0.75).
>> >>
>> >> Bernát GÁBOR
>> >>
>> >>
>> >> On Thu, May 9, 2013 at 3:05 PM, Sean Owen  wrote:
>> >>
>> >>> (The MAE metric may also be a complicating issue... it's measuring
>> >>> average error where all elements are equally weighted, but as the "WR"
>> >>> suggests in ALS-WR, the loss function being minimized weights
>> >>> different elements differently.)
>> >>>
>> >>> This is based on a test set right, separate from the training set?
>> >>> If you are able, measure the MAE on your training set too. If
>> >>> overfitting is the issue, you should see low error on the training
>> >>> set, and higher error on the test set, when f is high and lambda is
>> >>> low.
>> >>>
>> >>> On Thu, May 9, 2013 at 1:49 PM, Gabor Bernat 
>> >>> wrote:
>> >>>> Hello,
>> >>>>
>> >>>> Here it is: http://i.imgur.com/3e1eTE5.png
>> >>>> I've used 75% for training and 25% for evaluation.
>> >>>>
>> >>>> Well reasonably lambda gives close enough results, however not better.
>> >>>>
>> >>>> Thanks,
>> >>>>
>> >>>>
>> >>>> Bernát GÁBOR
>> >>>>
>> >>>>
>> >>>> On Thu, May 9, 2013 at 2:46 PM, Sean Owen  wrote:
>> >>>>
>> >>>>> This sounds like overfitting. More features lets you fit your
>> training
>> >>>>> set better, but at some point, fitting too well means you fit other
>> >>>>> test data less well. Lambda resists overfitting, so setting it too
>> low
>> >>>>> increases the overfitting problem.
>> >>>>>
>> >>>>> I assume you still get better test set results with a reasonable
>> lambda?
>> >>>>>
>> >>>>> On Thu, May 9, 2013 at 1:38 PM, Gabor Bernat 
>> >>>>> wrote:
>> >>>>>> Hello,
>> >>>>>>
>> >>>>>> So I've been testing out the ALSWR with the MovieLens 100k dataset,
>> >>>>>> and I've run into some strange stuff. An example of this you can see
>> >>>>>> in the attached picture.
>> >>>>>>
>> >>>>>> So I've used feature counts 1,2,4,8,16,32, same for iteration, and
>> >>>>>> summed up the results in a table. So for a lambda higher than 0.07
>> >>>>>> the more important factor seems to be the iteration count, while
>> >>>>>> increasing the feature count may improve the result, however not
>> >>>>>> that much. And this is what one could expect from the algorithm, so
>> >>>>>> that's okay.
>> >>>>>>
>> >>>>>> The strange stuff comes for lambdas smaller than 0.075. In this case
>> >>>>>> the more important part becomes the feature count, however not more
>> >>>>>> but less is better. Similarly for the iteration count. Essentially
>> >>>>>> the best score is achieved for a really small lambda, and a single
>> >>>>>> feature and iteration count. How is this possible, am I missing
>> >>>>>> something?
>> >>>>>>
>> >>>>>>
>> >>>>>> Bernát GÁBOR
>> >>>>>
>> >>>
>>
>>


Re: ALSWR MovieLens 100k

2013-05-09 Thread Sean Owen
RMSE would have the same potential issue. ALS-WR is going to prefer to
minimize one error at the expense of letting another get much larger,
whereas RMSE penalizes them all the same.  It's maybe an indirect
issue here at best -- there's a moderate mismatch between the metric
and the nature of the algorithm.

I think most of the explanation is simply overfitting then, as this is
test set error. I still think it is weird that the lowest MAE occurs
at f=1; maybe there's a good simple reason for that I'm missing off
the top of my head.

FWIW When I tune for best parameters on this data set, according to a
mean average precision metric, I end up with an optimum more like 15
features and lambda=0.05 (although, note, I'm using a different
default alpha, 1, and a somewhat different definition of lambda).
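
If you want to reproduce that kind of tuning sweep: Mahout doesn't report
mean average precision directly, but its IR evaluator gives precision and
recall at N, which serves much the same purpose here. An untested sketch
(the data file name is made up):

import java.io.File;
import org.apache.mahout.cf.taste.eval.IRStatistics;
import org.apache.mahout.cf.taste.eval.RecommenderBuilder;
import org.apache.mahout.cf.taste.impl.eval.GenericRecommenderIRStatsEvaluator;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.recommender.svd.ALSWRFactorizer;
import org.apache.mahout.cf.taste.impl.recommender.svd.SVDRecommender;
import org.apache.mahout.cf.taste.model.DataModel;

public class IRStatsSweep {
  public static void main(String[] args) throws Exception {
    DataModel model = new FileDataModel(new File("u.data"));
    // 15 features, lambda 0.05, 20 iterations -- example values only
    RecommenderBuilder builder = m ->
        new SVDRecommender(m, new ALSWRFactorizer(m, 15, 0.05, 20));
    IRStatistics stats = new GenericRecommenderIRStatsEvaluator().evaluate(
        builder, null, model, null, 10,
        GenericRecommenderIRStatsEvaluator.CHOOSE_THRESHOLD, 1.0);
    System.out.println("P@10=" + stats.getPrecision()
        + " R@10=" + stats.getRecall());
  }
}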



On Thu, May 9, 2013 at 2:11 PM, Gabor Bernat  wrote:
> I know, but the same is true for the RMSE.
>
> This is based on the MovieLens 100k dataset, and by using the framework's
> (random) sampling to split that into a training and an evaluation set. (the
> RMSRecommenderEvaluator or AverageAbsoluteDifferenceRecommenderEvaluators
> parameters - evaluation 1.0, training 0.75).
>
> Bernát GÁBOR
>
>
> On Thu, May 9, 2013 at 3:05 PM, Sean Owen  wrote:
>
>> (The MAE metric may also be a complicating issue... it's measuring
>> average error where all elements are equally weighted, but as the "WR"
>> suggests in ALS-WR, the loss function being minimized weights
>> different elements differently.)
>>
>> This is based on a test set right, separate from the training set?
>> If you are able, measure the MAE on your training set too. If
>> overfitting is the issue, you should see low error on the training
>> set, and higher error on the test set, when f is high and lambda is
>> low.
>>
>> On Thu, May 9, 2013 at 1:49 PM, Gabor Bernat 
>> wrote:
>> > Hello,
>> >
>> > Here it is: http://i.imgur.com/3e1eTE5.png
>> > I've used 75% for training and 25% for evaluation.
>> >
>> > Well reasonably lambda gives close enough results, however not better.
>> >
>> > Thanks,
>> >
>> >
>> > Bernát GÁBOR
>> >
>> >
>> > On Thu, May 9, 2013 at 2:46 PM, Sean Owen  wrote:
>> >
>> >> This sounds like overfitting. More features lets you fit your training
>> >> set better, but at some point, fitting too well means you fit other
>> >> test data less well. Lambda resists overfitting, so setting it too low
>> >> increases the overfitting problem.
>> >>
>> >> I assume you still get better test set results with a reasonable lambda?
>> >>
>> >> On Thu, May 9, 2013 at 1:38 PM, Gabor Bernat 
>> >> wrote:
>> >> > Hello,
>> >> >
>> >> > So I've been testing out the ALSWR with the MovieLens 100k dataset,
>> >> > and I've run into some strange stuff. An example of this you can see
>> >> > in the attached picture.
>> >> >
>> >> > So I've used feature counts 1,2,4,8,16,32, same for iteration, and
>> >> > summed up the results in a table. So for a lambda higher than 0.07 the
>> >> > more important factor seems to be the iteration count, while increasing
>> >> > the feature count may improve the result, however not that much. And
>> >> > this is what one could expect from the algorithm, so that's okay.
>> >> >
>> >> > The strange stuff comes for lambdas smaller than 0.075. In this case
>> >> > the more important part becomes the feature count, however not more but
>> >> > less is better. Similarly for the iteration count. Essentially the best
>> >> > score is achieved for a really small lambda, and a single feature and
>> >> > iteration count. How is this possible, am I missing something?
>> >> >
>> >> >
>> >> > Bernát GÁBOR
>> >>
>>


Re: ALSWR MovieLens 100k

2013-05-09 Thread Sean Owen
(The MAE metric may also be a complicating issue... it's measuring
average error where all elements are equally weighted, but as the "WR"
suggests in ALS-WR, the loss function being minimized weights
different elements differently.)

This is based on a test set right, separate from the training set?
If you are able, measure the MAE on your training set too. If
overfitting is the issue, you should see low error on the training
set, and higher error on the test set, when f is high and lambda is
low.

On Thu, May 9, 2013 at 1:49 PM, Gabor Bernat  wrote:
> Hello,
>
> Here it is: http://i.imgur.com/3e1eTE5.png
> I've used 75% for training and 25% for evaluation.
>
> Well reasonably lambda gives close enough results, however not better.
>
> Thanks,
>
>
> Bernát GÁBOR
>
>
> On Thu, May 9, 2013 at 2:46 PM, Sean Owen  wrote:
>
>> This sounds like overfitting. More features lets you fit your training
>> set better, but at some point, fitting too well means you fit other
>> test data less well. Lambda resists overfitting, so setting it too low
>> increases the overfitting problem.
>>
>> I assume you still get better test set results with a reasonable lambda?
>>
>> On Thu, May 9, 2013 at 1:38 PM, Gabor Bernat 
>> wrote:
>> > Hello,
>> >
>> > So I've been testing out the ALSWR with the MovieLens 100k dataset, and
>> > I've run into some strange stuff. An example of this you can see in the
>> > attached picture.
>> >
>> > So I've used feature counts 1,2,4,8,16,32, same for iteration, and summed
>> > up the results in a table. So for a lambda higher than 0.07 the more
>> > important factor seems to be the iteration count, while increasing the
>> > feature count may improve the result, however not that much. And this is
>> > what one could expect from the algorithm, so that's okay.
>> >
>> > The strange stuff comes for lambdas smaller than 0.075. In this case the
>> > more important part becomes the feature count, however not more but less
>> > is better. Similarly for the iteration count. Essentially the best score
>> > is achieved for a really small lambda, and a single feature and iteration
>> > count. How is this possible, am I missing something?
>> >
>> >
>> > Bernát GÁBOR
>>


Re: ALSWR MovieLens 100k

2013-05-09 Thread Sean Owen
This sounds like overfitting. More features lets you fit your training
set better, but at some point, fitting too well means you fit other
test data less well. Lambda resists overfitting, so setting it too low
increases the overfitting problem.

I assume you still get better test set results with a reasonable lambda?
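
For concreteness, the comparison I mean, in Taste terms (an untested
sketch; the data file name is made up) -- hold out 25% per user and watch
the held-out error as lambda varies:

import java.io.File;
import org.apache.mahout.cf.taste.eval.RecommenderBuilder;
import org.apache.mahout.cf.taste.impl.eval.AverageAbsoluteDifferenceRecommenderEvaluator;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.recommender.svd.ALSWRFactorizer;
import org.apache.mahout.cf.taste.impl.recommender.svd.SVDRecommender;
import org.apache.mahout.cf.taste.model.DataModel;

public class AlsLambdaSweep {
  public static void main(String[] args) throws Exception {
    DataModel model = new FileDataModel(new File("u.data"));
    for (double lambda : new double[] {0.01, 0.065, 0.1}) {
      RecommenderBuilder builder = m ->
          new SVDRecommender(m, new ALSWRFactorizer(m, 16, lambda, 16));
      // Train on 75% of each user's prefs, score MAE on the held-out 25%
      double mae = new AverageAbsoluteDifferenceRecommenderEvaluator()
          .evaluate(builder, null, model, 0.75, 1.0);
      System.out.println("lambda=" + lambda + "  MAE=" + mae);
    }
  }
}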

On Thu, May 9, 2013 at 1:38 PM, Gabor Bernat  wrote:
> Hello,
>
> So I've been testing out the ALSWR with the MovieLens 100k dataset, and
> I've run into some strange stuff. An example of this you can see in the
> attached picture.
>
> So I've used feature counts 1,2,4,8,16,32, same for iteration, and summed up
> the results in a table. So for a lambda higher than 0.07 the more important
> factor seems to be the iteration count, while increasing the feature count
> may improve the result, however not that much. And this is what one could
> expect from the algorithm, so that's okay.
>
> The strange stuff comes for lambdas smaller than 0.075. In this case the
> more important part becomes the feature count, however not more but less is
> better. Similarly for the iteration count. Essentially the best score is
> achieved for a really small lambda, and a single feature and iteration
> count. How is this possible, am I missing something?
>
>
> Bernát GÁBOR


Re: Question about evaluating a Recommender System

2013-05-08 Thread Sean Owen
Ah, yes that's right. Yes if you have a lot of these values, the test
is really not valid. It may look 'better' but isn't for just this
reason. You want to make sure the result doesn't have many of these or
else you would discard it. Look for log lines like "Unable to
recommend in X cases"

On Wed, May 8, 2013 at 8:00 PM, Zhongduo Lin  wrote:
> This accounts for why a neighborhood size of 2 always gives me the best
> result. Thank you!
>
>
> Best Regards,
> Jimmy
>
> Zhongduo Lin (Jimmy)
> MASc candidate in ECE department
> University of Toronto
>
> On 2013-05-08 2:40 PM, Alejandro Bellogin Kouki wrote:
>>
>> AFAIK, the recommender would predict a NaN, which will be ignored by the
>> evaluator.
>>
>> However, I am not sure if there is any way to know how many of these
>> were actually produced in the evaluation step, that is, something like
>> the count of predictions with a NaN value.
>>
>> Cheers,
>> Alex
>>
>> Zhongduo Lin escribió:
>>>
>>> Thank you for the quick response.
>>>
>>> I agree that a neighborhood size of 2 will make the predictions more
>>> sensible. But my concern is that a neighborhood size of 2 can only
>>> predict a very small proportion of preferences for each user. Let's
>>> take a look at the previous example: how can it predict item 4 if
>>> item 4 happens to be chosen in the test set? I think this is quite
>>> common in my case as well as for Amazon or eBay, since the rating is
>>> very sparse. So I just don't know how it can still be run.
>>>
>>> User 1rated item 1, 2, 3, 4
>>> neighbour1 of user 1  rated item 1, 2
>>> neighbour2 of user 1  rated item 1, 3
>>>
>>>
>>> I wouldn't expect that the Root Mean Square error will have different
>>> performance than the Absolute difference, since in that case most of
>>> the predictions are close to 1, resulting in a near-zero error no matter
>>> whether I am using absolute difference or RMSE. How can I say "RMSE is worse
>>> relative to the variance of the data set" using Mahout? Unfortunately
>>> I got an error using the precision and recall evaluation method, I
>>> guess that's because the data are too sparse.
>>>
>>> Best Regards,
>>> Jimmy
>>>
>>>
>>> On 13-05-08 10:05 AM, Sean Owen wrote:
>>>>
>>>> It may be true that the results are best with a neighborhood size of
>>>> 2. Why is that surprising? Very similar people, by nature, rate
>>>> similar things, which makes the things you held out of a user's test
>>>> set likely to be found in the recommendations.
>>>>
>>>> The mapping you suggest is not that sensible, yes, since almost
>>>> everything maps to 1. Not surprisingly, most of your predictions are
>>>> near 1. That's "better" in an absolute sense, but RMSE is worse
>>>> relative to the variance of the data set. This is not a good mapping
>>>> -- or else, RMSE is not a very good metric, yes. So, don't do one of
>>>> those two things.
>>>>
>>>> Try mean average precision for a metric that is not directly related
>>>> to the prediction values.
>>>>
>>>> On Wed, May 8, 2013 at 2:45 PM, Zhongduo Lin  wrote:
>>>>>
>>>>> Thank you for your reply.
>>>>>
>>>>> I think the evaluation process involves randomly choosing the
>>>>> evaluation
>>>>> proportion. The problem is that I always get the best result when I set
>>>>> neighbors to 2, which seems unreasonable to me. Since there should
>>>>> be many
>>>>> test cases that the recommender system couldn't predict at all. So
>>>>> why did I
>>>>> still get a valid result? How does Mahout handle this case?
>>>>>
>>>>> Sorry I didn't make myself clear for the second question. Here is the
>>>>> problem: I have a set of inferred preference ranging from 0 to 1000.
>>>>> But I
>>>>> want to map it to 1 - 5. So there can be many ways for mapping.
>>>>> Let's take a
>>>>> simple example, if the mapping rule is like the following:
>>>>>  if (inferred_preference < 995) preference = 1;
>>>>>  else preference = inferred_preference - 995.
>>>>>
>>>>> You can see that this is a really bad mapping algorithm, but if we run
>>>>> the generated preferences through Mahout, it is going to give me a really
>>>>> nice result because most of the preferences are 1. So is there any other
>>>>> metric to evaluate this?

Re: Question about evaluating a Recommender System

2013-05-08 Thread Sean Owen
It may be selected as a test item. Other algorithms can predict the
'4'. The test process is random so as to not favor one algorithm.
I think you are just arguing that the algorithm you are using isn't
good for your data -- so just don't use it. Is that not the answer?
I don't know what you mean by the mapping algorithm.

On Wed, May 8, 2013 at 4:17 PM, Zhongduo Lin  wrote:
> Thank you for your reply. So in the case that item 4 is in the test set,
> will Mahout just not take it into consideration or generate any preference
> instead? And is there any way to evaluate the mapping algorithm in
> Mahout?
>
> Best Regards,
> Jimmy
>


Re: Question about evaluating a Recommender System

2013-05-08 Thread Sean Owen
You can't predict item 4 in that case. That shows the weakness of
neighborhood approaches for sparse data. That's pretty much the story
-- it's all working correctly. Maybe you should not use this approach.

On Wed, May 8, 2013 at 4:00 PM, Zhongduo Lin  wrote:
> Thank you for the quick response.
>
> I agree that a neighborhood size of 2 will make the predictions more
> sensible. But my concern is that a neighborhood size of 2 can only predict a
> very small proportion of preferences for each user. Let's take a look at the
> previous example: how can it predict item 4 if item 4 happens to be chosen
> in the test set? I think this is quite common in my case as well as for
> Amazon or eBay, since the rating is very sparse. So I just don't know how it
> can still be run.
>
>
> User 1rated item 1, 2, 3, 4
> neighbour1 of user 1  rated item 1, 2
> neighbour2 of user 1  rated item 1, 3
>
>
> I wouldn't expect that the Root Mean Square error will have different
> performance than the Absolute difference, since in that case most of the
> predictions are close to 1, resulting in a near-zero error no matter whether I am using
> absolute difference or RMSE. How can I say "RMSE is worse relative to the
> variance of the data set" using Mahout? Unfortunately I got an error using
> the precision and recall evaluation method, I guess that's because the data
> are too sparse.
>
> Best Regards,
> Jimmy
>
>
>
> On 13-05-08 10:05 AM, Sean Owen wrote:
>>
>> It may be true that the results are best with a neighborhood size of
>> 2. Why is that surprising? Very similar people, by nature, rate
>> similar things, which makes the things you held out of a user's test
>> set likely to be found in the recommendations.
>>
>> The mapping you suggest is not that sensible, yes, since almost
>> everything maps to 1. Not surprisingly, most of your predictions are
>> near 1. That's "better" in an absolute sense, but RMSE is worse
>> relative to the variance of the data set. This is not a good mapping
>> -- or else, RMSE is not a very good metric, yes. So, don't do one of
>> those two things.
>>
>> Try mean average precision for a metric that is not directly related
>> to the prediction values.
>>
>> On Wed, May 8, 2013 at 2:45 PM, Zhongduo Lin  wrote:
>>>
>>> Thank you for your reply.
>>>
>>> I think the evaluation process involves randomly choosing the evaluation
>>> proportion. The problem is that I always get the best result when I set
>>> neighbors to 2, which seems unreasonable to me. Since there should be
>>> many
>>> test cases that the recommender system couldn't predict at all. So why did
>>> I
>>> still get a valid result? How does Mahout handle this case?
>>>
>>> Sorry I didn't make myself clear for the second question. Here is the
>>> problem: I have a set of inferred preference ranging from 0 to 1000. But
>>> I
>>> want to map it to 1 - 5. So there can be many ways for mapping. Let's
>>> take a
>>> simple example, if the mapping rule is like the following:
>>>  if (inferred_preference < 995) preference = 1;
>>>  else preference = inferred_preference - 995.
>>>
>>> You can see that this is a really bad mapping algorithm, but if we run
>>> the generated preferences through Mahout, it is going to give me a really
>>> nice result because most of the preferences are 1. So is there any other
>>> metric to evaluate this?
>>>
>>>
>>> Any help will be highly appreciated.
>>>
>>> Best Regards,
>>> Jimmy
>>>
>>>
>>> Zhongduo Lin (Jimmy)
>>> MASc candidate in ECE department
>>> University of Toronto
>>>
>>>
>>> On 2013-05-08 4:44 AM, Sean Owen wrote:
>>>>
>>>> It is true that a process based on user-user similarity only won't be
>>>> able to recommend item 4 in this example. This is a drawback of the
>>>> algorithm and not something that can be worked around. You could try
>>>> not to choose this item in the test set, but then that does not quite
>>>> reflect reality in the test.
>>>>
>>>> If you just mean that compressing the range of pref values improves
>>>> RMSE in absolute terms, yes it does of course. But not in relative
>>>> terms. There is nothing inherently better or worse about a small range
>>>> in this example.
>>>>
>>>> RMSE is a fine eval metric, but you can also consider mean average
>>>> precision.
>>>>
>>>> Sean
>
>


Re: Question about evaluating a Recommender System

2013-05-08 Thread Sean Owen
It may be true that the results are best with a neighborhood size of
2. Why is that surprising? Very similar people, by nature, rate
similar things, which makes the things you held out of a user's test
set likely to be found in the recommendations.

The mapping you suggest is not that sensible, yes, since almost
everything maps to 1. Not surprisingly, most of your predictions are
near 1. That's "better" in an absolute sense, but RMSE is worse
relative to the variance of the data set. This is not a good mapping
-- or else, RMSE is not a very good metric, yes. So, don't do one of
those two things.

Try mean average precision for a metric that is not directly related
to the prediction values.

On Wed, May 8, 2013 at 2:45 PM, Zhongduo Lin  wrote:
> Thank you for your reply.
>
> I think the evaluation process involves randomly choosing the evaluation
> proportion. The problem is that I always get the best result when I set
> neighbors to 2, which seems unreasonable to me. Since there should be many
> test cases that the recommender system couldn't predict at all. So why did I
> still get a valid result? How does Mahout handle this case?
>
> Sorry I didn't make myself clear for the second question. Here is the
> problem: I have a set of inferred preference ranging from 0 to 1000. But I
> want to map it to 1 - 5. So there can be many ways for mapping. Let's take a
> simple example, if the mapping rule is like the following:
> if (inferred_preference < 995) preference = 1;
> else preference = inferred_preference - 995.
>
> You can see that this is a really bad mapping algorithm, but if we run the
> generated preferences through Mahout, it is going to give me a really nice
> result because most of the preferences are 1. So is there any other metric to
> evaluate this?
>
>
> Any help will be highly appreciated.
>
> Best Regards,
> Jimmy
>
>
> Zhongduo Lin (Jimmy)
> MASc candidate in ECE department
> University of Toronto
>
>
> On 2013-05-08 4:44 AM, Sean Owen wrote:
>>
>> It is true that a process based on user-user similarity only won't be
>> able to recommend item 4 in this example. This is a drawback of the
>> algorithm and not something that can be worked around. You could try
>> not to choose this item in the test set, but then that does not quite
>> reflect reality in the test.
>>
>> If you just mean that compressing the range of pref values improves
>> RMSE in absolute terms, yes it does of course. But not in relative
>> terms. There is nothing inherently better or worse about a small range
>> in this example.
>>
>> RMSE is a fine eval metric, but you can also consider mean average
>> precision.
>>
>> Sean


Re: Question about evaluating a Recommender System

2013-05-08 Thread Sean Owen
It is true that a process based on user-user similarity only won't be
able to recommend item 4 in this example. This is a drawback of the
algorithm and not something that can be worked around. You could try
not to choose this item in the test set, but then that does not quite
reflect reality in the test.

If you just mean that compressing the range of pref values improves
RMSE in absolute terms, yes it does of course. But not in relative
terms. There is nothing inherently better or worse about a small range
in this example.

RMSE is a fine eval metric, but you can also consider mean average precision.

Sean

On Wed, May 8, 2013 at 4:29 AM, Zhongduo Lin  wrote:
> Hi All,
>
> I am using the Mahout to build a user-based recommender system (RS). The
> evaluation method I am using is
> AverageAbsoluteDifferenceRecommenderEvaluator, which according to the
> "Mahout in Action" randomly sets aside some existing preference and
> calculate the difference between the predicted value and the real one. The
> first question I have is that in a user-based RS, if we choose a small
> number of neighbours, then it is quite possible that the prediction is not
> available at all. Here is an example:
>
> User 1 rated item 1, 2, 3, 4
> neighbour1 of user 1  rated item 1, 2
> neighbour2 of user 1  rated item 1, 3
>
> In the case above, the number of neighbours is two, so if we take out the
> rating of user 1 for item 4, there is no way to predict it. How will Mahout
> deal with such a problem?
>
> Also, I am trying to map inferred preferences to a scale of 1-5. But the
> problem is that if I simply map all the preference to 1-2, then I will get a
> really nice evaluation result (almost 0), but you can easily see that this
> is not a right way to do it. So I guess the question is whether there is
> another way to evaluate the preference mapping algorithm.
>
> Any help will be highly appreciated.
>
> Best Regards,
> Jimmy


Re: Clustering product views and sales

2013-05-06 Thread Sean Owen
It sounds like you don't quite have a cold start problem. You have a
few behaviors, a few views or clicks, not zero. So you really just
need to find an approach that's quite comfortable with sparse input. A
low-rank factorization model like ALS works fine in this case, for
example.

There's a circularity problem in thinking about solving this with
clustering: if you have not enough data to recommend to users at the
start, on what data are you clustering them before that?

I don't think you need clustering either. (Of course, you can cluster
easily from the representation you get out of something like a
low-rank factorization. It can easily be an output rather than an
'input'.)

As to evaluation, it depends a little on what you mean by frequent
item sets and evaluation. You say a result is good if it occurs
frequently overall with other items the user viewed? It makes some
sense, although it sounds like you're just testing if the recommender
does exactly what a item-similarity-based recommender would do when
based on co-occurrence between items. That is, if that's defined as
the right answer, then save yourself the trouble and build the
recommender to give exactly that answer?

Usually you see if the model recommends back things the user actually
viewed, that were held out of the training data. This has its own
problems but presupposing a correct algorithm isn't one of them.


Re: parallelALS and RMSE TEST

2013-05-06 Thread Sean Owen
Yes, that's really what I mean. ALS factors, among other things, a
matrix of 1 where an interaction occurs and nothing (implicitly 0)
everywhere else.
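
If you want the factorization to treat that 0/1 structure as weighted
implicit feedback rather than as literal ratings of 1, the job has a switch
for it -- something like this, if memory serves (parameter values are only
examples):

mahout parallelALS --input /path/to/prefs --output /path/to/out \
  --numFeatures 20 --numIterations 10 --lambda 0.065 \
  --implicitFeedback true --alpha 40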

On Mon, May 6, 2013 at 9:40 PM, Tevfik Aytekin  wrote:
> But the data under consideration here is not 0/1 data, it contains only 1's.
>
> On Mon, May 6, 2013 at 11:29 PM, Sean Owen  wrote:
>> Parallel ALS is exactly an example of where you can use matrix
>> factorization for "0/1" data.
>>
>> On Mon, May 6, 2013 at 9:22 PM, Tevfik Aytekin  
>> wrote:
>>> Hi Sean,
>>> Aren't boolean preferences supported in the context of memory-based
>>> recommendation algorithms in Mahout?
>>> Are there matrix factorization algorithms in Mahout which can work
>>> with this kind of data (that is, the kind of data which consists of
>>> users and the movies they have seen).
>>>
>>>
>>>
>>>
>>> On Mon, May 6, 2013 at 10:34 PM, Sean Owen  wrote:
>>>> Yes, it goes by the name 'boolean prefs' in the project since target
>>>> variables don't have values -- they just exist or don't.
>>>> So, yes it's certainly supported but the question here is how to
>>>> evaluate the output.
>>>>
>>>> On Mon, May 6, 2013 at 8:29 PM, Tevfik Aytekin  
>>>> wrote:
>>>>> This problem is called one-class classification problem. In the domain
>>>>> of collaborative filtering it is called one-class collaborative
>>>>> filtering (since what you have are only positive preferences). You may
>>>>> search the web with these key words to find papers providing
>>>>> solutions. I'm not sure whether Mahout has algorithms for one-class
>>>>> collaborative filtering.
>>>>>
>>>>> On Mon, May 6, 2013 at 1:42 PM, Sean Owen  wrote:
>>>>>> ALS-WR weights the error on each term differently, so the average
>>>>>> error doesn't really have meaning here, even if you are comparing the
>>>>>> difference with "1". I think you will need to fall back to mean
>>>>>> average precision or something.
>>>>>>
>>>>>> On Mon, May 6, 2013 at 11:24 AM, William  
>>>>>> wrote:
>>>>>>> Sean Owen  gmail.com> writes:
>>>>>>>
>>>>>>>>
>>>>>>>> If you have no ratings, how are you using RMSE? this typically
>>>>>>>> measures error in reconstructing ratings.
>>>>>>>> I think you are probably measuring something meaningless.
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> I suppose the rating of seen movies is 1. Is that right?
>>>>>>> If I use Collaborative Filtering with ALS-WR to get some 
>>>>>>> recommendations, I
>>>>>>> must have a real rating-matrix?
>>>>>>>
>>>>>>>
>>>>>>>


Re: parallelALS and RMSE TEST

2013-05-06 Thread Sean Owen
Parallel ALS is exactly an example of where you can use matrix
factorization for "0/1" data.

On Mon, May 6, 2013 at 9:22 PM, Tevfik Aytekin  wrote:
> Hi Sean,
> Aren't boolean preferences supported in the context of memory-based
> recommendation algorithms in Mahout?
> Are there matrix factorization algorithms in Mahout which can work
> with this kind of data (that is, the kind of data which consists of
> users and the movies they have seen).
>
>
>
>
> On Mon, May 6, 2013 at 10:34 PM, Sean Owen  wrote:
>> Yes, it goes by the name 'boolean prefs' in the project since target
>> variables don't have values -- they just exist or don't.
>> So, yes it's certainly supported but the question here is how to
>> evaluate the output.
>>
>> On Mon, May 6, 2013 at 8:29 PM, Tevfik Aytekin  
>> wrote:
>>> This problem is called one-class classification problem. In the domain
>>> of collaborative filtering it is called one-class collaborative
>>> filtering (since what you have are only positive preferences). You may
>>> search the web with these key words to find papers providing
>>> solutions. I'm not sure whether Mahout has algorithms for one-class
>>> collaborative filtering.
>>>
>>> On Mon, May 6, 2013 at 1:42 PM, Sean Owen  wrote:
>>>> ALS-WR weights the error on each term differently, so the average
>>>> error doesn't really have meaning here, even if you are comparing the
>>>> difference with "1". I think you will need to fall back to mean
>>>> average precision or something.
>>>>
>>>> On Mon, May 6, 2013 at 11:24 AM, William  wrote:
>>>>> Sean Owen  gmail.com> writes:
>>>>>
>>>>>>
>>>>>> If you have no ratings, how are you using RMSE? this typically
>>>>>> measures error in reconstructing ratings.
>>>>>> I think you are probably measuring something meaningless.
>>>>>>
>>>>>
>>>>>
>>>>> I suppose the rating of seen movies is 1. Is that right?
>>>>> If I use Collaborative Filtering with ALS-WR to get some recommendations, 
>>>>> I
>>>>> must have a real rating-matrix?
>>>>>
>>>>>
>>>>>


Re: parallelALS and RMSE TEST

2013-05-06 Thread Sean Owen
Yes, it goes by the name 'boolean prefs' in the project since target
variables don't have values -- they just exist or don't.
So, yes it's certainly supported but the question here is how to
evaluate the output.

On Mon, May 6, 2013 at 8:29 PM, Tevfik Aytekin  wrote:
> This problem is called one-class classification problem. In the domain
> of collaborative filtering it is called one-class collaborative
> filtering (since what you have are only positive preferences). You may
> search the web with these key words to find papers providing
> solutions. I'm not sure whether Mahout has algorithms for one-class
> collaborative filtering.
>
> On Mon, May 6, 2013 at 1:42 PM, Sean Owen  wrote:
>> ALS-WR weights the error on each term differently, so the average
>> error doesn't really have meaning here, even if you are comparing the
>> difference with "1". I think you will need to fall back to mean
>> average precision or something.
>>
>> On Mon, May 6, 2013 at 11:24 AM, William  wrote:
>>> Sean Owen  gmail.com> writes:
>>>
>>>>
>>>> If you have no ratings, how are you using RMSE? this typically
>>>> measures error in reconstructing ratings.
>>>> I think you are probably measuring something meaningless.
>>>>
>>>
>>>
>>> I suppose the rating of seen movies is 1. Is that right?
>>> If I use Collaborative Filtering with ALS-WR to get some recommendations, I
>>> must have a real rating-matrix?
>>>
>>>
>>>


Re: parallelALS and RMSE TEST

2013-05-06 Thread Sean Owen
ALS-WR weights the error on each term differently, so the average
error doesn't really have meaning here, even if you are comparing the
difference with "1". I think you will need to fall back to mean
average precision or something.

On Mon, May 6, 2013 at 11:24 AM, William  wrote:
> Sean Owen  gmail.com> writes:
>
>>
>> If you have no ratings, how are you using RMSE? this typically
>> measures error in reconstructing ratings.
>> I think you are probably measuring something meaningless.
>>
>
>
> I suppose the rating of seen movies is 1. Is that right?
> If I use Collaborative Filtering with ALS-WR to get some recommendations, I
> must have a real rating-matrix?
>
>
>


Re: parallelALS and RMSE TEST

2013-05-06 Thread Sean Owen
If you have no ratings, how are you using RMSE? this typically
measures error in reconstructing ratings.
I think you are probably measuring something meaningless.

On Mon, May 6, 2013 at 10:17 AM, William  wrote:
> I have a dataset of users and movies (no ratings), but I want to get some
> recommendations from this dataset.
> I only know whether users have seen a movie or not, so I set up the rating
> matrix like this: seen movies are 1, unseen movies are missing.
> I use the parallelALS function to decompose this matrix with three
> parameters (numfeatures, numIterations, lambda), and I would like to find
> the best combination to reduce the RMSE.
> In my experiment, the RMSE value decreases with larger numIterations, but it
> increases with larger numfeatures. When I experiment with another rating
> matrix (from the Mahout official website), everything is fine.
> So how can this be explained? Can't I just set all ratings to 1?
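
Since mean average precision is the fallback metric suggested in the replies
above, its standard top-N definition is (a generic sketch, not a Mahout API):

    \mathrm{AP}@N(u) = \frac{1}{\min(N, |I_u|)} \sum_{k=1}^{N} P_u(k)\,\mathrm{rel}_u(k),
    \qquad \mathrm{MAP}@N = \frac{1}{|U|} \sum_{u \in U} \mathrm{AP}@N(u)

where rel_u(k) is 1 if the item ranked k for user u is among u's held-out
items I_u, and P_u(k) is the precision of u's top-k list.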


Re: Mahout & database pooling best practice

2013-05-01 Thread Sean Owen
Rather, it needs to extend ConnectionPoolDataSource. But you can
ignore it if you're sure you are using a pooling implementation. You
might just double-check that.
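
One concrete way to get both pooling and a type the check recognizes is to
wrap the plain MySQL source in Mahout's own wrapper class -- a sketch only,
since whether this particular wrapper satisfies the instanceof test depends
on the Mahout version, and the connection details here are placeholders:

import javax.sql.DataSource;
import org.apache.mahout.cf.taste.impl.model.jdbc.ConnectionPoolDataSource;
import org.apache.mahout.cf.taste.impl.model.jdbc.MySQLJDBCDataModel;
import org.apache.mahout.cf.taste.model.DataModel;
import com.mysql.jdbc.jdbc2.optional.MysqlDataSource;

// Plain (non-pooling) MySQL DataSource with placeholder connection details
MysqlDataSource mysql = new MysqlDataSource();
mysql.setServerName("localhost");
mysql.setDatabaseName("recommender");
mysql.setUser("mahout");
mysql.setPassword("secret");

// Wrap it so AbstractJDBCDataModel sees a pooling implementation
DataSource pooled = new ConnectionPoolDataSource(mysql);
DataModel model = new MySQLJDBCDataModel(pooled); // default taste_preferences layout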

On Wed, May 1, 2013 at 9:25 AM, Mugoma Joseph O.  wrote:
> Thanks Sean.
>
> From source, AbstractJDBCDataModel.java expects any data source to be an
> instance of ConnectionPoolDataSource:
>
>
> if (!(dataSource instanceof ConnectionPoolDataSource)) {
>   log.warn("You are not using ConnectionPoolDataSource. Make sure your DataSource pools connections "
>       + "to the database itself, or database performance will be severely reduced.");
> }
>
>
> This means that, to be accepted, a data source needs to be wrapped in
> ConnectionPoolDataSource.
>
>
> Is this interpretation correct? If it is, then is it okay to have a
> pooling data source wrapped around another pooling data source?
>
> Thanks.
>


Re: Time Based Recommender System

2013-04-30 Thread Sean Owen
GraphLab -- http://docs.graphlab.org/collaborative_filtering.html#SVD_PLUS_PLUS

On Tue, Apr 30, 2013 at 3:30 PM, Chirag Lakhani  wrote:
> Do you know of any other large scale machine learning platforms that do
> incorporate it?
>
>
> On Tue, Apr 30, 2013 at 10:21 AM, Sean Owen  wrote:
>
>> No, time is in the data model but nothing uses it that I know of.
>>
>> On Tue, Apr 30, 2013 at 3:18 PM, Chirag Lakhani 
>> wrote:
>> > I was wondering if the collaborative filtering library in Mahout has any
>> > algorithms that incorporate concept drift i.e. time dynamics.  From my
>> own
>> > research I have come across the BellKor algorithm called TimeSVD++ and
>> > there is a recent paper using hidden markov models with collaborative
>> > filtering.  Has anything of this sort been implemented in Mahout?
>> >
>> > I am working on a problem where a company offers many different services
>> > and the dataset is their subscriber's subscription levels for each
>> service
>> > at different time snapshots.  I am interested in using some sort of
>> >  recommender system that can give recommendations for new services based
>> on
>> > such data.
>>


Re: Time Based Recommender System

2013-04-30 Thread Sean Owen
No, time is in the data model but nothing uses it that I know of.
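
For reference, timestamps do live on the Taste DataModel even though the
bundled recommenders ignore them -- a minimal sketch, with a hypothetical
file name and IDs:

// prefs.csv lines: userID,itemID,preference,timestamp (4th column optional)
DataModel model = new FileDataModel(new File("prefs.csv"));
Long when = model.getPreferenceTime(1L, 101L); // epoch millis, or null if absent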

On Tue, Apr 30, 2013 at 3:18 PM, Chirag Lakhani  wrote:
> I was wondering if the collaborative filtering library in Mahout has any
> algorithms that incorporate concept drift i.e. time dynamics.  From my own
> research I have come across the BellKor algorithm called TimeSVD++ and
> there is a recent paper using hidden markov models with collaborative
> filtering.  Has anything of this sort been implemented in Mahout?
>
> I am working on a problem where a company offers many different services
> and the dataset is their subscriber's subscription levels for each service
> at different time snapshots.  I am interested in using some sort of
>  recommender system that can give recommendations for new services based on
> such data.


Re: Fold-in for ALSWR

2013-04-30 Thread Sean Owen
I should say that it depends of course on what you are implementing.
You can also write an algorithm to factor R, not P. If you're doing
that, then I would not expect values to be so low. But I thought you
were following the version where you factor P = R != 0.

Multiplying by 3 and adding 1 would not do what you want, no.

On Tue, Apr 30, 2013 at 2:24 PM, Chloe  wrote:
> Thanks again for replying.
>
> I didn't expect that since I'm using explicit feedback, not implicit, but
> mostly because the part files in U/ and V/ multiplied together give me back
> predicted ratings on a 1-4 scale.
>
> Would converting the 0/1 connection indicator to a 1-4 scale be at all
> reasonable for capturing the strength of the connection, or is that entirely
> unjustified?
>
> -Chloe
>


Re: Fold-in for ALSWR

2013-04-29 Thread Sean Owen
ALS-WR is not predicting your input matrix R, but the matrix P which
is R != 0. It is not predicting ratings, but a 0/1 indicator of
whether the connection exists. So the values are usually in [0,1].
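
In matrix terms, the fold-in step being debugged below is a least-squares
solve against the item factors (an illustrative sketch of the standard
formulation, not code from the thread):

    x_u = \arg\min_x \|a_u - V x\|^2 = (V^\top V)^{-1} V^\top a_u,
    \qquad \hat{r}_u = V x_u

where a_u is the new user's column of whatever matrix was factored. So if the
model factored P = (R != 0), a_u must be the 0/1 indicator vector, not raw
1-4 ratings; the QR decomposition in the quoted code computes the same solve,
just more stably.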

On Tue, Apr 30, 2013 at 2:40 AM, Chloe  wrote:
> Dear Sean,
>
> Thanks a lot for a quick and helpful reply. Having been sidetracked with
> another project, I revisited the problem I posed in my post over the weekend
> and, unfortunately, have a follow up question.
>
> The problem I'm facing with my implementation of your explanation is that
> the predicted ratings for new users seem to be on a very different scale
> than the original ratings the model is based on and I'm wondering what I've
> done wrong.
>
> To recap my steps in pseudocode:
> 1. Use a text file of ratings on a 1-4 scale to generate my model, afterward
> given by the files U/part-m-0 and V/part-m-0, i.e. Ratings = UV'.
>
> 2. Vector newRatings = new Vector(); ex. given 10 items a new user's ratings
> looks like {0,1,0,3,4,0,2,3,0,1}
> Matrix Au = new DenseMatrix(newRatings.size(), 1);
> Au.assignColumn(0, newRatings);
> QRDecomposition qr = new QRDecomposition(V);  //item features
> Matrix Xu = qr.solve(Au);
> Matrix predictedUserRatingsForAllItems =
> (Xu.transpose()).times(V.transpose());
>
> 3. DenseVector predictedUserRatingsVector =
> (DenseVector)predictedUserRatingsForAllItems.viewRow(0);
>
> The "predictedUserRatingsVector" from step 3, however, gives a top 10 item
> result with scores ranging from 0.46-0.62. These numbers go up with the
> number of new items rated. Which means that even for item 5, given the
> highest possible score of 4, this approach can't even give back a rating for
> a rated item close to its actual value.
> Moreover, the new user's ratings I test, {0,1,0,3,4,0,2,3,0,1}, are actually
> identical to an existing user that was used to build the model and whose
> predicted ratings are very reasonable, looking like
> {0.5,0.98,0.89,3.23,4.1,1.01,2.32,2.99,3.5,1.1}.
>
> I must be doing something wrong or missing something. Is there anything you
> or anyone from the list with fold-in experience can suggest I try or
> consider that would explain why this is happening? I expected that predicted
> ratings from fold-in would not be as good as regenerating the model, but not
> this bad.
>
> Many thanks,
> Chloe
>
>
>
>


Re: Mahout & database pooling best practice

2013-04-29 Thread Sean Owen
If you are actually using a connection pool, ignore it, it just means
the implementation doesn't appear to extend the usual connection pool
class in the JDK. Just make sure you are in fact using this class and
you're fine.

On Tue, Apr 30, 2013 at 4:01 AM, Mugoma Joseph O.  wrote:
> Hello,
>
> I am using mahout with MySQL
> (com.mysql.jdbc.jdbc2.optional.MysqlConnectionPoolDataSource).
>
> However, I keep getting this warning:
>
> o.a.m.c.t.impl.model.jdbc.AbstractJDBCDataModel |WARN | 2013-04-18
> 08:35:40,127 | qtp1380741586-307 | You are not using
> ConnectionPoolDataSource. Make sure your DataSource pools connections to
> the database itself, or database performance will be severely reduced.
>
>
> What's the best practice when connecting mahout to database?
>
> Thanks in advance.
>
> Mugoma.
>
>


Re: Mahout Similarity Caching

2013-04-23 Thread Sean Owen
I agree, but how is "pre-adding a cached value for X" different than
"requesting X from the cache"? Either way you get X in the cache.
Computing offline seems the same as computing on-line, but in some
kind of warm-up state or phase. Which can be concurrent with serving
early requests even. You can do everything else you say without a new
operation, like selectively pre-caching certain entries.

On Tue, Apr 23, 2013 at 1:14 PM, Gabor Bernat  wrote:
> It would help if CachingSimilarity also allowed entries to be added manually,
> because in that case this task could be pushed off to an offline system. And
> yes, you cannot add all the similarities to the caching object; however, based
> on history you can select some top (popular) item pairs and just precalculate
> for that subset. This could push down the worst-case request times. Any other
> ideas?
>


Re: Mahout Similarity Caching

2013-04-23 Thread Sean Owen
That still sounds far too high, and it would be interesting to profile
to see exactly what's slow. A recommendation entails making estimates
for most or all items, and so should be about as fast as making
estimates directly for a few thousand. Tanimoto similarity is trivial.
In fact it may be slowing things down to cache it.

You can 'warm up' the cache by requesting similarities, which will
then be cached. There's no real point in a separate method to give it
a cached value -- it can figure those out. The problem is, what do you
cache? you can't cache everything and you don't know what's needed
ahead of time.

Something else is not right here I think, like, the measurement is
including some other time.
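
A minimal sketch of that warm-up idea (file name, item IDs and the choice of
"popular pairs" are all hypothetical): simply requesting similarities is what
populates the cache.

import java.io.File;
import java.util.Arrays;
import java.util.List;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.similarity.CachingItemSimilarity;
import org.apache.mahout.cf.taste.impl.similarity.TanimotoCoefficientSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.similarity.ItemSimilarity;

DataModel model = new FileDataModel(new File("prefs.csv"));
ItemSimilarity cached =
    new CachingItemSimilarity(new TanimotoCoefficientSimilarity(model), model);

// e.g. top co-occurring item pairs chosen offline from history
List<long[]> popularPairs = Arrays.asList(new long[] {1L, 2L}, new long[] {1L, 3L});
for (long[] pair : popularPairs) {
  cached.itemSimilarity(pair[0], pair[1]); // computed once, then served from cache
}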

On Tue, Apr 23, 2013 at 6:20 AM, Gabor Bernat  wrote:
> Nope, and nope.
>
> Note that this is an outlier example; however, even in other cases it does
> take 500ms+, which is way too much for what I need.
>
> Thanks,
>
> Bernát GÁBOR
>
>
> On Tue, Apr 23, 2013 at 12:53 AM, Sean Owen  wrote:
>
>> 49 seconds is orders of magnitude too long -- something is very wrong
>> here, for so little data. Are you running this off a database? or are
>> you somehow counting the overhead of 3-4K network calls?
>>
>> On Mon, Apr 22, 2013 at 11:22 PM, Gabor Bernat 
>> wrote:
>> > Hello,
>> >
>> > I'm using Mahout in a system where the typical response time should be
>> > below 100ms. I'm using an item-based recommender with float preference
>> > values (with Tanimoto similarity for now, which is passed into a
>> > CachingItemSimilarity object for performance reasons). My model has around
>> > 7k items, 26k users with around 100k preferences linking them.
>> >
>> > Instead of performing a recommendation, I only need to estimate preferences
>> > of the user for around 3-4k items (this is important, as this allows the
>> > integration of a business rule engine in the recommendation process inside
>> > the system where I'm working).
>> >
>> > Now my problem is that for users with lots of preferences (200+) this
>> > estimation process takes forever (49 seconds+). I'm assuming the issue
>> > lies in the calculation of the similarity measurements, so I thought I'd
>> > do this asynchronously in a train-like process, save it, and at startup
>> > just load this precomputed information into memory. However, I cannot see
>> > any way to load this information into the CachingSimilarity object; nor
>> > can I persist the CachingSimilarity object and load it.
>> >
>> > So any ideas, on how to cut down the estimation times?
>> >
>> > Thanks,
>> >
>> > Bernát GÁBOR
>>


Re: Mahout Similarity Caching

2013-04-22 Thread Sean Owen
49 seconds is orders of magnitude too long -- something is very wrong
here, for so little data. Are you running this off a database? or are
you somehow counting the overhead of 3-4K network calls?

On Mon, Apr 22, 2013 at 11:22 PM, Gabor Bernat  wrote:
> Hello,
>
> I'm using Mahout in a system where the typical response time should be
> below 100ms. I'm using an item-based recommender with float preference
> values (with Tanimoto similarity for now, which is passed into a
> CachingItemSimilarity object for performance reasons). My model has around
> 7k items, 26k users with around 100k preferences linking them.
>
> Instead of performing a recommendation, I only need to estimate preferences
> of the user for around 3-4k items (this is important, as this allows the
> integration of a business rule engine in the recommendation process inside
> the system where I'm working).
>
> Now my problem is that for users with lots of preferences (200+) this
> estimation process takes forever (49 seconds+). I'm assuming the issue lies
> in the calculation of the similarity measurements, so I thought I'd do
> this asynchronously in a train-like process, save it, and at startup just
> load this precomputed information into memory. However, I cannot see any
> way to load this information into the CachingSimilarity object; nor can I
> persist the CachingSimilarity object and load it.
>
> So any ideas, on how to cut down the estimation times?
>
> Thanks,
>
> Bernát GÁBOR


Re: "Error creating assembly archive job: error in opening zip file"

2013-04-18 Thread Sean Owen
Probably a corrupt download inside Maven. Delete ~/.m2/repository entirely
On Apr 19, 2013 12:23 AM, "Dmitriy Lyubimov"  wrote:

> Hm. This is really not a known error. Which suggests something really
> platitudinarian: open file handle limits? lack of disk space? Sorry if
> that's not really helpful but it is not something I can repeat.
>
>
> On Thu, Apr 18, 2013 at 3:48 PM, Philipp Defner 
> wrote:
>
> > I updated maven to 3.0.4 now but the problem is still around.
> >
> > ==
> > Apache Maven 3.0.4
> > Maven home: /usr/share/maven
> > Java version: 1.6.0_27, vendor: Sun Microsystems Inc.
> > Java home: /usr/lib/jvm/java-6-openjdk-amd64/jre
> > Default locale: en_US, platform encoding: UTF-8
> > OS name: "linux", version: "3.5.0-27-generic", arch: "amd64", family:
> > "unix"
> >
> > but I'm still getting the following error message after running "mvn
> > install"
> >
> >
> > ==
> >
> > Results :
> >
> > Tests run: 689, Failures: 0, Errors: 0, Skipped: 0
> >
> > [INFO]
> > [INFO] --- maven-jar-plugin:2.4:jar (default-jar) @ mahout-core ---
> > [INFO] Building jar:
> > /usr/local/mahout/trunk/core/target/mahout-core-0.8-SNAPSHOT.jar
> > [INFO]
> > [INFO] --- maven-jar-plugin:2.4:test-jar (default) @ mahout-core ---
> > [INFO] Building jar:
> > /usr/local/mahout/trunk/core/target/mahout-core-0.8-SNAPSHOT-tests.jar
> > [INFO]
> > [INFO] --- maven-assembly-plugin:2.4:single (job) @ mahout-core ---
> > [INFO] Reading assembly descriptor: src/main/assembly/job.xml
> > [WARNING] Invalid POM for asm:asm:jar:3.1, transitive dependencies (if
> > any) will not be available, enable debug logging for more details
> > [INFO]
> > 
> > [INFO] Reactor Summary:
> > [INFO]
> > [INFO] Apache Mahout . SUCCESS
> [1.748s]
> > [INFO] Mahout Build Tools  SUCCESS
> [2.006s]
> > [INFO] Mahout Math ... SUCCESS
> > [2:11.278s]
> > [INFO] Mahout Core ... FAILURE
> > [11:58.673s]
> > [INFO] Mahout Integration  SKIPPED
> > [INFO] Mahout Examples ... SKIPPED
> > [INFO] Mahout Release Package  SKIPPED
> > [INFO]
> > 
> > [INFO] BUILD FAILURE
> > [INFO]
> > 
> > [INFO] Total time: 14:15.100s
> > [INFO] Finished at: Fri Apr 19 00:43:02 CEST 2013
> > [INFO] Final Memory: 29M/433M
> > [INFO]
> > 
> > [ERROR] Failed to execute goal
> > org.apache.maven.plugins:maven-assembly-plugin:2.4:single (job) on
> project
> > mahout-core: Failed to create assembly: Error creating assembly archive
> > job: error in opening zip file -> [Help 1]
> > [ERROR]
> > [ERROR] To see the full stack trace of the errors, re-run Maven with the
> > -e switch.
> > [ERROR] Re-run Maven using the -X switch to enable full debug logging.
> > [ERROR]
> > [ERROR] For more information about the errors and possible solutions,
> > please read the following articles:
> > [ERROR] [Help 1]
> > http://cwiki.apache.org/confluence/display/MAVEN/MojoExecutionException
> > [ERROR]
> > [ERROR] After correcting the problems, you can resume the build with the
> > command
> > [ERROR]   mvn  -rf :mahout-core
> >
> > ==
> >
> > So it's basically still the same "Failed to create assembly: Error
> > creating assembly archive job: error in opening zip file" error.
> >
> > Thanks for your input so far.
> >
> > On Apr 18, 2013, at 11:40 PM, Suneel Marthi 
> > wrote:
> >
> > > Well I guess its maven 2.2.1 then, upgrade to maven 3+ and give it a
> > shot.
> > >
> > >
> > >
> > >
> > > 
> > > From: Philipp Defner 
> > > To: user@mahout.apache.org; Suneel Marthi 
> > > Sent: Thursday, April 18, 2013 5:36 PM
> > > Subject: Re: "Error creating assembly archive job: error in opening zip
> > file"
> > >
> > >
> > > I'm using Apache Maven 2.2.1 and I don't think it's losing the
> > > connection because it's the same error across different servers.
> > >
> > >
> > > On Apr 18, 2013, at 11:31 PM, Suneel Marthi 
> > wrote:
> > >
> > >> Usually happens if Maven loses connection to the repo from my
> > >> experience, could you try again?
> > >> Also, are you using Maven 3?
> > >>
> > >>
> > >>
> > >>
> > >> 
> > >> From: Philipp Defner 
> > >> To: user@mahout.apache.org
> > >> Sent: Thursday, April 18, 2013 5:22 PM
> > >> Subject: "Error creating assembly archive job: error in opening zip
> > file"
> > >>
> > >>
> > >> Hello,
> > >>
> > >> I'm having some problems with installing Mahout. I set up a clean
> > >> Ubuntu 12.10 x64 VM with:
> > >>
> > >> java -version
> > >> ==
> > >>
> > >> java 

Re: Boosting User-Based with the user's attributes

2013-04-17 Thread Sean Owen
If all of your similarities are a product like this, then they're all
"low". In a relative sense this is fine.
But this is also why I proposed a geometric mean instead. For example
the geometric mean of these is about 0.424 and this notion can be
extended to include weights as well, which is what may make it
particularly interesting to you since you mentioned weighting.
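
Concretely, the weighted geometric mean referred to is (notation illustrative):

    \Bigl( \prod_i s_i^{w_i} \Bigr)^{1/\sum_i w_i}

so for s_1 = 0.9 and s_2 = 0.2 with equal weights, sqrt(0.9 * 0.2) =
sqrt(0.18) ~= 0.424.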

On Wed, Apr 17, 2013 at 3:56 PM, Agata Filiana  wrote:
> Just a thought: when you say to combine the metrics by multiplying them,
> take for example Sim1 = 0.9 and Sim2 = 0.2.
> Multiplying them gives a result of 0.18, which is very low, remembering
> that they are pretty "similar" based on Sim1 - how can this problem be
> tackled?
>
> *
>
> Agata Filiana
> Erasmus Mundus DMKM Student 2011-2013 <http://www.em-dmkm.eu/>
> *
>
>
> On 16 April 2013 16:41, Agata Filiana  wrote:
>
>> Thanks a lot for the insight,very useful!
>>
>>
>> *
>>
>> Agata Filiana
>> Erasmus Mundus DMKM Student 2011-2013 <http://www.em-dmkm.eu/>
>> *
>>
>>
>> On 16 April 2013 16:40, Sean Owen  wrote:
>>
>>> Of course it's not meaningless. They provide a basis for ranking
>>> items, so you can return top-K recommendations.
>>> If it's normally based on similarity and ratings -- and you have no
>>> ratings -- similarity is of course the only thing you can base the
>>> result on.
>>>
>>> On Tue, Apr 16, 2013 at 3:36 PM, Agata Filiana 
>>> wrote:
>>> > Well right now, I am only using one boolean file - just from this
>>> > history of reading.
>>> > So you are saying the values generated in
>>> > the GenericBooleanPrefUserBasedRecommender is actually useless in this
>>> case
>>> > of no ratings and that it is merely based on the similarity only?
>>>
>>
>>


Re: Boosting User-Based with the user's attributes

2013-04-16 Thread Sean Owen
Of course it's not meaningless. They provide a basis for ranking
items, so you can return top-K recommendations.
If it's normally based on similarity and ratings -- and you have no
ratings -- similarity is of course the only thing you can base the
result on.

On Tue, Apr 16, 2013 at 3:36 PM, Agata Filiana  wrote:
> Well right now, I am only using one boolean file - just from this
> history of reading.
> So you are saying the values generated in
> the GenericBooleanPrefUserBasedRecommender is actually useless in this case
> of no ratings and that it is merely based on the similarity only?


Re: Boosting User-Based with the user's attributes

2013-04-16 Thread Sean Owen
In the usual recommender, the output is a weighted average of ratings.
In a model where there are no ratings, this has no meaning --
everything is "1" implicitly. So the output is something else, and
here it's a sum of similarities actually.

On Tue, Apr 16, 2013 at 3:05 PM, Agata Filiana  wrote:
> Sorry my mistake!
> LogLikelihoodSimilarity is giving me [0,1], however when I
> call GenericBooleanPrefUserBasedRecommender for the recommendation it is
> not giving me values [0,1]. That's what I meant.
>


Re: Boosting User-Based with the user's attributes

2013-04-16 Thread Sean Owen
That shouldn't be possible, are you sure? it's 1 - 1/(1+LLR) where LLR
is nonnegative.
Similarities are in [-1,1] and some are in [0,1].

On Tue, Apr 16, 2013 at 2:51 PM, Agata Filiana  wrote:
> Hi Sean,
>
> I see your point.
> I think I better experiment with those different options.
>
> I'd also like to ask if the result of LogLikelihoodSimilarity is between
> [0,1] ? It seems that I'm getting results higher than 1. So if like you
> said combining the different attributes can be done by multiplying them and
> normalizing them to [0,1] - what is the best method for normalization?
>


Re: Boosting User-Based with the user's attributes

2013-04-16 Thread Sean Owen
Broadly the idea makes sense, but I think this is getting into hacking
heuristics together without a lot of principle. The result will
probably work, and you can just proceed as you say -- make up some
weights and use them to weight the various similarities. If you are
using the product of similarity values, you can compute something like
a weighted geometric mean.
https://en.wikipedia.org/wiki/Geometric_mean

A step in a more principled direction is to consider these various
things as "items" -- things you read, hobbies you engage in, interests
you have. Then create a recommender on top of all of these things,
weighting the input differently. The often-mentioned ALS-WR is one of
several processes that fit, since it has an explicit notion of input
weight.
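
A minimal sketch of the first option -- a custom UserSimilarity combining two
metrics as a weighted geometric mean. The class name and weighting scheme are
illustrative, not an existing Mahout class, and it assumes both inner
similarities return nonnegative values (true of log-likelihood, not of all
metrics):

import java.util.Collection;
import org.apache.mahout.cf.taste.common.Refreshable;
import org.apache.mahout.cf.taste.common.TasteException;
import org.apache.mahout.cf.taste.similarity.PreferenceInferrer;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public final class WeightedGeometricMeanSimilarity implements UserSimilarity {

  private final UserSimilarity first;
  private final UserSimilarity second;
  private final double w1;
  private final double w2;

  public WeightedGeometricMeanSimilarity(UserSimilarity first, double w1,
                                         UserSimilarity second, double w2) {
    this.first = first;
    this.second = second;
    this.w1 = w1;
    this.w2 = w2;
  }

  @Override
  public double userSimilarity(long userID1, long userID2) throws TasteException {
    double s1 = first.userSimilarity(userID1, userID2);
    double s2 = second.userSimilarity(userID1, userID2);
    // Weighted geometric mean: (s1^w1 * s2^w2)^(1/(w1+w2)); assumes s1, s2 >= 0
    return Math.pow(Math.pow(s1, w1) * Math.pow(s2, w2), 1.0 / (w1 + w2));
  }

  @Override
  public void setPreferenceInferrer(PreferenceInferrer inferrer) {
    first.setPreferenceInferrer(inferrer);
    second.setPreferenceInferrer(inferrer);
  }

  @Override
  public void refresh(Collection<Refreshable> alreadyRefreshed) {
    first.refresh(alreadyRefreshed);
    second.refresh(alreadyRefreshed);
  }
}

Here w1 > w2 would let the first metric (say, reading history) dominate, which
matches the "based on history, boosted by interests and hobbies" idea.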


On Tue, Apr 16, 2013 at 11:24 AM, Agata Filiana  wrote:
> Hi,
>
> Continuing this discussion - I have the implementation, but I'd like to
> know your opinion.
> As I said before, I am creating a new implementation of UserSimilarity as
> Sean pointed out.
> Does it make sense to put weights into these metrics? Say I combined 3
> similarity metrics: reading history, hobbies and interests.
> I would like my recommender to be "based" on history but boosted with
> hobbies and interests at different weights, for example with interests
> more important than hobbies.
>
> Does that make sense? And how would you go about implementing it, if it
> does make sense?
>
> Thank you again!
>
>
> *
>
> Agata Filiana
> Erasmus Mundus DMKM Student 2011-2013 <http://www.em-dmkm.eu/>
> *
>
>
> On 19 March 2013 12:03, Agata Filiana  wrote:
>
>> Ok, I will try that.
>>
>> Thanks for the help Sean!
>>
>>
>> On 19 March 2013 12:02, Sean Owen  wrote:
>>
>>> Write a new implementation of UserSimilarity that internally calls 2 other
>>> similarity metrics with the same arguments when asked for a similarity.
>>> Return their product.
>>>
>>>
>>> On Tue, Mar 19, 2013 at 6:59 AM, Agata Filiana >> >wrote:
>>>
>>> > I understand that, I guess what I am confused about is the implementation of
>>> > merging the two similarity metrics in code. For example I apply
>>> > LogLikelihoodSimilarity for both item and hobby, and I have 2
>>> > UserSimilarity metrics. Then from there I am unsure of how to combine
>>> the
>>> > two.
>>> >
>>> >
>>>
>>
>>
>>
>> --
>> *Agata Filiana
>> *
>>


Re: log-likelihood ratio value in item similarity calculation

2013-04-12 Thread Sean Owen
Yes that's true, it is more usually bits. Here it's natural log / nats.
Since it's unnormalized anyway another constant factor doesn't hurt and it
means not having to change the base.


On Fri, Apr 12, 2013 at 8:01 AM, Phoenix Bai  wrote:

> I got 168, because I use log base 2 instead of e.
> If memory serves right, I read in the definition of entropy that people
> normally use base 2, so I just assumed it was 2 in the code. (my bad)
>
> And now I have a better understanding, so thank you both for the
> explanation.
>
>


Re: log-likelihood ratio value in item similarity calculation

2013-04-11 Thread Sean Owen
Yes I also get (er, Mahout gets) 117 (116.69), FWIW.

I think the second question concerned counts vs relative frequencies
-- normalized, or not. Like whether you divide all the counts by their
sum or not. For a fixed set of observations that does change the LLR
because it is unnormalized, not because the situation has changed.

Obviously you're right that the changing situations you describe do
entail a change in LLR!
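
To reproduce the 117 figure directly, one can call Mahout's LLR helper on the
counts quoted below (a sketch; it assumes mahout-math on the classpath):

import org.apache.mahout.math.stats.LogLikelihood;

long k11 = 7;      // A and B together
long k12 = 8;      // B without A
long k21 = 13;     // A without B
long k22 = 300000; // neither
double llr = LogLikelihood.logLikelihoodRatio(k11, k12, k21, k22); // ~116.7
double similarity = 1.0 - 1.0 / (1.0 + llr);                       // ~0.99

Note 168 * ln 2 ~= 116.5, which reconciles the base-2 and natural-log numbers
in this thread up to rounding.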

On Thu, Apr 11, 2013 at 10:52 PM, Ted Dunning  wrote:
> These numbers don't match what I get.
>
> I get LLR = 117.
>
> This is wildly anomalous so this pair should definitely be connected.  Both
> items are quite rare (15/300,000 or 20/300,000 rates) but they occur
> together most of the time that they appear.
>
>
>
> On Wed, Apr 10, 2013 at 2:15 AM, Phoenix Bai  wrote:
>
>> Hi,
>>
>> the counts for two events are:
>>                      Event A     Everything but A
>> Event B              k11=7       k12=8
>> Everything but B     k21=13      k22=300,000
>> according to the code, I will get:
>>
>> rowEntropy = entropy(7,8) + entropy(13, 300,000) = 222
>> colEntropy = entropy(7,13) + entropy(8, 300,000) = 152
>> matrixEntropy = entropy(7,8,13,300,000) = 458
>>
>> thus,
>>
>> LLR=2.0*(458-222-152) = 168
>> similarityScore = 1 - 1/(1+168) = 0.994
>>
>> So, my problem is,
>> the similarity scores I get for all the items are all this high, which
>> makes it hard to identify the really similar ones.
>>
>> As you can see, the counts of event A, and B are quite small while the
>> total count for k22 is quite high. And this phenomenon is quite common in
>> my dataset.
>>
>> So, my question is,
>> what kind of adjustment could I make to bring the similarity scores into a
>> more reasonable range?
>>
>> Please shed some light, thanks in advance!
>>


Re: Is Mahout the right tool to recommend cross sales?

2013-04-11 Thread Sean Owen
You can actually create a "user" #6 for your new order. Or you can use
the "anonymous user" function of the library, although it's hacky.

We may be mixing up terms here. "DataModel" is a class that has
nothing to do with Hadoop. Hadoop in turn has no part in real-time
anything, like recommending to a brand-new "user". However it could
build an offline model of item-item similarities and you could do
something like a most-similar-items computation for a given new basket
of goods. That is effectively what the "anonymous user" function is
doing anyway.

You can precompute all recommendations for all items but that's a lot
of work! It's easy to get away with it with a thousand items, but with
a million this may be infeasibly slow.
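
A hedged sketch of the "anonymous user" route: PlusAnonymousUserDataModel is
real Taste API, but the exact setter names and the boolean preference array
here are from memory, so verify against your Mahout version; the file name
and item IDs are made up.

import java.io.File;
import java.util.List;
import org.apache.mahout.cf.taste.impl.model.BooleanUserPreferenceArray;
import org.apache.mahout.cf.taste.impl.model.PlusAnonymousUserDataModel;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.recommender.GenericItemBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.LogLikelihoodSimilarity;
import org.apache.mahout.cf.taste.model.PreferenceArray;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;

// Historical order lines: orderID,itemID
PlusAnonymousUserDataModel plusModel =
    new PlusAnonymousUserDataModel(new FileDataModel(new File("orderLines.csv")));

// The new, not-yet-stored basket {1, 7} as a temporary user
PreferenceArray basket = new BooleanUserPreferenceArray(2);
basket.setUserID(0, PlusAnonymousUserDataModel.TEMP_USER_ID);
basket.setItemID(0, 1L);
basket.setItemID(1, 7L);
plusModel.setTempPrefs(basket);

GenericItemBasedRecommender rec =
    new GenericItemBasedRecommender(plusModel, new LogLikelihoodSimilarity(plusModel));
List<RecommendedItem> recs =
    rec.recommend(PlusAnonymousUserDataModel.TEMP_USER_ID, 5);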

On Thu, Apr 11, 2013 at 10:38 PM, Billy  wrote:
> As in the example data 'intro.csv' in MIA, it has users 1-5, so if I ask
> for recommendations for user 1 this works, but if I ask for
> recommendations for user 6 (a new user yet to be added to the data model)
> then I get no recommendations ... so if I substitute users for orders then
> again I will get no recommendations ... which I sort of understand. So do I
> need to inject my 'new' active order, along with its attached item/s, into
> the data model first and then ask for recommendations for the order by
> offering up the new orderId? Or is there a way of merely offering up an
> 'item' and getting recommendations based on the item alone, using the
> data already stored and the relationships with my item?
>
> My assumptions:
> #1
> I am assuming the data model is a static island of data that has been
> processed (flattened) overnight (most probably by a Hadoop process) due to
> the size of this data ... rather than a living document that is updated as
> soon as new data is available.
> #2
> I'm also assuming that instead of reading in the data model and
> providing recommendations 'on the fly' I will have to run through every item
> in my catalogue and find out the top 5 recommended items that are ordered
> with each item (most probably via a Hadoop process) and then store this
> output in DynamoDB or Lucene for quick access.
>
> Sorry for all the questions but it's such an interesting subject.
>
>
> On 11 April 2013 22:04, Ted Dunning  wrote:
>
>> Actually, making this user based is a really good thing because you get
>> recommendations from one session to the next.  These may be much more
>> valuable for cross-sell than things in the same order.
>>
>>
>> On Thu, Apr 11, 2013 at 12:50 PM, Sean Owen  wrote:
>>
>>> You can try treating your orders as the 'users'. Then just compute
>>> item-item similarities per usual.
>>>
>>> On Thu, Apr 11, 2013 at 7:59 PM, Billy  wrote:
>>> > Thanks for replying,
>>> >
>>> >
>>> > I don't have users, well I do :-) but in this case it should not
>>> > influence the recommendations; these need to be based on the relationship
>>> > between "items ordered with other items in the 'same order'".
>>> >
>>> > E.g. if item #1 has been ordered with item #4 [22] times and item #1 has
>>> > been ordered with item #9 [57] times, then if I added item #1 to my order
>>> > these would both be recommended, but item #9 would be recommended above
>>> > item #4, purely based on the fact that the relationship between item #1
>>> > and item #9 is greater than the relationship with item #4.
>>> >
>>> > What I don't want is: if a user ordered items #A, #B, #C separately 'at
>>> > some point in their order history', then recommend #A and #C to other
>>> > users who order #B ... I still don't want this if the items are similar
>>> > and/or the users similar.
>>> >
>>> > Cheers
>>> >
>>> > Billy
>>> >
>>> >
>>> >
>>> > On 11 Apr 2013 18:28, "Sean Owen"  wrote:
>>> >>
>>> >> This sounds like just a most-similar-items problem. That's good news
>>> >> because that's simpler. The only question is how you want to compute
>>> >> item-item similarities. That could be based on use

Re: Is Mahout the right tool to recommend cross sales?

2013-04-11 Thread Sean Owen
You can try treating your orders as the 'users'. Then just compute
item-item similarities per usual.
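
A minimal sketch of that idea with the non-distributed Taste API (file name
and IDs are hypothetical; each line of the file is orderID,itemID, i.e. a
boolean preference with the order playing the role of the user):

import java.io.File;
import java.util.List;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.recommender.GenericItemBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.LogLikelihoodSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;

DataModel model = new FileDataModel(new File("orderLines.csv"));
GenericItemBasedRecommender rec =
    new GenericItemBasedRecommender(model, new LogLikelihoodSimilarity(model));
// top 5 items most associated with item #1 across all orders
List<RecommendedItem> crossSells = rec.mostSimilarItems(1L, 5);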

On Thu, Apr 11, 2013 at 7:59 PM, Billy  wrote:
> Thanks for replying,
>
>
> I don't have users, well I do :-) but in this case it should not influence
> the recommendations; these need to be based on the relationship between
> "items ordered with other items in the 'same order'".
>
> E.g. if item #1 has been ordered with item #4 [22] times and item #1 has been
> ordered with item #9 [57] times, then if I added item #1 to my order these
> would both be recommended, but item #9 would be recommended above item #4,
> purely based on the fact that the relationship between item #1 and item #9 is
> greater than the relationship with item #4.
>
> What I don't want is: if a user ordered items #A, #B, #C separately 'at some
> point in their order history', then recommend #A and #C to other users who
> order #B ... I still don't want this if the items are similar and/or the
> users similar.
>
> Cheers
>
> Billy
>
>
>
> On 11 Apr 2013 18:28, "Sean Owen"  wrote:
>>
>> This sounds like just a most-similar-items problem. That's good news
>> because that's simpler. The only question is how you want to compute
>> item-item similarities. That could be based on user-item interactions.
>> If you're on Hadoop, try the RowSimilarityJob (where you will need
>> rows to be items, columns the users).
>>
>> On Thu, Apr 11, 2013 at 6:11 PM, Billy  wrote:
>> > I am very new to Mahout and have currently just read up to chapter 5 of
>> > 'MIA', but after reading about the various user-centric and item-centric
>> > recommenders they all seem to still need a userId, so I'm still unsure
>> > whether Mahout can help with a fairly common recommendation.
>> >
>> > My requirement is to produce 'n' item recommendations based on a chosen
>> > item.
>> >
>> > E.g. "if I've added item #1 to my order then based on all the
>> > other items; in all the other orders for this site, what are the
>> > likely items that I may also want add to my order based; on the item to
>> > item relationship in the history of orders of this site?"
>> >
>> > Most probably using the most popular relationship between the item I have
>> > chosen and all the items in all the other orders.
>> >
>> > My data is not 'user' specific, and I don't think it should be, but more
>> > like order specific, as it's the pattern of items in each order that
>> > should determine the recommendation.
>> >
>> > I have no preference values so merely boolean preferences will be used.
>> >
>> > If Mahout can perform these calculations then how must I present the
>> > data?
>> >
>> > Will I need to shape the data in some way to feed into Mahout? (I'm
>> > currently versed in using Hadoop via AWS EMR, in Java.)
>> >
>> > Thanks for the advice in advance,
>> >
>> > Billy

