Re: [GSOC] 2010 Timelines

2010-04-09 Thread Isabel Drost

Timeline including Apache internal deadlines:

http://cwiki.apache.org/confluence/display/COMDEVxSITE/GSoC

Mentors, please also follow the link to the ranking explanation [1] for more
information on how to rank student proposals.

Isabel

[1] 
http://cwiki.apache.org/confluence/display/COMDEVxSITE/Mentee+Ranking+Process




[jira] Created: (MAHOUT-372) Partitioning Collaborative Filtering Job into Maps and Reduces

2010-04-09 Thread Kris Jack (JIRA)
Partitioning Collaborative Filtering Job into Maps and Reduces
--

 Key: MAHOUT-372
 URL: https://issues.apache.org/jira/browse/MAHOUT-372
 Project: Mahout
  Issue Type: Question
  Components: Collaborative Filtering
Affects Versions: 0.4
 Environment: Ubuntu Koala
Reporter: Kris Jack


I am running the org.apache.mahout.cf.taste.hadoop.item.RecommenderJob main on 
my Hadoop cluster and it partitions the job into 2 although I have more than 2 
nodes available.  I was reading that the partitioning could be changed by 
setting the JobConf's conf.setNumMapTasks(int num) and 
conf.setNumReduceTasks(int num).

Would I be right in assuming that this would speed up the processing by 
increasing these, say, to 4?  Can this code be partitioned into many reducers?  
If so, would setting them in the protected AbstractJob::JobConf 
prepareJobConf() function be appropriate?
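
Something like this is what I have in mind (just a sketch of the kind of change, 
using the old mapred JobConf API; I have not verified that prepareJobConf() is 
the right place to do it):

{code}
JobConf conf = prepareJobConf();
// setNumMapTasks() is only a hint -- the actual number of maps is driven by
// the input splits -- but setNumReduceTasks() is honoured directly.
conf.setNumMapTasks(4);
conf.setNumReduceTasks(4);
{code}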

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: Mahout GSoC 2010 proposal: Association Mining

2010-04-09 Thread Lukáš Vlček
Robin,

I think it does not make sense for me to catch up with the GSoC timeline now as
I am quite busy with other stuff. However, I will develop the proposal for
Association Mining (or GUHA if you like) and keep this discussion going.
I am really interested in contributing some implementation to Mahout, but as
of now the GSoC timeline is not of any help to me.

Let me look at this in detail and I will get back to the Mahout community with
more details.

As for the use cases for association mining, many examples can be found in the
literature. When it comes to missing or negative attributes of the data (of the
transaction) I think there are a lot of examples as well. One example would be
click-stream analysis, where you can learn that people who visit a negative
comment on a product blog never reach the order form. I am not saying this is
the best example, but in general this is the essence of it: you simply need to
take all possible values from the transaction into account, even those missing
from the market basket. I also remember that Simpson's paradox is often
referred to in connection with this issue. As for GUHA, its power is that it
has a well-developed theoretical background. For example, it provides a
formalized framework for various analyses that probably originate in psychology
and similar soft sciences, and association-like functions between data
attributes can be expressed using 4ft table members and user thresholds.
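
To make the 4ft table idea concrete, here is a tiny sketch (the names are mine,
just for illustration) of how one GUHA-style association function could be
evaluated from the four frequencies of the table:

  // 4ft (four-fold) table for antecedent A and succedent S over the data:
  //              S        not S
  //   A          a          b
  //   not A      c          d
  //
  // Example quantifier, "founded implication": it holds when the confidence
  // a / (a + b) reaches a user threshold p and the support a reaches a user
  // threshold base. (This particular quantifier ignores c and d; other GUHA
  // quantifiers, e.g. the statistically based ones, use the full table.)
  static boolean foundedImplication(long a, long b, long c, long d,
                                    double p, long base) {
    return a >= base && a >= p * (a + b);
  }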

The biggest challenge in implementing this would be the fact that the
analysis has to deal with all the data (not just the most frequent patterns)
and all their combinations. It is very resource-intensive.

I am thinking of starting a wiki page for Mahout about association mining
... this could help, what do you think?

Regards,
Lukas

On Tue, Apr 6, 2010 at 12:01 AM, Robin Anil robin.a...@gmail.com wrote:

 PS: The current TopK FPGrowth is pretty tightly coupled, but it can easily be
 refactored out; even a vanilla implementation of FPGrowth is not so difficult
 to re-create by re-using the existing methods.

 Robin


 On Tue, Apr 6, 2010 at 3:29 AM, Robin Anil robin.a...@gmail.com wrote:

   Hi Lukas,
   Sorry for being late in getting back to you on this.
   Association rule mining is a great addition to FPGrowth. I am not sure I
   understand the GUHA method well, but then again I understood Ted's LLR after
   some deep reading. Could you put up an interesting example to help us
   understand this method? Maybe starting from a transaction of shopping cart
   items? A great demo is a big plus for a GSoC project.
 
  Robin
 
 
  On Mon, Mar 29, 2010 at 1:46 AM, Lukáš Vlček lukas.vl...@gmail.com
 wrote:
 
  Hello,
 
   I would like to apply for Mahout GSoC 2010. My proposal is to implement an
   Association Mining algorithm utilizing the existing PFPGrowth implementation
   (http://cwiki.apache.org/MAHOUT/parallelfrequentpatternmining.html).

   As for the Association Mining, I would like to implement a very general
   algorithm based on an old and established method called GUHA [1]. A short
   and informative description of the GUHA method can be found here [2]. A very
   exhaustive theoretical foundation can be found here [3]. Since the GUHA
   method has been developing since the 60's and is very rich now, and given
   the limited time during the GSoC program, it would be wise to reduce the
   scope of the initial implementation.

   There are two main topics that I would like to focus on during GSoC:

   1) Enhancing and generalizing PFPGrowth:
   Frequent pattern mining is usually part of each association mining task. In
   Mahout there is an existing implementation of the PFPGrowth algorithm. I
   would like to utilize this implementation, but it would be necessary to
   enhance or generalize it first. The goal here would be to keep the current
   functionality and performance of PFPGrowth but allow more generalized
   input/output data and conditions. We will need to get frequencies of very
   rare patterns, thus if it turns out that mining only the top K is a limiting
   factor then we would need to allow relaxation of this approach. Also, we
   will need to take into account negative features (those missing in an
   individual transaction). Negative features can be either directly supported
   by the implementation of the FP algorithm or coded at the feature level
   (deciding which is the better approach will be part of the GSoC program).
   It should be noted that for the GSoC program we will narrow the scope of
   association mining antecedents and succedents to conjunctions of data
   features only.

   2) API for custom association mining functions based on the 4ft table:
   Association mining in the GUHA sense means testing hypotheses on a four-fold
   table (4ft) [see 2, item 5]. A lot of association functions have been
   proposed for GUHA; some of them are based on statistical tests (for example
   the Fisher test or the Chi-square test), some are based on frequency
   analysis while not explicitly referring to statistical tests, but in both

Re: Mahout GSoC 2010 proposal: Association Mining

2010-04-09 Thread Robin Anil
Hi Lukáš,
It would have been great if you could participate in GSoC, and
there is still time left. You still have your proposal in the GSoC system.
Take your time to decide, but if you choose not to participate, do remove the
application from the SoC website.

A wiki page for association mining would be a good start. The pattern mining
package needs to grow beyond just FPGrowth. Resource-intensive operations on
large datasets using Hadoop are what Mahout should do best. I can help with the
code as much as you like to make it more generic and suitable for association
mining.

Regards
Robin

On Fri, Apr 9, 2010 at 4:56 PM, Lukáš Vlček lukas.vl...@gmail.com wrote:

 Robin,

 I think it does not make sense for me to catch up with the GSoC timeline now
 as I am quite busy with other stuff. However, I will develop the proposal for
 Association Mining (or GUHA if you like) and keep this discussion going.
 I am really interested in contributing some implementation to Mahout, but as
 of now the GSoC timeline is not of any help to me.

 Let me look at this in detail and I will get back to the Mahout community with
 more details.

 As for the use cases for association mining, many examples can be found in the
 literature. When it comes to missing or negative attributes of the data (of
 the transaction) I think there are a lot of examples as well. One example
 would be click-stream analysis, where you can learn that people who visit a
 negative comment on a product blog never reach the order form. I am not saying
 this is the best example, but in general this is the essence of it: you simply
 need to take all possible values from the transaction into account, even those
 missing from the market basket. I also remember that Simpson's paradox is
 often referred to in connection with this issue. As for GUHA, its power is
 that it has a well-developed theoretical background. For example, it provides
 a formalized framework for various analyses that probably originate in
 psychology and similar soft sciences, and association-like functions between
 data attributes can be expressed using 4ft table members and user thresholds.

 The biggest challenge in implementing this would be the fact that the
 analysis has to deal with all the data (not just the most frequent patterns)
 and all their combinations. It is very resource-intensive.

 I am thinking of starting a wiki page for Mahout about association mining
 ... this could help, what do you think?

 Regards,
 Lukas

 On Tue, Apr 6, 2010 at 12:01 AM, Robin Anil robin.a...@gmail.com wrote:

  PS: The current TopK FPGrowth is pretty tightly coupled, but it can easily be
  refactored out; even a vanilla implementation of FPGrowth is not so difficult
  to re-create by re-using the existing methods.
 
  Robin
 
 
  On Tue, Apr 6, 2010 at 3:29 AM, Robin Anil robin.a...@gmail.com wrote:
 
    Hi Lukas,
    Sorry for being late in getting back to you on this.
    Association rule mining is a great addition to FPGrowth. I am not sure I
    understand the GUHA method well, but then again I understood Ted's LLR
    after some deep reading. Could you put up an interesting example to help us
    understand this method? Maybe starting from a transaction of shopping cart
    items? A great demo is a big plus for a GSoC project.
  
   Robin
  
  
   On Mon, Mar 29, 2010 at 1:46 AM, Lukáš Vlček lukas.vl...@gmail.com
  wrote:
  
   Hello,
  
    I would like to apply for Mahout GSoC 2010. My proposal is to implement an
    Association Mining algorithm utilizing the existing PFPGrowth
    implementation
    (http://cwiki.apache.org/MAHOUT/parallelfrequentpatternmining.html).

    As for the Association Mining, I would like to implement a very general
    algorithm based on an old and established method called GUHA [1]. A short
    and informative description of the GUHA method can be found here [2]. A
    very exhaustive theoretical foundation can be found here [3]. Since the
    GUHA method has been developing since the 60's and is very rich now, and
    given the limited time during the GSoC program, it would be wise to reduce
    the scope of the initial implementation.

    There are two main topics that I would like to focus on during GSoC:

    1) Enhancing and generalizing PFPGrowth:
    Frequent pattern mining is usually part of each association mining task. In
    Mahout there is an existing implementation of the PFPGrowth algorithm. I
    would like to utilize this implementation, but it would be necessary to
    enhance or generalize it first. The goal here would be to keep the current
    functionality and performance of PFPGrowth but allow more generalized
    input/output data and conditions. We will need to get frequencies of very
    rare patterns, thus if it turns out that mining only the top K is a
    limiting factor then we would need to allow relaxation of this approach.
    Also, we will need to take into account negative features (those missing in
    an individual transaction). Negative

[jira] Resolved: (MAHOUT-372) Partitioning Collaborative Filtering Job into Maps and Reduces

2010-04-09 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved MAHOUT-372.
--

   Resolution: Fixed
Fix Version/s: 0.4
 Assignee: Sean Owen

Yes, sure there's no particular limit to the number of mappers or reducers. 

These are Hadoop params, which you can set on the command line with, for 
example:
-Dmapred.map.tasks=10 -Dmapred.reduce.tasks=10

Reopen if that doesn't quite answer the question. (We can also discuss on 
mahout-u...@apache.org, perhaps, if this isn't necessarily a bug or enhancement 
request.)
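
For instance, a full invocation might look like this (the jar path and the 
remaining options are just placeholders here):

hadoop jar <path-to-mahout-job-jar> org.apache.mahout.cf.taste.hadoop.item.RecommenderJob \
  -Dmapred.map.tasks=10 -Dmapred.reduce.tasks=10 <the usual RecommenderJob options>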

 Partitioning Collaborative Filtering Job into Maps and Reduces
 --

 Key: MAHOUT-372
 URL: https://issues.apache.org/jira/browse/MAHOUT-372
 Project: Mahout
  Issue Type: Question
  Components: Collaborative Filtering
Affects Versions: 0.4
 Environment: Ubuntu Koala
Reporter: Kris Jack
Assignee: Sean Owen
 Fix For: 0.4


 I am running the org.apache.mahout.cf.taste.hadoop.item.RecommenderJob main 
 on my Hadoop cluster and it partitions the job into 2 although I have more 
 than 2 nodes available.  I was reading that the partitioning could be changed 
 by setting the JobConf's conf.setNumMapTasks(int num) and 
 conf.setNumReduceTasks(int num).
 Would I be right in assuming that this would speed up the processing by 
 increasing these, say, to 4?  Can this code be partitioned into many 
 reducers?  If so, would setting them in the protected AbstractJob::JobConf 
 prepareJobConf() function be appropriate?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (MAHOUT-372) Partitioning Collaborative Filtering Job into Maps and Reduces

2010-04-09 Thread Kris Jack (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12855381#action_12855381
 ] 

Kris Jack commented on MAHOUT-372:
--

Thanks for your reply.  I'll run it using the command line parameters and 
hopefully get it working faster.  Thanks for letting me know also about the 
other mailing list, I'll use that in the future for such questions.

 Partitioning Collaborative Filtering Job into Maps and Reduces
 --

 Key: MAHOUT-372
 URL: https://issues.apache.org/jira/browse/MAHOUT-372
 Project: Mahout
  Issue Type: Question
  Components: Collaborative Filtering
Affects Versions: 0.4
 Environment: Ubuntu Koala
Reporter: Kris Jack
Assignee: Sean Owen
 Fix For: 0.4


 I am running the org.apache.mahout.cf.taste.hadoop.item.RecommenderJob main 
 on my Hadoop cluster and it partitions the job into 2 although I have more 
 than 2 nodes available.  I was reading that the partitioning could be changed 
 by setting the JobConf's conf.setNumMapTasks(int num) and 
 conf.setNumReduceTasks(int num).
 Would I be right in assuming that this would speed up the processing by 
 increasing these, say, to 4?  Can this code be partitioned into many 
 reducers?  If so, would setting them in the protected AbstractJob::JobConf 
 prepareJobConf() function be appropriate?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (MAHOUT-373) VectorDumper/VectorHelper doesn't dump values when dictionary is present

2010-04-09 Thread Drew Farris (JIRA)
VectorDumper/VectorHelper doesn't dump values when dictionary is present


 Key: MAHOUT-373
 URL: https://issues.apache.org/jira/browse/MAHOUT-373
 Project: Mahout
  Issue Type: Improvement
  Components: Utils
Affects Versions: 0.3
Reporter: Drew Farris
Priority: Minor


Dumping term vectors with a dictionary using:

{{mahout vectordump -s vector-output/chunk-0 -dt sequencefile -d 
dictionary-output}}

gives me output like the following with no values, just the indexes expanded 
into their dictionary entries:


{code}
Name: 0001-11055 elts: {513:discard, 7199:empty,...
{code}

While dumping the same vector without a dictionary using

{{mahout vectordump -s vector-output/chunk-0}}

gives me output that includes indexes and values:

{code}
Name: 0001-11055 elts: {513:1.0, 7199:1.0...
{code}


Would it make sense for the dictionary-based output to include values as well? 
Anyone opposed to modifying VectorHelper.vectorToString(Vector, String[]) to do 
so?
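
Something along these lines, perhaps (a rough, untested sketch that glosses over 
imports and the existing formatting details):

{code}
public static String vectorToString(Vector vector, String[] dictionary) {
  StringBuilder result = new StringBuilder();
  result.append("elts: {");
  Iterator<Vector.Element> iter = vector.iterateNonZero();
  while (iter.hasNext()) {
    Vector.Element e = iter.next();
    // emit term:value instead of just the term when a dictionary is present
    String term = dictionary == null ? String.valueOf(e.index()) : dictionary[e.index()];
    result.append(term).append(':').append(e.get());
    if (iter.hasNext()) {
      result.append(", ");
    }
  }
  return result.append('}').toString();
}
{code}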

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



GSOC Create Sql adapters proposal

2010-04-09 Thread Necati Batur
Hi,

Create adapters for MySQL and NoSQL stores (HBase, Cassandra) to access data for
all the algorithms to use.

Necati Batur ; necatiba...@gmail.com



Mahout / Mahout - 332 : Assigned Mentor is Robin Anil



Proposal Abstract:

It would be useful to use Thrift as the protocol for talking to the NoSQL
systems, as opposed to their native APIs, so that a nice abstraction could be
made for all the NoSQL systems in general and specific Thrift client
implementations could be added to maximize code reuse. Even if someone were to
make the port for one NoSQL client, having the demarcation would help others to
pick it up and port it.

Detailed Description:

The data adapters for the higher-level languages will require good command of
data structures and some knowledge of finite mathematics, and I am confident on
those issues. Also, the code given in the svn repository seems to need some
improvements as well as documentation.

Briefly, I would carry out the following steps for this project:



   1. Understand the underlying maths for the adapters
   2. Determine the data structures that would be used for the adapters
   3. Implement the necessary methods to handle creation of these structures
   4. Write the test cases we will probably need to check whether our code
   covers all the issues required by data retrieval operations
   5. Iterate on the code to make the algorithms more robust
   6. Document the overall project to tie this particular project into the
   overall scope

Additional Information:

First of all, I am very excited to join a program like GSoC and, most
importantly, to work for a big open source project like Apache. I am looking
for good collaboration and new challenges in software development. Information
management issues in particular sound great to me, and I am confident working
with new technologies. I took the Data Structures I and II courses at
university, so I am comfortable with data structures. Most importantly, I am
interested in databases. From my software engineering course experience I know
how to work on a project with iterative development and timelining.


[jira] Created: (MAHOUT-374) GSOC 2010 Proposal Implement Map/Reduce Enabled Neural Networks (mahout-342)

2010-04-09 Thread Yinghua Hu (JIRA)
GSOC 2010 Proposal Implement Map/Reduce Enabled Neural Networks (mahout-342) 
-

 Key: MAHOUT-374
 URL: https://issues.apache.org/jira/browse/MAHOUT-374
 Project: Mahout
  Issue Type: New Feature
  Components: Classification
 Environment: Ubuntu 9.10
Reporter: Yinghua Hu




* Introduction

In recent years, multicore computers have become mainstream. However, the 
potential of multicore has not been fully exploited for machine learning because 
of the lack of a good programming framework for multicore. Recently, Chu et al. 
[CHU07] adapted Google's map-reduce paradigm [DEA04] and implemented 10 different 
machine learning algorithms on multicore processors. Their results show an almost 
twofold speedup on average with a dual processor. Their work has inspired a lot 
of interesting projects under Mahout at the Apache Software Foundation.

An artificial neural network (ANN) is a supervised learning tool inspired by 
the biological nervous system. It can capture and represent complex 
input/output relationships. A neural network has the capability to represent 
both linear and non-linear relationships by learning them directly from the 
data being modeled. Artificial neural networks have been widely applied for 
classification, data processing and function approximation.

We propose to implement an artificial neural network with back-propagation 
under the map-reduce framework. A successful delivery should include a fast 
neural network customized for multicore computers.



* Methodology

In this section, we briefly introduce map-reduce and the back-propagation 
neural network.

- Map-reduce

Map-reduce [DEA04] is a programming model developed by Google. It gets its name 
from the map and reduce primitives in functional languages such as Lisp. Its 
inventors realized that most of their distributed computations involve applying 
a map operation and a reduce operation: the map operation computes a set of 
intermediate key/value pairs for each logical record in the input, and the 
reduce operation is applied to all the values that share the same key in order 
to combine the derived data appropriately. The reduce step allows users to 
handle value lists too large to fit in memory.

Map-reduce makes it possible to have a simple interface enabling automatic 
parallelization and distribution of large-scale computation. The programming 
model can be applied on multicore personal computers as well as on large 
clusters of commodity PCs.

Here is an example of map-reduce from [DEA04]. To count the number of 
occurrences of each word in a large collection of documents, the user can write 
pseudocode like the following:

map(String key, String value):
  // key: document name
  // value: document contents
  for each word w in value:
    EmitIntermediate(w, "1");

reduce(String key, Iterator values):
  // key: a word
  // values: a list of counts
  int result = 0;
  for each v in values:
    result += ParseInt(v);
  Emit(AsString(result));

The map function above emits each word with an associated count (one in this 
example). The reduce function sums together all counts emitted for a particular 
word.
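
For concreteness, the same word-count example written as a minimal Hadoop job in 
Java might look roughly as follows (illustrative only, using the 
org.apache.hadoop.mapreduce API; it is not part of the proposed deliverable):

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      // key: byte offset in the input, value: one line of the document
      StringTokenizer tokens = new StringTokenizer(value.toString());
      while (tokens.hasMoreTokens()) {
        word.set(tokens.nextToken());
        context.write(word, ONE);                 // EmitIntermediate(w, "1")
      }
    }
  }

  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();                           // result += ParseInt(v)
      }
      context.write(key, new IntWritable(sum));   // Emit(AsString(result))
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}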

The map-reduce model has been successfully applied to a large variety of 
problems, such as Google's Web search service, sorting, data mining, etc. It 
helps to reduce the amount of data sent across the network. In addition, its 
ease of use makes parallel and distributed computing achievable even for 
inexperienced programmers.

[CHU07] provide a multicore map-reduce framework that is capable of 
implementing algorithms fitting the Statistical Query Model. The neural network 
is one of the ten machine learning algorithms fitting this requirement.

- Neural Network

The neural network is one of many modern techniques inspired by biology. It is 
a model which can perform simple computations such as linear regression as well 
as very complex nonlinear calculations. Without doubt, the neural network is 
one of the most popular machine learning methodologies.

The simplest artificial neural network is a single-layer perceptron as shown in 
Figure 1.  It contains a single layer of adaptive weights.  The output is 
calculated by applying an activation function f to the weighted sum of inputs:

http://www.cs.ucf.edu/~yhu/gsoc/formula1.JPG (1)
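
(In case the image is not accessible: equation (1) is presumably the standard 
single-layer perceptron output, y = f( sum_j w_j * x_j + w_0 ), where the x_j 
are the inputs, the w_j the adaptive weights and w_0 an optional bias term.)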

http://www.cs.ucf.edu/~yhu/gsoc/fig1.JPG
Figure 1 - Single-layer Perceptron

The limitation of a single-layer perceptron is that it can only model linear 
relationships between an output and the inputs. This is overcome by multi-layer 
perceptrons. Multi-layer perceptrons contain several layers of adaptive 
weights. In Figure 2, a hidden layer is added to the single-layer perceptron of 
Figure 1 to form a two-layer perceptron. If there is an activation function 
g for the first layer of adaptive weights 

Re: Mahout GSoC 2010 proposal: Association Mining

2010-04-09 Thread Ted Dunning
Lukas,

The strongest alternative for this kind of application (and the normal
choice for large scale applications) is on-line gradient descent learning
with an L_1 or L_1 + L_2 regularization.  The typical goal is to predict
some outcome (click or purchase or signup) from a variety of large
vocabulary features.  As such association mining is usually just a
pre-processing step before actual learning is applied.  There is some
indication that an efficient sparse on-line gradient descent algorithm
applied to features and combinations could do just as well, especially if
the learning on combinations is applied in several passes.

These on-line algorithms have the virtue of being extremely fast and with
feature sharding, have substantial potential for parallel implementation.

What do you think about these two methods?  Can you compare them?
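
To make that concrete, the core of such an on-line learner is just a few lines. 
Here is a rough sketch (illustrative only, not Mahout code) of a single SGD 
update for logistic regression with L_1 applied by simple soft-thresholding; a 
production version would use lazy/truncated updates over sharded features:

public final class OnlineL1Logistic {
  private final double[] w;          // one weight per feature
  private final double learningRate;
  private final double lambda;       // L_1 strength

  public OnlineL1Logistic(int numFeatures, double learningRate, double lambda) {
    this.w = new double[numFeatures];
    this.learningRate = learningRate;
    this.lambda = lambda;
  }

  /** indices/values: sparse feature vector, y: 0 or 1 target. */
  public void train(int[] indices, double[] values, int y) {
    // prediction p = sigmoid(w . x)
    double dot = 0.0;
    for (int i = 0; i < indices.length; i++) {
      dot += w[indices[i]] * values[i];
    }
    double p = 1.0 / (1.0 + Math.exp(-dot));
    double error = y - p;
    double shrink = learningRate * lambda;
    for (int i = 0; i < indices.length; i++) {
      int j = indices[i];
      w[j] += learningRate * error * values[i];   // gradient step
      // L_1 shrinkage: pull the weight toward zero, clipping at zero
      if (w[j] > shrink) {
        w[j] -= shrink;
      } else if (w[j] < -shrink) {
        w[j] += shrink;
      } else {
        w[j] = 0.0;
      }
    }
  }
}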

On Fri, Apr 9, 2010 at 4:26 AM, Lukáš Vlček lukas.vl...@gmail.com wrote:

 One example would be click-stream analysis, where you can learn that people
 who visit a negative comment on a product blog never reach the order form. I
 am not saying this is the best example, but in general this is the essence of
 it: you simply need to take all possible values from the transaction into
 account, even those missing from the market basket.

 The biggest challenge in implementing this would be the fact that the
 analysis has to deal with all the data (not just the most frequent patterns)
 and all their combinations. It is very resource-intensive.



[jira] Updated: (MAHOUT-371) [GSoC] Proposal to implement Distributed SVD++ Recommender using Hadoop

2010-04-09 Thread Richard Simon Just (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Simon Just updated MAHOUT-371:
--

Description: 
Proposal Title: [MAHOUT-371] Proposal to implement Distributed SVD++ 
Recommender using Hadoop

Student Name: Richard Simon Just

Student E-mail:i...@richardsimonjust.co.uk

Organization/Project: Apache Mahout

Assigned Mentor:


Proposal Abstract: 
During the Netflix Prize Challenge one of the most popular forms of recommender 
algorithm was matrix factorisation, in particular Singular Value Decomposition 
(SVD). As such, this proposal looks to implement a distributed version of one of 
the most successful SVD-based recommender algorithms from the Netflix 
competition: the SVD++ algorithm. 
SVD++ improves upon other basic SVD algorithms by incorporating implicit 
feedback [1]. That is to say, it is able to take into account not just 
explicit user preferences, but also feedback such as, in the case of a company 
like Netflix, whether a movie has been rented. Implicit feedback assumes that 
the fact of there being some correlation between the user and the item is more 
important than whether the correlation is positive or negative. Implicit 
feedback would account for an item having been rated, but not what the rating 
was.
The implementation will include testing, in-depth documentation and a 
demo/tutorial. If there is time, I will also look at developing the algorithm 
into the timeSVD++ algorithm [2,3]. timeSVD++ further improves the results 
of the SVD algorithm by taking into account temporal dynamics. Temporal 
dynamics addresses the way user preferences for items, and their behaviour in 
how they rate items, can change over time. According to [2] the gains in 
accuracy from implementing timeSVD++ are significantly bigger than the gains 
going from SVD to SVD++. 
The overall project will provide three deliverables:
1. The basic framework for a distributed SVD-based recommender
2. A distributed SVD++ implementation and demo
3. A distributed timeSVD++ implementation



Detailed Description:

The SVD++ algorithm uses the principle of categorising users and items into 
factors, combined with regularisation and implicit feedback, to predict how much 
a user is likely to match an item. Factors are abstract categories that are 
created from comparing the data presented. Factor values are grades saying how 
strongly each user/item is related to that category. For example, with the 
Netflix data a factor could loosely correspond to a movie genre, a director or a 
story line. The more factors used, the more detailed the categories are likely 
to become, and thus the more accurate the predictions are likely to become. 

Implicit feedback is based on the theory that a user having any sort of 
relationship to an item is more important than whether they rated it, or what 
rating they gave. The assumption is that even if a user does not like an item, 
or has not rated it, the very fact that they chose to have some interaction 
with it indicates something about their preferences. In the Netflix case this 
would be represented by whether a user has rated a movie or not; it could also 
take into account the user's rental history. 
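
For reference, the SVD++ prediction rule from [1] has (roughly) the form:

  predictedRating(u,i) = mu + b_u + b_i
                         + q_i . ( p_u + |N(u)|^(-1/2) * sum over j in N(u) of y_j )

where mu is the global rating average, b_u and b_i are the user and item biases, 
p_u and q_i are the user and item factor vectors, and N(u) is the set of items 
for which user u has shown implicit feedback, each item j contributing an extra 
factor vector y_j.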

As well as the actual implementation of the code, the project has two other 
deliverable focuses: the readability, documentation and testing of the code; and 
a full tutorial and demo of the code. It is felt that without these things the 
implementation is not really complete or fully usable. 

The recommender consists of five main parts: the job class that runs the code, 
an input conversion section, a training section, a re-trainer section and a 
prediction section. A brief overview of these sections follows.


The SVD++ Classes:

The Recommender Job class:
This class is the foundation of the recommender and allows it to run on Hadoop 
by implementing the Tool interface through AbstractJob. This class will parse 
any user arguments and set up the jobs that will run the algorithm on 
Map/Reduce, in much the same way as Mahout's other distributed recommenders do, 
such as RecommenderJob.


Input Mapper/Reducer Classes:
The Mapper will take the input data and convert it to key/value pairs in the 
form of Hadoop Writables. The Reducer will take the Mapper's Writables and 
create sparse vectors. This is in keeping with Mahout's ToItemPrefsMapper and 
ToUserVectorReducer.

Training Mapper/Reducer Classes:
This phase creates the factor matrices that will be used to create the 
predictions later. It does this by making a prediction of a known rating using 
the SVD, comparing it against the known rating, calculating the error and 
updating the factors accordingly. The Mapper will loop through the training of 
the factor matrices. The Reducer will collect the outputs of the Mapper to 
create the dense factor matrices.
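
As a rough sketch, the per-rating update at the heart of this phase looks 
something like the following (shown here for plain SVD factors; the full SVD++ 
update also adjusts the implicit-feedback vectors, and the names are only 
illustrative):

{code}
void train(double rating, double[] userFactors, double[] itemFactors,
           double learningRate, double regularization) {
  // predict with the current factors
  double prediction = 0.0;
  for (int f = 0; f < userFactors.length; f++) {
    prediction += userFactors[f] * itemFactors[f];
  }
  double error = rating - prediction;
  // move both factor vectors a small step in the direction that reduces
  // the error, shrinking them slightly for regularisation
  for (int f = 0; f < userFactors.length; f++) {
    double uf = userFactors[f];
    double vf = itemFactors[f];
    userFactors[f] += learningRate * (error * vf - regularization * uf);
    itemFactors[f] += learningRate * (error * uf - regularization * vf);
  }
}
{code}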

Re-trainer Mapper/Reducer Classes:
This is a separate job 

Re: Mahout GSoC 2010 proposal: Association Mining

2010-04-09 Thread Lukáš Vlček
Ted,

do you think you can give some good links to papers or other resources about
the mentioned approaches? I would like to look at them after the weekend.
As far as I can see, association mining (and the GUHA method in its
original form) is not meant to be a predictive method but rather an exploratory
data method (although it has some kind of predictive power, this is not
formally supported in the theoretical background). However, comparing
association mining to other approaches could be a very interesting topic as
well.

Best regards,
Lukas

On Fri, Apr 9, 2010 at 8:03 PM, Ted Dunning ted.dunn...@gmail.com wrote:

 Lukas,

 The strongest alternative for this kind of application (and the normal
 choice for large scale applications) is on-line gradient descent learning
 with an L_1 or L_1 + L_2 regularization.  The typical goal is to predict
 some outcome (click or purchase or signup) from a variety of large
 vocabulary features.  As such association mining is usually just a
 pre-processing step before actual learning is applied.  There is some
 indication that an efficient sparse on-line gradient descent algorithm
 applied to features and combinations could do just as well, especially if
 the learning on combinations is applied in several passes.

 These on-line algorithms have the virtue of being extremely fast and with
 feature sharding, have substantial potential for parallel implementation.

 What do you think about these two methods?  Can you compare them?

 On Fri, Apr 9, 2010 at 4:26 AM, Lukáš Vlček lukas.vl...@gmail.com wrote:

  One example would be click-stream analysis, where you can learn that people
  who visit a negative comment on a product blog never reach the order form. I
  am not saying this is the best example, but in general this is the essence of
  it: you simply need to take all possible values from the transaction into
  account, even those missing from the market basket.

  The biggest challenge in implementing this would be the fact that the
  analysis has to deal with all the data (not just the most frequent patterns)
  and all their combinations. It is very resource-intensive.
 



Mahout GSoC 2010: Association Mining

2010-04-09 Thread Neal Clark
Hello,

I just wanted to introduce myself. I am an MSc Computer Science
student at the University of Victoria. My research over the past year
has been focused on developing and implementing an Apriori-based
frequent item-set mining algorithm for mining large data sets at low
support counts.

https://docs.google.com/Doc?docid=0ATkk_-6ZolXnZGZjeGYzNzNfOTBjcjJncGpkaAhl=en

The main finding of the above report is that support levels as low as
0.001% on the webdocs (1.4GB) dataset can be mined efficiently.
On a 100-core cluster all frequent k2 pairs can be calculated in
approximately 6 minutes.

I currently have an optimized k2 Hadoop implementation and algorithm
for generating frequent pairs, and I am extending my work to
itemsets of any length. The analysis of the extended approach will be
complete within the next two weeks.
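
To illustrate the shape of the k2 pass (this is not my actual implementation,
just a bare-bones sketch): the mapper emits every pair of items that co-occur
in a transaction, and a standard summing reducer then drops pairs below the
minimum support count.

import java.io.IOException;
import java.util.Arrays;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class PairCountMapper extends Mapper<Object, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);

  @Override
  protected void map(Object key, Text transaction, Context context)
      throws IOException, InterruptedException {
    // one transaction per line, items separated by whitespace
    String[] items = transaction.toString().split("\\s+");
    Arrays.sort(items); // canonical order so (a,b) and (b,a) share a key
    for (int i = 0; i < items.length; i++) {
      for (int j = i + 1; j < items.length; j++) {
        context.write(new Text(items[i] + '\t' + items[j]), ONE);
      }
    }
  }
}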

Would you be interested in moving forward with such an implementation
as a GSoC project? If so, any comments/feedback would be very much
appreciated. If you are interested I can create a proposal and submit
it to your issue tracker when it comes back online.

Thanks,

Neal.


Re: Mahout GSoC 2010: Association Mining

2010-04-09 Thread Ted Dunning
Neal, I think that this might well be a useful contribution to Mahout, but,
if I am not mistaken, I think that the deadline for student proposals for
GSoC has just passed.

That likely means that making this contribution an official GSoC project is
not possible.  I am sure that the Mahout community would welcome you as a
contributor even without official Google status.  If you would like to do
this, go ahead and propose what you want to do (when JIRA comes back or just
by email discussion) and you can get started.

On Fri, Apr 9, 2010 at 2:11 PM, Neal Clark ncl...@uvic.ca wrote:

 Hello,

 I just wanted to introduce myself. I am an MSc Computer Science
 student at the University of Victoria. My research over the past year
 has been focused on developing and implementing an Apriori-based
 frequent item-set mining algorithm for mining large data sets at low
 support counts.


 https://docs.google.com/Doc?docid=0ATkk_-6ZolXnZGZjeGYzNzNfOTBjcjJncGpkaAhl=en

 The main finding of the above report is that support levels as low as
 0.001% on the webdocs (1.4GB) dataset can be mined efficiently.
 On a 100-core cluster all frequent k2 pairs can be calculated in
 approximately 6 minutes.

 I currently have an optimized k2 Hadoop implementation and algorithm
 for generating frequent pairs, and I am extending my work to
 itemsets of any length. The analysis of the extended approach will be
 complete within the next two weeks.

 Would you be interested in moving forward with such an implementation
 as a GSoC project? If so, any comments/feedback would be very much
 appreciated. If you are interested I can create a proposal and submit
 it to your issue tracker when it comes back online.

 Thanks,

 Neal.