It isn't clear what you want to do.  You say general Bayesian inference,
but then you seem to refer to a very specific, non-general form of
inference.

It also seems that you are never considering distributions at all, but
merely doing something like Laplace smoothing[1] to compute the
probability p(site_id | user_id) that the next site a user visits will be
site_id, taken as the mean of the posterior distribution.

To do this most easily, I think you should just store the per-user
count with the key (user_id, site_id), plus counts for the user and the
site under the keys (user_id) and (site_id).  You will also need the total
of all observations.  You then have

*Update*

Given an observation (user_id, site_id):
    k_joint(user_id, site_id)++
    k_user(user_id)++
    k_site(site_id)++
    k_total++

*Compute Probability*

    p = [k_joint(user_id, site_id) + \mu * (k_site(site_id) / k_total)] /
        [k_user(user_id) + \mu]

In this equation, you can set \mu to control how much weight you give to
the empirical prior.
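
To make the update and probability steps concrete, here is a minimal
Python sketch of the counting scheme above (variable and function names
are mine, not part of any Mahout API):

```python
from collections import Counter

# Counters for the scheme described above.
k_joint = Counter()   # observations per (user_id, site_id) pair
k_user = Counter()    # observations per user_id
k_site = Counter()    # observations per site_id
k_total = 0           # total observations

def update(user_id, site_id):
    """Record one (user_id, site_id) observation."""
    global k_total
    k_joint[(user_id, site_id)] += 1
    k_user[user_id] += 1
    k_site[site_id] += 1
    k_total += 1

def probability(user_id, site_id, mu=5.0):
    """Smoothed estimate of p(site_id | user_id).

    mu controls how much weight the empirical prior
    k_site / k_total gets relative to the user's own counts.
    """
    prior = k_site[site_id] / k_total
    return (k_joint[(user_id, site_id)] + mu * prior) / (k_user[user_id] + mu)
```

With a large mu, the estimate stays close to the global popularity of the
site; with mu near zero, it tracks the user's own history. Note that the
estimates for a user sum to 1 over all observed sites, since the priors do.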

[1] http://en.wikipedia.org/wiki/Rule_of_succession


On Thu, May 23, 2013 at 2:55 AM, Adamantios Corais <
[email protected]> wrote:

> Hello,
>
> I am trying to implement the General Bayesian Inference machine learning
> algorithm. The algorithm is really simple as an idea, but since I am new to
> the hadoop ecosystem, I lack some experience. Anyway, here is the idea:
>
> On the one hand I have a stream of <user_id, site_id> data, and on the
> other I have a static table of <site_id, probability> data. Also, let's
> suppose that the streaming data are related to a specific user only. What I
> want to do is read the streaming data one by one - i.e. the user's visited
> pages - go to the static table, retrieve the corresponding probability, and
> do some very simple computations in order to compute the user-specific
> probability. Finally, store the updated probability somewhere - in memory -
> as it will be the one used when we consider the second streaming
> line, toward updating the user-specific probability. This process
> goes on and on until all <user_id, site_id> pairs have been read.
>
> My questions are summarised as follows:
>
> 1) Is there anything similar to be implemented already into mahout?
> 2) Do I need mahout to implement this algorithm anyway?
> 3) Should I use Pig + UDF instead?
> 4) Or I should do everything in MapReduce?
> 5) How could I store the static - and tiny - table in memory so as to avoid
> loading of it again and again?
>
> Any help is very much appreciated.
>
