Hi Ted,
Thanks a lot for your response. I think we are both describing the same
problem, but let me explain further what I want to do by quoting the
following prototype in R. Please keep in mind that the observations are not
all available at the very beginning of the process; they arrive as a stream
of data (see the observ variable). However, we do know the prior, which is
either the up-to-date prior for that particular user (if the user has
already visited our site in the past) or the mean of the priors of all
users (if this is the first time that particular user visits our system).
prior <- 0.526
observ <- c(0.758, 0.619, 0.657, 0.463, 0.740, 0.557, 0.634, 0.711, 0.828,
            0.726, 0.704, 0.783, 0.467, 0.666, 0.729, 0.614, 0.568)
for (p in observ) {
  # combine the latest estimate with the new observation via Bayes' rule
  lastp <- prior[length(prior)]
  prior <- c(prior, (p * lastp) / (p * lastp + (1 - p) * (1 - lastp)))
}
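For clarity, here is the same sequential update written as a Python sketch (the names mirror the R prototype above; nothing here is Mahout-specific):

```python
# Sequential Bayesian update of a user's probability, mirroring the R
# prototype: each observation p is combined with the latest estimate
# via Bayes' rule for two complementary hypotheses.
prior = 0.526
observ = [0.758, 0.619, 0.657, 0.463, 0.740, 0.557, 0.634, 0.711, 0.828,
          0.726, 0.704, 0.783, 0.467, 0.666, 0.729, 0.614, 0.568]

history = [prior]
for p in observ:
    last = history[-1]
    history.append((p * last) / (p * last + (1 - p) * (1 - last)))
```

Note that an observation of p = 0.5 leaves the estimate unchanged, while p above or below 0.5 pulls it up or down.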
Any help is very much appreciated.
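Also, to make sure I read your count-based suggestion below correctly, here is how I would sketch it in Python (the value of mu and the container names are my own choices for illustration, not from any Mahout API):

```python
from collections import defaultdict

# Counters for the scheme: joint (user, site) counts, per-user counts,
# per-site counts, and a grand total.
k_joint = defaultdict(int)
k_user = defaultdict(int)
k_site = defaultdict(int)
k_total = 0
mu = 5.0  # assumed smoothing weight; controls trust in the empirical prior

def update(user_id, site_id):
    """Record one (user_id, site_id) observation."""
    global k_total
    k_joint[(user_id, site_id)] += 1
    k_user[user_id] += 1
    k_site[site_id] += 1
    k_total += 1

def probability(user_id, site_id):
    """p = [k_joint + mu * (k_site / k_total)] / [k_user + mu]"""
    site_prior = k_site[site_id] / k_total if k_total else 0.0
    return (k_joint[(user_id, site_id)] + mu * site_prior) / (k_user[user_id] + mu)

# Usage: feed a few observations, then query a smoothed probability.
update("u1", "s1")
update("u1", "s2")
update("u2", "s1")
p = probability("u1", "s1")
```

A user with no observations falls back entirely to the site's global popularity, which matches the "mean of the priors of all users" case I described above.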
On Thu, May 23, 2013 at 7:22 PM, Ted Dunning <[email protected]> wrote:
> It isn't clear what you want to do. You say general Bayesian inference,
> but then you seem to refer to a very specific, non-general form of
> inference.
>
> It also seems that you are never considering distributions at all, but
> merely doing something like a Laplace correction [1] to compute the
> probability p(site_id | user_id) that the next site a user visits, as the
> mean of the posterior distribution.
>
> To do this most easily, I think that you should just store the per-user
> count with the key (user_id, site_id) and counts for the user and the site
> with the keys (user_id) and (site_id). You will also need the total of all
> observations. You then have
>
> *Update *
>
> Given an observation (user_id, site_id):
> k_joint(user_id, site_id)++
> k_user(user_id)++
> k_site(site_id)++
> k_total++
>
> *Compute Probability*
>
> p = [k_joint(user_id, site_id) + \mu (k_site(site_id) / k_total)] /
> [k_user(user_id) + \mu]
>
> In this equation, you can set \mu to control how much weight you give to
> the empirical prior.
>
> [1] http://en.wikipedia.org/wiki/Rule_of_succession
>
>
> On Thu, May 23, 2013 at 2:55 AM, Adamantios Corais <
> [email protected]> wrote:
>
> > Hello,
> >
> > I am trying to implement the General Bayesian Inference machine learning
> > algorithm. The algorithm is really simple as an idea, but since I am new
> > to the Hadoop ecosystem, I lack some experience. Anyway, here is the idea:
> >
> > On one hand I have a stream of <user_id, site_id> data, and on the
> > other I have a static table of <site_id, probability> data. Also, let's
> > suppose that the streaming data are related to a specific user only. What
> > I want to do is read the streaming data one by one - i.e. the user's page
> > visits - go to the static table, retrieve the corresponding probability,
> > and do some very simple computations in order to compute the
> > user-specific probability. Finally, store the updated probability
> > somewhere - in memory - as it will be the one used when we consider the
> > next streaming line, updating the user-specific probability again. This
> > process goes on and on until all <user_id, site_id> pairs have been read.
> >
> > My questions are summarised as follows:
> >
> > 1) Is there anything similar to be implemented already into mahout?
> > 2) Do I need mahout to implement this algorithm anyway?
> > 3) Should I use Pig + UDF instead?
> > 4) Or should I do everything in MapReduce?
> > 5) How could I keep the static - and tiny - table in memory so as to
> > avoid loading it again and again?
> >
> > Any help is very much appreciated.
> >
>