Hello,

I am trying to implement a general Bayesian inference machine learning
algorithm. The algorithm is really simple as an idea, but since I am new to
the Hadoop ecosystem, I lack some experience. Anyway, here is the idea:

On the one hand I have a stream of <user_id, site_id> data, and on the
other I have a static table of <site_id, probability> data. Also, let's
suppose that the streaming data relate to a single user only. What I want
to do is read the streaming records one by one - i.e. the user's page
visits - look up the corresponding probability in the static table, and do
some very simple computations to update the user-specific probability.
Finally, store the updated probability somewhere - in memory - as it will
serve as the prior when we process the next streaming record and update the
user-specific probability again. This process goes on until all
<user_id, site_id> pairs have been read.
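To make the idea concrete, here is a minimal, non-Hadoop sketch of the loop I have in mind. The update rule below (treating the table value as a likelihood and renormalising) is only a placeholder - the actual "very simple computation" depends on the model - and all the names and data are made up for illustration:

```python
# Static <site_id, probability> table (tiny, kept in memory).
site_prob = {"s1": 0.2, "s2": 0.9, "s3": 0.5}

def update(prior, site_p):
    # Placeholder Bayesian-style update: treat site_p as a likelihood
    # for the event of interest and renormalise.
    num = prior * site_p
    return num / (num + (1 - prior) * (1 - site_p))

# Per-user state, updated as each streaming record arrives.
user_prob = {}

# Stream of <user_id, site_id> pairs (here just a list for the sketch).
stream = [("u1", "s1"), ("u1", "s2"), ("u1", "s3")]
for user_id, site_id in stream:
    prior = user_prob.get(user_id, 0.5)  # uninformative prior for a new user
    user_prob[user_id] = update(prior, site_prob[site_id])
```

After the loop, `user_prob` holds the latest user-specific probability, which is exactly the value that would need to survive between records in a streaming setting.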

My questions are summarised as follows:

1) Is there anything similar already implemented in Mahout?
2) Do I need Mahout to implement this algorithm at all?
3) Should I use Pig + a UDF instead?
4) Or should I do everything in MapReduce?
5) How could I keep the static - and tiny - table in memory, so as to
avoid loading it again and again?

Any help is very much appreciated.
