Hello, I am trying to implement a General Bayesian Inference machine learning algorithm. The idea behind the algorithm is really simple, but since I am new to the Hadoop ecosystem, I lack experience with it. Anyway, here is the idea:
On the one hand I have a stream of <user_id, site_id> data, and on the other I have a static table of <site_id, probability> data. Let's also suppose that the streaming data relate to a specific user only. What I want to do is read the streaming records one by one - i.e. the user's page visits - look up the corresponding probability in the static table, and do some very simple computations to update the user-specific probability. Finally, I store the updated probability somewhere - in memory - so that it can serve as the prior when the next streaming record is processed, again updating the user-specific probability. This process goes on until all <user_id, site_id> pairs have been read. My questions are summarised as follows:

1) Is there anything similar already implemented in Mahout?
2) Do I need Mahout to implement this algorithm at all?
3) Should I use Pig + a UDF instead?
4) Or should I do everything in MapReduce?
5) How could I keep the static - and tiny - table in memory so as to avoid loading it again and again?

Any help is very much appreciated.
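To make the computation concrete, here is a plain (non-Hadoop) Python sketch of the per-user update I have in mind. The specific update rule, the 0.5 starting prior, and all names and sample values here are just illustrative assumptions, not the final implementation:

```python
# Static <site_id, probability> table, small enough to hold in memory.
# (Sample values are made up for illustration.)
static_probs = {"site_a": 0.9, "site_b": 0.3, "site_c": 0.6}

def bayes_update(prior, likelihood):
    """One illustrative Bayesian update: combine the current prior
    with the per-site evidence and return the posterior."""
    numerator = prior * likelihood
    return numerator / (numerator + (1.0 - prior) * (1.0 - likelihood))

# In-memory store of each user's current probability.
user_prob = {}

# Stand-in for the <user_id, site_id> stream.
stream = [("u1", "site_a"), ("u1", "site_b"), ("u1", "site_c")]

for user_id, site_id in stream:
    p_site = static_probs[site_id]              # look up the static table
    prior = user_prob.get(user_id, 0.5)         # uninformative prior at first
    user_prob[user_id] = bayes_update(prior, p_site)  # posterior becomes next prior

print(user_prob["u1"])
```

Each record's posterior is stored back into `user_prob`, so it becomes the prior for the next record of the same user - which is exactly the sequential updating described above.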
