Hi,
As a data mining developer who needs to build a recommender engine POC (proof of
concept) to support several future use cases, I've found the Mahout framework an
appealing place to start. But since I'm new to Mahout and to Hadoop in general,
I have a couple of questions...
1. In "Mahout in Action", section 3.2.5 (Database-based data), it
says: "...Several classes in Mahout's recommender implementation will attempt
to push computations into the database for performance...". I've looked through
the documentation and the code itself, but couldn't find a reference to which
computations are actually pushed into the DB. Could you please explain what can
be done inside the DB?
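To make the question concrete, here is the kind of aggregate I imagine being pushed down, sketched in Python with sqlite3 standing in for the real database (the table and column names are my own invention, not necessarily Mahout's):

```python
import sqlite3

# Toy preference table in the classic (user_id, item_id, preference) shape.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE taste_preferences (user_id INT, item_id INT, preference REAL)"
)
conn.executemany(
    "INSERT INTO taste_preferences VALUES (?, ?, ?)",
    [(1, 101, 5.0), (1, 102, 3.0), (2, 101, 2.0), (2, 103, 4.0), (3, 102, 4.5)],
)

# Counts like these are cheap as SQL aggregates, but expensive if the client
# first has to pull every preference row over JDBC and count in Java.
num_users = conn.execute(
    "SELECT COUNT(DISTINCT user_id) FROM taste_preferences"
).fetchone()[0]
num_items = conn.execute(
    "SELECT COUNT(DISTINCT item_id) FROM taste_preferences"
).fetchone()[0]
prefs_for_item = conn.execute(
    "SELECT COUNT(*) FROM taste_preferences WHERE item_id = 101"
).fetchone()[0]

print(num_users, num_items, prefs_for_item)  # 3 3 2
```

Is this roughly the class of computation the book means, or does Mahout push down more than simple counts?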
2. My future work will include use cases with small-to-medium data volumes
(where I guess the non-distributed algorithms will do the job), but also use
cases involving huge amounts of data (over 500,000,000 ratings), which is where
I understand the distributed code should come in handy. Since I will need both
the distributed and the non-distributed code, my question is: how should I
design for that?
Should I build two separate solutions on different machines? Could I run
part of the job distributed (for example, the similarity computation) and feed
its output into the non-distributed code? Is that a BKM (best known method)?
Also, if I deploy the entire Mahout codebase on a Hadoop environment, what does
that mean for the non-distributed code? Will it all run as a separate Java
process on the NameNode?
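To illustrate the split I have in mind (the pair format and names below are my own sketch, not Mahout's actual job output): a distributed job would emit item-item similarity pairs, and a small non-distributed scorer would load them into memory and serve recommendations:

```python
from collections import defaultdict

# Item-item similarities as a distributed job might emit them:
# (item_a, item_b, similarity). In reality these would be read from HDFS.
similarity_pairs = [
    (101, 102, 0.8),
    (101, 103, 0.3),
    (102, 103, 0.5),
    (102, 104, 0.9),
]

# Build a symmetric in-memory lookup for the non-distributed side.
sim = defaultdict(dict)
for a, b, s in similarity_pairs:
    sim[a][b] = s
    sim[b][a] = s

def recommend(user_prefs, top_n=2):
    """Weighted-sum item-based scoring over the precomputed similarities."""
    scores = defaultdict(float)
    weights = defaultdict(float)
    for item, rating in user_prefs.items():
        for other, s in sim[item].items():
            if other in user_prefs:
                continue  # skip items the user already rated
            scores[other] += s * rating
            weights[other] += s
    ranked = sorted(((scores[i] / weights[i], i) for i in scores), reverse=True)
    return [item for _, item in ranked[:top_n]]

print(recommend({101: 5.0, 102: 4.0}))  # → [103, 104]
```

Does that kind of batch-then-serve split make sense as an architecture, or is there a more standard way to combine the two sides?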
3. For now, besides the Hadoop cluster we are building, we have some
strong SQL machines (a Netezza appliance) that can handle big (structured) data
and integrate well with third-party analytics providers and Java development,
but that don't include a rich recommender framework like Mahout. I'm trying to
understand how I could use both (Netezza and Mahout) to handle big-data
recommender use cases. I thought of moving the data into Netezza, doing all the
data manipulation and transformation there, and finally preparing a file
containing the classic data model structure Mahout needs. Can you think of a
better solution or architecture? Maybe keeping the data only inside Netezza and
extracting it into Mahout over JDBC when needed? I'd be glad to hear your
ideas :)
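For the file-export option, this is roughly the flow I'm picturing, again sketched in Python with sqlite3 standing in for the Netezza/JDBC side (table and column names invented):

```python
import csv
import io
import sqlite3

# sqlite3 stands in here for a JDBC connection to the appliance.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ratings (user_id INT, item_id INT, rating REAL)")
conn.executemany(
    "INSERT INTO ratings VALUES (?, ?, ?)",
    [(1, 101, 5.0), (1, 102, 3.0), (2, 101, 2.0)],
)

# All cleaning/transformation stays in SQL; the export is just a dump into
# the classic "userID,itemID,preference" CSV that Mahout's FileDataModel reads.
out = io.StringIO()
writer = csv.writer(out, lineterminator="\n")
for row in conn.execute(
    "SELECT user_id, item_id, rating FROM ratings ORDER BY user_id, item_id"
):
    writer.writerow(row)

print(out.getvalue())
```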
Thanks,
Oren
---------------------------------------------------------------------
Intel Electronics Ltd.
This e-mail and any attachments may contain confidential material for
the sole use of the intended recipient(s). Any review or distribution
by others is strictly prohibited. If you are not the intended
recipient, please contact the sender and delete all copies.