Hi,
As a data mining developer who needs to build a recommender engine POC (proof of
concept) to support several future use cases, I've found the Mahout framework an
appealing place to start. But since I'm new to Mahout and to Hadoop in general,
I have a couple of questions...
1. In "Mahout in Action", section 3.2.5 (Database-based data), it
says: "...Several classes in Mahout's recommender implementation will attempt
to push computations into the database for performance...". I've looked through
the documentation and the code itself, but couldn't find a reference to which
computations are actually pushed into the DB. Could you please explain what can
be done inside the DB?
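To make the question concrete, here is the kind of aggregate I imagine being pushed down, sketched in Python with sqlite3 standing in for the real database (the table and column names are my own invention, not necessarily Mahout's):

```python
import sqlite3

# Toy preference table in the classic (user_id, item_id, preference) shape.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE taste_preferences (user_id INT, item_id INT, preference REAL)"
)
conn.executemany(
    "INSERT INTO taste_preferences VALUES (?, ?, ?)",
    [(1, 101, 5.0), (1, 102, 3.0), (2, 101, 2.0), (2, 103, 4.0), (3, 102, 4.5)],
)

# Counts like these are cheap as SQL aggregates, but expensive if the client
# first has to pull every preference row over JDBC and count in Java.
num_users = conn.execute(
    "SELECT COUNT(DISTINCT user_id) FROM taste_preferences"
).fetchone()[0]
num_items = conn.execute(
    "SELECT COUNT(DISTINCT item_id) FROM taste_preferences"
).fetchone()[0]
prefs_for_item = conn.execute(
    "SELECT COUNT(*) FROM taste_preferences WHERE item_id = 101"
).fetchone()[0]

print(num_users, num_items, prefs_for_item)  # 3 3 2
```

Is this roughly the class of computation the book means, or does Mahout push down more than simple counts?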
2. My future work will include use cases with small-to-medium data volumes
(where I guess the non-distributed algorithms will do the job), but also use
cases involving huge amounts of data (over 500,000,000 ratings), which is where
I understand the distributed code should come in handy. Since I will need both
the distributed and the non-distributed code, my question is: how should I
design for that?
Should I build two separate solutions on different machines? Could I run
part of the job distributed (for example, the similarity computation) and feed
its output into the non-distributed code? Is that a BKM (best known method)?
Also, if I deploy the entire Mahout codebase on a Hadoop environment, what does
that mean for the non-distributed code? Will it all run as a separate Java
process on the NameNode?
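To illustrate the split I have in mind (the pair format and names below are my own sketch, not Mahout's actual job output): a distributed job would emit item-item similarity pairs, and a small non-distributed scorer would load them into memory and serve recommendations:

```python
from collections import defaultdict

# Item-item similarities as a distributed job might emit them:
# (item_a, item_b, similarity). In reality these would be read from HDFS.
similarity_pairs = [
    (101, 102, 0.8),
    (101, 103, 0.3),
    (102, 103, 0.5),
    (102, 104, 0.9),
]

# Build a symmetric in-memory lookup for the non-distributed side.
sim = defaultdict(dict)
for a, b, s in similarity_pairs:
    sim[a][b] = s
    sim[b][a] = s

def recommend(user_prefs, top_n=2):
    """Weighted-sum item-based scoring over the precomputed similarities."""
    scores = defaultdict(float)
    weights = defaultdict(float)
    for item, rating in user_prefs.items():
        for other, s in sim[item].items():
            if other in user_prefs:
                continue  # skip items the user already rated
            scores[other] += s * rating
            weights[other] += s
    ranked = sorted(((scores[i] / weights[i], i) for i in scores), reverse=True)
    return [item for _, item in ranked[:top_n]]

print(recommend({101: 5.0, 102: 4.0}))  # → [103, 104]
```

Does that kind of batch-then-serve split make sense as an architecture, or is there a more standard way to combine the two sides?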
3. For now, besides the Hadoop cluster we are building, we have some
strong SQL machines (a Netezza appliance) that can handle big (structured) data
and integrate well with third-party analytics providers and Java development,
but that don't include a rich recommender framework like Mahout. I'm trying to
understand how I could use both (Netezza and Mahout) to handle big-data
recommender use cases. I thought of moving the data into Netezza, doing all the
data manipulation and transformation there, and finally preparing a file
containing the classic data model structure Mahout needs. Can you think of a
better solution or architecture? Maybe keeping the data only inside Netezza and
extracting it into Mahout over JDBC when needed? I'd be glad to hear your
ideas :)
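For the file-export option, this is roughly the flow I'm picturing, again sketched in Python with sqlite3 standing in for the Netezza/JDBC side (table and column names invented):

```python
import csv
import io
import sqlite3

# sqlite3 stands in here for a JDBC connection to the appliance.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ratings (user_id INT, item_id INT, rating REAL)")
conn.executemany(
    "INSERT INTO ratings VALUES (?, ?, ?)",
    [(1, 101, 5.0), (1, 102, 3.0), (2, 101, 2.0)],
)

# All cleaning/transformation stays in SQL; the export is just a dump into
# the classic "userID,itemID,preference" CSV that Mahout's FileDataModel reads.
out = io.StringIO()
writer = csv.writer(out, lineterminator="\n")
for row in conn.execute(
    "SELECT user_id, item_id, rating FROM ratings ORDER BY user_id, item_id"
):
    writer.writerow(row)

print(out.getvalue())
```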
Thanks,
Oren
---------------------------------------------------------------------
Intel Electronics Ltd.
This e-mail and any attachments may contain confidential material for
the sole use of the intended recipient(s). Any review or distribution
by others is strictly prohibited. If you are not the intended
recipient, please contact the sender and delete all copies.