Hi Bijoy, Your suggested solution sounds good to me as it gives the flexibility to the user to decide how best to modify the data.
One alternative that could work in a limited manner is to use dynamic views instead of tables that have the denormalized relations already. That would allow the user to run a Sqoop import and directly get the necessary data set out in one shot. Thanks, Arvind On Sun, Aug 21, 2011 at 12:33 PM, <bejo...@gmail.com> wrote: > ** Hi Srivathsan > I did get your requirement and if you are in for developing a map reduce > program, may be you can achieve the same in two steps. > 1. SQOOP Import data into HDFS > 2. Use a custom map reduce to de normalize the same. > In your map reduce join the two data sets using the foreign key. For this > > In Mapper > -Use Multiple Input Format to parse the two different inputs > -project the data in each mapper with key as the foreign key and the value > as the combination of required column values in each case > -prefix an identifier to identify the source of the values in the reducer > > In Reducer > - the identifier helps to know the source of data(which table the data is > from) > - Append the data from different sources as required(choose output key > value as required) > > The solution is a crude one and has its flaws. Just scribbled a possible > solution from my limited expertise. > > Experts, please share some sophisticated approach which you feel is optimal > in handling such requirements. > > Regards > Bejoy K S > ------------------------------ > *From: * Srivathsan Srinivas <periya.d...@gmail.com> > *Date: *Sat, 20 Aug 2011 23:06:06 -0700 > *To: *<sqoop-user@incubator.apache.org> > *ReplyTo: * sqoop-user@incubator.apache.org > *Subject: *denormalize data during import > > Hi, > I am looking for a way to denormalize my tables when I am importing from > MySQL to HDFS/HIVE. What is the best approach? Is there a way to add columns > dynamically when using Sqoop or any other means? > > For eg., an RDBMS table of students might have student name and > address(es). This may be a normalized form in different tables. What I want > during my import process is a single row for each student where there will > be multiple columns for all the addresses for a given student (like addr_1, > addr_2, ...etc). I can write UDFs for it, but, do not know how to attach it > to sqoop. I undertand that sqoop can add SQL queries (such as join) before > importing. But, that will add more rows. I want a single row with multiple > columns. > > Should I have UDFs in Sqoop or Oozie in my workflow and denormalize them on > the fly? I have not tried it. Is there any suggested approach? I am > beginning to look at Solr's DataImportHandler...perhaps that could solve > this. > > Suggestions are appreciated. > > Thanks, > PD. >