Hi Srivathsan

I did get your requirement, and if you are up for developing a map-reduce program, you can achieve the same in two steps:

1. Sqoop-import the data into HDFS.
2. Use a custom map-reduce job to denormalize it.
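For step 1, a plain per-table import is enough; something like the following (the connection URL, credentials, table name and target directory here are placeholder assumptions, not from your setup):

sqoop import \
  --connect jdbc:mysql://dbhost/school \
  --username dbuser -P \
  --table students \
  --target-dir /user/hadoop/students

Repeat the same for the address table, then feed both target directories into the join job.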
For step 2, in your map-reduce job, join the two data sets using the foreign key. For this:

In Mapper
- Use MultipleInputs to parse the two different inputs.
- In each mapper, project the data with the foreign key as the key and the combination of required column values as the value.
- Prefix an identifier to each value so that the source of the values can be identified in the reducer.

In Reducer
- The identifier tells you the source of the data (which table the data is from).
- Append the data from the different sources as required (choose the output key/value as required).

A rough sketch of such a reduce-side join is appended below, after the quoted message.

The solution is a crude one and has its flaws; I just scribbled a possible solution from my limited expertise. Experts, please share a more sophisticated approach which you feel is optimal in handling such requirements.

Regards
Bejoy K S

-----Original Message-----
From: Srivathsan Srinivas <periya.d...@gmail.com>
Date: Sat, 20 Aug 2011 23:06:06
To: <sqoop-user@incubator.apache.org>
Reply-To: sqoop-user@incubator.apache.org
Subject: denormalize data during import

Hi,

I am looking for a way to denormalize my tables when importing from MySQL to HDFS/Hive. What is the best approach? Is there a way to add columns dynamically when using Sqoop, or by any other means?

For example, an RDBMS table of students might have the student name and address(es), stored in normalized form across different tables. What I want during my import process is a single row for each student, with multiple columns for all the addresses of a given student (like addr_1, addr_2, etc.).

I can write UDFs for it, but I do not know how to attach them to Sqoop. I understand that Sqoop can run SQL queries (such as a join) before importing, but that will add more rows; I want a single row with multiple columns. Should I have UDFs in Sqoop or Oozie in my workflow and denormalize them on the fly? I have not tried it. Is there any suggested approach? I am beginning to look at Solr's DataImportHandler...perhaps that could solve this.

Suggestions are appreciated.

Thanks,
PD.
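Here is a minimal sketch of the reduce-side join described above, using the org.apache.hadoop.mapreduce API. The class names, column positions and comma delimiter are assumptions for illustration (Sqoop's default text output is comma-delimited); adjust them to the actual schemas:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class DenormalizeJoin {

  // Emits (student_id, "S|name") for each student record.
  // Assumes comma-delimited input with the id in column 0
  // and the name in column 1.
  public static class StudentMapper
      extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context ctx)
        throws IOException, InterruptedException {
      String[] cols = value.toString().split(",");
      ctx.write(new Text(cols[0]), new Text("S|" + cols[1]));
    }
  }

  // Emits (student_id, "A|address") for each address record.
  // Assumes the foreign key in column 0 and the address in column 1.
  public static class AddressMapper
      extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context ctx)
        throws IOException, InterruptedException {
      String[] cols = value.toString().split(",");
      ctx.write(new Text(cols[0]), new Text("A|" + cols[1]));
    }
  }

  // The "S|"/"A|" prefix identifies the source table. All addresses
  // are appended after the student name, giving one denormalized
  // row per student id.
  public static class JoinReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context ctx)
        throws IOException, InterruptedException {
      String name = "";
      StringBuilder addrs = new StringBuilder();
      for (Text v : values) {
        String s = v.toString();
        if (s.startsWith("S|")) {
          name = s.substring(2);
        } else {
          addrs.append(",").append(s.substring(2));
        }
      }
      ctx.write(key, new Text(name + addrs));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "denormalize join");
    job.setJarByClass(DenormalizeJoin.class);
    // One mapper class per input directory; no job-wide mapper is set.
    MultipleInputs.addInputPath(job, new Path(args[0]),
        TextInputFormat.class, StudentMapper.class);
    MultipleInputs.addInputPath(job, new Path(args[1]),
        TextInputFormat.class, AddressMapper.class);
    job.setReducerClass(JoinReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    FileOutputFormat.setOutputPath(job, new Path(args[2]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Run it with the two Sqoop target directories and an output directory as arguments, e.g. hadoop jar denorm.jar DenormalizeJoin /user/hadoop/students /user/hadoop/addresses /user/hadoop/denormalized. Note that the reducer buffers nothing but the name, so it scales to many addresses per student.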