Hi Laxman, We use both #1 and #3 from MySQL which also has hi speed exports. For our 300G and 340M rows, #1 takes us around 3 hours, with Sqoop it is closer to 8 hrs to our 3 node cluster.
We are having issues with delimiters though (since we have \r, \t and \n in the database), and now using Avro as a compression to overcome these. A colleague of mine is currently doing 2 patches for Sqoop to fix some bugs in this. For the majority of our processing we don't use HBase but straight to HDFS, and we use Sqoop wrapping it all up in Oozie workflows which is working very nicely. We do a lot of ETL processing from one DB to another through an Oozie workflow having Sqoop and Hive. HTH, Tim On Tue, Jan 31, 2012 at 7:21 AM, Jonathan Hsieh <[email protected]> wrote: > Hi Laxman, > > I'm an Apache HBase and Sqoop committer. I haven't run the comparison > you've suggested but my first thoughts are to consider #1 and #3. > > Case 1 will natively export data which should be the fastest way to get > data out of Oracle. You many need to do some reformatting to use HBase's > importtsv. > > Case 2, writing your own DBInputFormat, is essentially duplicating what > Sqoop does in the generic case. Here it is essentially executing sql > queries against the rdbms to get data out which will likely be a bit slower > than a database's native bulk export feature. > > Case 3, consider using Apache Sqoop in conjunction with Quest's data > connector for oracle and hadoop. It is a free as in beer plugin to sqoop! > This is probably the fastest from a getting started point of view (no dev > time) but may not be as performant as #1. I'd give it a try. > http://www.quest.com/data-connector-for-oracle-and-hadoop/ > > If you have more questions about Sqoop, feel free to cross post to > [email protected]! > > Jon. > > On Mon, Jan 30, 2012 at 5:48 AM, Laxman <[email protected]> wrote: > >> Hi, >> >> We have the following use-case. >> >> We have data in relational database (Oracle). >> We need to export this data to HBase and perform analysis on this data. >> We need to perform this export-import 500G periodically, say every month. >> >> Following are the different approaches I can see as per my knowledge. >> Before testing and finding out the best way by myself, I wanted to listen >> from the experts here. >> >> Approach #1 >> =========== >> 1) Export from Oracle to raw text file (Using Oracle export utility - >> Faster >> - Involves no transactional overhead) >> >> 2) Upload text file to HDFS >> >> 3) Run the bulk load job (HFileOutputFormat.configureIncrementalLoad()) >> >> Approach #2 >> =========== >> 1) Write a custom Job using DBInputFormat to directly read from database. >> - Just a thought to avoid multiple hops(Oracle to Local FS, Local FS >> to HDFS, HDFS to HBase) involved in approach #1. >> >> 2) Use the HBase bulk load tool to load this data to >> HBase.(HFileOutputFormat.configureIncrementalLoad()) >> >> Approach #3 >> =========== >> 1) Use Apache Sqoop (Currently under incubation) to achieve my requirement. >> - I'm not aware of the istability of this. >> >> Also, please suggest me if we have a better approach than the above. >> >> >> >> > > > -- > // Jonathan Hsieh (shay) > // Software Engineer, Cloudera > // [email protected]
