Re: Recommended approaches for large data import from RDBMS into hive (Approx Terabyte of data)

Thiruvel Thirumoolan Sun, 27 Mar 2011 03:30:39 -0700

I am not sure how this is going to work over such a large dataset as I have 
seen examples of a staging table with 900,000 rows of data fail when a dynamic 
partition import is applied.


What was the problem? I have created 120,000 dynamic partitions using 
distribute by. Guess that will solve your failure? See  
http://wiki.apache.org/hadoop/Hive/Tutorial#Dynamic-partition_Insert

On Mar 23, 2011, at 3:59 PM, Ryan Greenhall wrote:

My desire is to be able to run a query on the existing RDBMS (SQLServer) to 
de-normalise the required data into single rows that can then be imported into 
a hive table partitioned by date.

Previously, on a much smaller scale, I have achieved data import into Hive by 
copying a tsv of my data (already partitioned) into HDFS and then loading into 
Hive, for example.

LOAD DATA LOCAL INPATH `/tmp/pv_2008-06-08_us.txt` INTO TABLE page_view 
PARTITION(date='2008-06-08')


I have considered the following options and was looking for feedback or more 
effective solutions.

1) Create a data file for each day (we have over a years worth of data) and 
then import as above.  On a large scale this is likely to be rather cumbersome, 
although it could be scripted.

2) Use sqoop to import all the data into a stageing table and then use the 
dynamic partition insert.  Unfortunately Sqoop does not allow (as far as I can 
see) import into partitions.

I am not sure how this is going to work over such a large dataset as I have 
seen examples of a staging table with 900,000 rows of data fail when a dynamic 
partition import is applied.

Thanks in advance

Ryan

Re: Recommended approaches for large data import from RDBMS into hive (Approx Terabyte of data)

Reply via email to