I am not sure how this is going to work over such a large dataset as I have seen examples of a staging table with 900,000 rows of data fail when a dynamic partition import is applied.
What was the problem? I have created 120,000 dynamic partitions using distribute by. Guess that will solve your failure? See http://wiki.apache.org/hadoop/Hive/Tutorial#Dynamic-partition_Insert On Mar 23, 2011, at 3:59 PM, Ryan Greenhall wrote: My desire is to be able to run a query on the existing RDBMS (SQLServer) to de-normalise the required data into single rows that can then be imported into a hive table partitioned by date. Previously, on a much smaller scale, I have achieved data import into Hive by copying a tsv of my data (already partitioned) into HDFS and then loading into Hive, for example. LOAD DATA LOCAL INPATH `/tmp/pv_2008-06-08_us.txt` INTO TABLE page_view PARTITION(date='2008-06-08') I have considered the following options and was looking for feedback or more effective solutions. 1) Create a data file for each day (we have over a years worth of data) and then import as above. On a large scale this is likely to be rather cumbersome, although it could be scripted. 2) Use sqoop to import all the data into a stageing table and then use the dynamic partition insert. Unfortunately Sqoop does not allow (as far as I can see) import into partitions. I am not sure how this is going to work over such a large dataset as I have seen examples of a staging table with 900,000 rows of data fail when a dynamic partition import is applied. Thanks in advance Ryan
