Hi All,

I need to load a month's worth of processed data into a Hive table. The
table has 10 partitions. Each day has many files to load (I have ~3000
files), and each file consistently takes two seconds. So it will take
days to complete the load for 30 days' worth of data.

I plan to load each day's data in parallel into its respective partition
so that I can complete it in a short time.

But I need clarification before proceeding.

Question:

1. Will loading in parallel into different partitions of the same Hive
table cause data loss or corruption?

For example, assume I am doing the following:

Table : processedlogs
Partition : logdate

Running the commands below in parallel:
LOAD DATA INPATH '/logs/processed/2013-04-01' OVERWRITE INTO TABLE
processedlogs PARTITION(logdate='2013-04-01');
LOAD DATA INPATH '/logs/processed/2013-04-02' OVERWRITE INTO TABLE
processedlogs PARTITION(logdate='2013-04-02');
LOAD DATA INPATH '/logs/processed/2013-04-03' OVERWRITE INTO TABLE
processedlogs PARTITION(logdate='2013-04-03');
LOAD DATA INPATH '/logs/processed/2013-04-04' OVERWRITE INTO TABLE
processedlogs PARTITION(logdate='2013-04-04');
.....
LOAD DATA INPATH '/logs/processed/2013-04-30' OVERWRITE INTO TABLE
processedlogs PARTITION(logdate='2013-04-30');
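
To make the plan concrete, here is a rough sketch of how I might generate and launch the one-statement-per-day pattern above. It is only an illustration under assumptions: the helper names (build_load_statements, run_with_hive_cli, run_parallel) are hypothetical, the `hive` CLI is assumed to be on the PATH, and max_workers=4 is an arbitrary choice. It does not answer whether running these concurrently is safe, which is exactly my question.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from datetime import date, timedelta


def build_load_statements(start, days, table="processedlogs", base="/logs/processed"):
    """Return one LOAD DATA statement per day, each targeting its own partition."""
    statements = []
    for offset in range(days):
        day = (start + timedelta(days=offset)).isoformat()  # e.g. 2013-04-01
        statements.append(
            f"LOAD DATA INPATH '{base}/{day}' OVERWRITE INTO TABLE "
            f"{table} PARTITION(logdate='{day}');"
        )
    return statements


def run_with_hive_cli(statement):
    """Run one statement through the hive CLI (assumes `hive` is on PATH)."""
    subprocess.run(["hive", "-e", statement], check=True)


def run_parallel(statements, runner=run_with_hive_cli, max_workers=4):
    """Hand every statement to a thread pool; re-raises the first runner failure."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # list() consumes the results so worker exceptions surface here
        list(pool.map(runner, statements))


# One statement for each day of April 2013, matching the example above
statements = build_load_statements(date(2013, 4, 1), 30)
```

Calling run_parallel(statements) would then submit all 30 loads, at most four at a time. The runner argument is there so the launch logic can be tried with a stand-in function before touching the real table.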

Thanks
Selva
