Have you considered Hive Streaming?  
(https://cwiki.apache.org/confluence/display/Hive/Streaming+Data+Ingest)
It's built for exactly this use case.
Both Flume and Storm are integrated with it and write directly to your target
table.
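
For reference, Hive Streaming writes into bucketed, ORC-backed, transactional
tables, so the target would need to look roughly like the sketch below (the
table name, columns and bucket count are illustrative only, not taken from
this thread):

-- Minimal sketch of a table that Hive Streaming can write to:
-- it must be bucketed, stored as ORC and flagged transactional.
CREATE TABLE marketData (
  ticker STRING,
  price  DOUBLE,
  tstamp TIMESTAMP
)
PARTITIONED BY (DateStamp STRING)
CLUSTERED BY (ticker) INTO 4 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional' = 'true');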

Eugene

From: Mich Talebzadeh <mich.talebza...@gmail.com>
Reply-To: "user@hive.apache.org" <user@hive.apache.org>
Date: Sunday, September 25, 2016 at 8:47 AM
To: user <user@hive.apache.org>
Subject: Re: populating Hive table periodically from files on HDFS

Thanks

I agree. I think using INSERT OVERWRITE to repopulate the data in the partition
is bulletproof, with nothing left behind. Performance looks good as well.

When partitioning by date, it seems more effective to partition by a single
'YYYY-MM-DD' string rather than a multi-level partition of Year, Month, Day,
etc.
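
For example (table and column names here are illustrative, not from this
thread), the flat single-string layout creates one directory level per day,
whereas the multi-level alternative, PARTITIONED BY (Year INT, Month INT, Day
INT), creates three nested directory levels per day:

CREATE TABLE prices_by_day (
  ticker STRING,
  price  DOUBLE
)
PARTITIONED BY (DateStamp STRING);   -- e.g. DateStamp = '2016-09-25'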

I thought about bucketing within the partitions, but one needs to balance the
housekeeping against the number of buckets in each partition, so I did not
bother.

Cheers




Dr Mich Talebzadeh



LinkedIn  
https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com


Disclaimer: Use it at your own risk. Any and all responsibility for any loss,
damage or destruction of data or any other property which may arise from
relying on this email's technical content is explicitly disclaimed. The author
will in no case be liable for any monetary damages arising from such loss,
damage or destruction.



On 25 September 2016 at 12:19, Jörn Franke <jornfra...@gmail.com> wrote:
I think what you propose makes sense. If you did a delta load you would not
gain much in performance (most likely you would lose performance, because you
need to work out what has changed, and you face the typical distributed-systems
issues: some changes arriving late, error handling, etc.), especially given the
volumes.
You could partition more finely, but again, given the volumes it is not really
needed.

On 25 Sep 2016, at 12:32, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:


Hi,

I have trade data delivered through Kafka and Flume as CSV files to HDFS. There
are 100 prices every 2 seconds, i.e. 50 rows a second, so there are 3,000 new
rows a minute, 180,000 rows an hour and 4,320,000 new rows a day.

Flume creates a new sub-directory partition every day in the format YYYY-MM-DD,
like prices/2015-09-25, on HDFS.

There is an external Hive table pointing to the new directory; it is switched
over by simply altering the external table location:

ALTER TABLE ${DATABASE}.externalMarketData set location 
'hdfs://rhes564:9000/data/prices/${TODAY}';

This means that the external Hive table only points to the current directory.
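
For illustration, a minimal sketch of what such an external table could look
like, assuming comma-delimited text files; the column names (ticker, price,
tstamp) are purely illustrative, not taken from this thread:

CREATE EXTERNAL TABLE externalMarketData (
  ticker STRING,
  price  DOUBLE,
  tstamp STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 'hdfs://rhes564:9000/data/prices/${TODAY}';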

The target internal table in Hive is partitioned by DateStamp = "YYYY-MM-DD":

PARTITIONED BY (DateStamp string)

To populate the Hive table, a cron job runs every 15 minutes and simply does:

INSERT OVERWRITE TABLE ${DATABASE}.marketData PARTITION (DateStamp = "${TODAY}")
SELECT
  ...
FROM ${DATABASE}.externalMarketData;

So effectively every 15 minutes today's partition is overwritten by new data 
from the external table.

This seems to be OK.

The other option is to add only the rows that are new since the last run, with
an INSERT INTO that filters out rows already present in the target table, as
sketched below.
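
A sketch of that delta-load variant as a left anti-join, with the same
illustrative columns as above and assuming (ticker, tstamp) identifies a row;
note that every run has to re-scan today's target partition to find the rows
already loaded:

INSERT INTO TABLE ${DATABASE}.marketData PARTITION (DateStamp = "${TODAY}")
SELECT e.ticker, e.price, e.tstamp
FROM ${DATABASE}.externalMarketData e
LEFT OUTER JOIN (
  -- rows already loaded into today's partition
  SELECT ticker, tstamp
  FROM ${DATABASE}.marketData
  WHERE DateStamp = "${TODAY}"
) m ON m.ticker = e.ticker AND m.tstamp = e.tstamp
WHERE m.ticker IS NULL;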

Any other suggestions?


Thanks









