Thanks

I agree. I think using INSERT OVERWRITE to repopulate the data in the
partition is bulletproof, with nothing left behind. Performance looks good
as well.
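
For reference, a minimal sketch of the two approaches (the column list is
elided as in the job itself; price_id is a hypothetical unique key, not the
actual schema):

-- Overwrite: the whole daily partition is rewritten, so re-runs are
-- idempotent and nothing stale is left behind.
INSERT OVERWRITE TABLE ${DATABASE}.marketData PARTITION (DateStamp = "${TODAY}")
SELECT *   -- the real job lists the columns explicitly
FROM ${DATABASE}.externalMarketData;

-- Delta alternative (not used): append only rows not already present.
-- Needs a unique key and a Hive version with NOT EXISTS subquery support.
INSERT INTO TABLE ${DATABASE}.marketData PARTITION (DateStamp = "${TODAY}")
SELECT e.*
FROM ${DATABASE}.externalMarketData e
WHERE NOT EXISTS
  (SELECT 1 FROM ${DATABASE}.marketData m
   WHERE m.DateStamp = "${TODAY}" AND m.price_id = e.price_id);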

When partitioning by date, it seems more effective to partition on a single
'YYYY-MM-DD' string rather than use a multi-level partition of year, month,
day, etc.
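
As a rough sketch of the two layouts (the columns shown are illustrative):

-- One partition per day, keyed by a single date string:
CREATE TABLE marketData (ticker string, price double, ts timestamp)
PARTITIONED BY (DateStamp string);    -- e.g. DateStamp = '2016-09-25'

-- Multi-level alternative: three partition columns and three nested
-- directory levels to maintain and to supply on every insert and query.
CREATE TABLE marketData_multi (ticker string, price double, ts timestamp)
PARTITIONED BY (`year` smallint, `month` tinyint, `day` tinyint);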

I thought about bucketing within the partitions, but one needs to balance
the housekeeping against the number of buckets in each partition, so I did
not bother.
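
For completeness, the bucketed variant I decided against would look roughly
like this (bucket column, bucket count and columns are illustrative):

CREATE TABLE marketData_bucketed (ticker string, price double, ts timestamp)
PARTITIONED BY (DateStamp string)
CLUSTERED BY (ticker) INTO 32 BUCKETS;

-- On older Hive releases the inserts would also need:
-- SET hive.enforce.bucketing = true;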

Cheers



Dr Mich Talebzadeh



LinkedIn:
https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 25 September 2016 at 12:19, Jörn Franke <jornfra...@gmail.com> wrote:

> I think what you propose makes sense. If you did a delta load you would not
> gain much of a performance benefit (most likely you would lose performance,
> because you need to figure out what has changed, deal with the typical
> issues of distributed systems where some changes arrive late, handle
> errors, etc.), especially given the volumes.
> You could partition at a finer grain, but given the volumes it is not
> really needed.
>
> On 25 Sep 2016, at 12:32, Mich Talebzadeh <mich.talebza...@gmail.com>
> wrote:
>
>
> Hi,
>
> I have trade data delivered through Kafka and Flume as CSV files to HDFS.
> There are 100 prices every 2 seconds, so that is 3,000 new rows a minute,
> 180,000 rows an hour and 4,320,000 new rows a day.
>
> Flume creates a new sub-directory partition every day on HDFS, named in
> YYYY-MM-DD format, e.g. prices/2015-09-25.
>
> An external Hive table points to the new directory; each day its location
> is simply altered:
>
> ALTER TABLE ${DATABASE}.externalMarketData set location
> 'hdfs://rhes564:9000/data/prices/${TODAY}';
>
> This means that the external Hive table always points to the current day's
> directory only.
>
> The target internal table in Hive is partitioned by DateStamp = "YYYY-MM-DD":
>
> PARTITIONED BY (DateStamp  string)
>
> To populate the Hive table, a cron job runs every 15 minutes and simply does:
>
> INSERT OVERWRITE TABLE ${DATABASE}.marketData PARTITION (DateStamp =
> "${TODAY}")
> SELECT
>   ...
> FROM ${DATABASE}.externalMarketData
>
> So effectively every 15 minutes *today's partition* is overwritten by new
> data from the external table.
>
> This seems to be OK.
>
> The other option is to add only the new rows since the last run, using
> INSERT INTO with a WHERE clause so that rows already in the target table
> are not inserted again.
>
> Any other suggestions?
>
>
> Thanks
