You are correct, delta files will be generated.

From: Mich Talebzadeh <mich.talebza...@gmail.com>
Reply-To: "user@hive.apache.org" <user@hive.apache.org>
Date: Monday, September 26, 2016 at 1:01 AM
To: user <user@hive.apache.org>
Subject: Re: populating Hive table periodically from files on HDFS
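For reference, when the delta files do pile up on a transactional table, a major compaction can also be requested by hand rather than waiting for the automatic compactor. A minimal sketch against the marketData table described further down the thread; the partition value is only illustrative:

ALTER TABLE marketData PARTITION (DateStamp = '2016-09-26') COMPACT 'major';
SHOW COMPACTIONS;

Once the compaction has completed, the deltas are merged into base files, which, per the concern raised below, is the state the table needs to be in before Spark can read it.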
Thanks Eugene. My table in Hive happens to be ORC, so making it bucketed and transactional is trivial. However, there is an underlying concern of mine. A Hive transactional table will generate a lot of delta files if it is periodically appended to. At least that is my understanding of what is going to happen. Once delta files are created, Spark is unable to read the data. This is unfortunate but true. My main analysis relies on Spark reading the Hive table; in that case the queries won't work. So it is a neat suggestion, but my concern is that we are going to be left with delta files, and until compaction happens nothing can be done to make Spark read the data. If my assumptions are incorrect, I stand corrected.

Regards

Dr Mich Talebzadeh

LinkedIn https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

http://talebzadehmich.wordpress.com

Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.

On 26 September 2016 at 04:50, Eugene Koifman <ekoif...@hortonworks.com> wrote:

Have you considered Hive Streaming? (https://cwiki.apache.org/confluence/display/Hive/Streaming+Data+Ingest) It is built for exactly this use case. Both Flume and Storm are integrated with it and write directly to your target table.

Eugene

From: Mich Talebzadeh <mich.talebza...@gmail.com>
Reply-To: "user@hive.apache.org" <user@hive.apache.org>
Date: Sunday, September 25, 2016 at 8:47 AM
To: user <user@hive.apache.org>
Subject: Re: populating Hive table periodically from files on HDFS

Thanks, I agree. I think using INSERT OVERWRITE to repopulate the data in the partition is bullet-proof, with nothing left behind. Performance looks good as well. When creating partitions by date, it seems to be more effective to partition by a single string of 'YYYY-MM-DD' rather than use a multi-depth partition of Year, Month, Day etc. I thought about bucketing the partitions, but one needs to balance the housekeeping against the number of buckets within each partition, so I did not bother.

Cheers

Dr Mich Talebzadeh

LinkedIn https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

http://talebzadehmich.wordpress.com

On 25 September 2016 at 12:19, Jörn Franke <jornfra...@gmail.com> wrote:

I think what you propose makes sense. If you did a delta load you would not gain much in performance (most likely you would lose performance, because you need to figure out what has changed, and you have the typical issues of distributed systems: some changes may arrive late, error handling, etc.). Especially given the volumes. You could partition smaller, but given the volumes that is not really needed.
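Since Hive Streaming writes only into bucketed, transactional ORC tables, here is a minimal sketch of what such a target table could look like; the column names and bucket count are assumptions made up for the example, only the DateStamp partition column comes from the thread below:

CREATE TABLE marketData (
  ticker      STRING,
  price       DOUBLE,
  timecreated STRING
)
PARTITIONED BY (DateStamp STRING)
CLUSTERED BY (ticker) INTO 8 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional' = 'true');

The ACID machinery also has to be switched on server-side (hive.txn.manager set to DbTxnManager, the compactor initiator and worker threads enabled), otherwise the table is transactional in name only.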
On 25 Sep 2016, at 12:32, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

Hi,

I have trade data delivered through Kafka and Flume as CSV files to HDFS. There are 100 prices every 2 seconds, so in a minute there are 3,000 new rows, 180,000 rows an hour and 4,320,000 new rows a day.

Flume creates a new sub-directory partition every day on HDFS in the format YYYY-MM-DD, like prices/2015-09-25.

There is an external Hive table pointing to the new directory; this is done by simply altering the external table location:

ALTER TABLE ${DATABASE}.externalMarketData SET LOCATION 'hdfs://rhes564:9000/data/prices/${TODAY}';

This means that the external Hive table only ever points to the current day's directory.

The target internal table in Hive is partitioned by DateStamp = "YYYY-MM-DD":

PARTITIONED BY (DateStamp string)

To populate the Hive table, a cron job runs every 15 minutes and simply does:

INSERT OVERWRITE TABLE ${DATABASE}.marketData PARTITION (DateStamp = "${TODAY}")
SELECT ... FROM ${DATABASE}.externalMarketData;

So effectively, every 15 minutes today's partition is overwritten with new data from the external table. This seems to be OK.

The other option is to add only the rows that have arrived since the last run, with an INSERT INTO ... WHERE the rows do not exist in the target table.

Any other suggestions?

Thanks

Dr Mich Talebzadeh

LinkedIn https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

http://talebzadehmich.wordpress.com
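On the incremental alternative: Hive of that vintage expresses "rows that do not exist in the target table" most naturally as a left outer join with a NULL filter rather than a NOT EXISTS subquery. A minimal sketch, assuming a hypothetical unique key column trade_id and assuming the external and target tables share the same column layout; the real key and column list would have to be substituted:

INSERT INTO TABLE ${DATABASE}.marketData PARTITION (DateStamp = "${TODAY}")
SELECT e.*
FROM ${DATABASE}.externalMarketData e
LEFT OUTER JOIN ${DATABASE}.marketData m
  ON m.trade_id = e.trade_id AND m.DateStamp = "${TODAY}"
WHERE m.trade_id IS NULL;

Whether this beats the plain INSERT OVERWRITE depends on how expensive the join against the growing daily partition becomes; as Jörn notes above, at these volumes the straightforward overwrite is likely the simpler and safer choice.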