I think if you load a file, validate it, and then run ALTER TABLE ... ADD PARTITION against the final table at the end, then in the event of a crash you only have a partially loaded ETL file that no one will be querying anyway.
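A minimal sketch of that stage-validate-publish pattern in HiveQL (the table names, partition key, and paths are all hypothetical, and the idea of pointing the partition at the staged location is one way to do it, not the only one):

```sql
-- 1. Stage: land the raw file somewhere no production query looks.
LOAD DATA INPATH '/etl/incoming/events_2011_06_15'
INTO TABLE staging_events;

-- 2. Validate: run whatever checks you need against staging_events here
--    (row counts, malformed-record checks, etc.).

-- 3. Publish: only after validation succeeds, attach the data to the
--    final partitioned table. This is a quick metadata-only operation,
--    so a crash before this point leaves final_events untouched.
ALTER TABLE final_events
ADD PARTITION (ds = '2011-06-15')
LOCATION '/etl/staged/events_2011_06_15';
```

If the job dies anywhere before step 3, the staged files exist on HDFS but are invisible to anyone querying final_events.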
That should work, though I am not speaking from personal experience, at least not with Hive.

Guy

On Wed, Jun 15, 2011 at 12:11 PM, W S Chung <qp.wsch...@gmail.com> wrote:

> If the failure of the load is severe enough, e.g. the whole machine crashes, there might not be an opportunity to catch the exception and clean up the partition right away. The best I can think of is to clean up the partition in a background job that runs reasonably regularly. In that case, before the cleanup, is there any way I can prevent any query from seeing the data in the partition that should not be there?
>
> Or will this really happen? If the metadata is only updated after a successful load, the partition may not exist unless the load runs to the end.
>
> On Tue, Jun 14, 2011 at 12:21 PM, Guy Bayes <fatal.er...@gmail.com> wrote:
>
>> The easiest way to achieve a level of robustness is probably to load into a partition and then truncate the partition in the event of failure.
>>
>> Cleaning up after an incomplete load is a problem in many traditional RDBMSs too; you cannot always rely on rollback functionality.
>>
>> There are no explicit DELETEs in Hive, though, so whatever you need to do to massage and clean the data file is best done before inserting it into its final destination.
>>
>> Many of the things you bring up are ETL best practices rather than properties of an RDBMS implementation, though.
>>
>> Guy
>>
>> On Tue, Jun 14, 2011 at 8:57 AM, W S Chung <qp.wsch...@gmail.com> wrote:
>>
>>> My question is a "what if" question, not a production issue. It seems natural, when replacing a traditional database with Hive, to ask how much robustness is sacrificed for scalability. My concern is that if a file is partially loaded, there might not be an easy way to clean up the already loaded data before re-loading it.
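The background cleanup job described above could remove an orphaned partition with a single metadata operation. A hedged sketch (table and partition names are hypothetical, and IF EXISTS assumes a Hive version that supports it):

```sql
-- Hypothetical cleanup step: remove a partition left behind by a failed load.
-- For a managed table this drops both the partition metadata and the
-- underlying files, so queries immediately stop seeing the half-loaded data.
ALTER TABLE final_events DROP IF EXISTS PARTITION (ds = '2011-06-15');
```

Note that if the crash happened before the partition was ever added to the metastore, there is nothing to drop in Hive; the cleanup job would only need to delete the stray staged files on HDFS.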
>>> The lack of a unique index does not make it easy to avoid duplicate data either, although duplicated data can perhaps be deleted after the load.
>>>
>>> On Mon, Jun 13, 2011 at 7:12 PM, Martin Konicek <martin.koni...@gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> I think this is a problem with open source in general, and sometimes it can be very frustrating. However, your question is more of a "what if" question - you're not in the trouble of having found a horrible bug after deploying to production, am I right?
>>>>
>>>> Regarding your question, I would guess that if LOAD DATA INPATH crashes while moving files into the Hive warehouse, the data that was moved will appear as legitimately loaded data. Or the files will be moved but the metadata will not be updated. In any case, you should detect the crash and redo the operation. The easiest answer might actually be to look at the source code - sometimes it is easier to find than one would expect.
>>>>
>>>> Not a complete answer, but I hope this helps a bit.
>>>>
>>>> Martin
>>>>
>>>> On 14/06/2011 00:47, W S Chung wrote:
>>>>
>>>>> I submitted a question like this before, but somehow that question was never delivered. I cannot even find my question in Google. Since I cannot find any admin e-mail/feedback form on the Hive website where I can ask why the last question was not delivered, there is not much option other than to post the question again and hope that it gets through this time. Sorry for the double posting if you have seen my last e-mail.
>>>>>
>>>>> What is the behaviour if a client of Hive crashes in the middle of running a LOAD DATA INPATH for either a local file or a file on HDFS? Will the file be partially loaded in the db? Thanks.
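On the point above about deleting duplicated data after the load: since Hive has no DELETE, the usual workaround is to rewrite the affected partition without the duplicates. A sketch, assuming hypothetical table and column names and that duplicates are exact full-row copies:

```sql
-- Rewrite the partition in place, keeping one copy of each distinct row.
-- Hive writes the query result to a scratch directory first and only then
-- replaces the partition, so reading from the same partition is safe.
INSERT OVERWRITE TABLE final_events PARTITION (ds = '2011-06-15')
SELECT DISTINCT user_id, event_type, event_time
FROM final_events
WHERE ds = '2011-06-15';
```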