If the loading failure is severe enough, such as the whole machine crashing, there might not be an opportunity to catch the exception and clean up the partition right away. The best I can think of is to clean up the partition in a background job that runs reasonably regularly. In that case, before the cleanup runs, is there any way I can prevent queries from seeing the data in the partition that should not be there?
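To make this concrete, the kind of background cleanup I have in mind is roughly the following. The table "web_logs", the partition column "load_date" and the path are all made up for illustration, and I have not verified this; it is only a sketch:

    -- Drop the partition that a crashed load may have left half-populated,
    -- so that queries no longer see whatever files did make it in.
    ALTER TABLE web_logs DROP PARTITION (load_date='2011-06-14');

    -- Re-run the load from the original source file once the partition is gone.
    LOAD DATA INPATH '/incoming/web_logs/2011-06-14'
    OVERWRITE INTO TABLE web_logs PARTITION (load_date='2011-06-14');

The part I am worried about is the window between the crash and the moment this job drops the partition.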
Or will this really happen? If the metadata is only updated after a successful load, the partition may not exist unless the load runs to its end. (At the bottom of this mail I have put a rough sketch of the staging-first approach Guy described, mostly to check that I understand it.)

On Tue, Jun 14, 2011 at 12:21 PM, Guy Bayes <fatal.er...@gmail.com> wrote:
> The easiest way to achieve a level of robustness is probably to load into a
> partition and then truncate the partition in the event of failure.
>
> Cleaning up after an incomplete load is a problem in many traditional
> RDBMSs; you cannot always rely on rollback functionality.
>
> There are no explicit deletes in Hive though, so whatever you need to do to
> massage and clean the data file is best done prior to inserting it into its
> final destination.
>
> Many of the things you bring up are more ETL best practices than properties
> of an RDBMS implementation though.
>
> Guy
>
> On Tue, Jun 14, 2011 at 8:57 AM, W S Chung <qp.wsch...@gmail.com> wrote:
>
>> My question is a "what if" question, not a production issue. It seems
>> natural, when replacing a traditional database with Hive, to ask how much
>> robustness is sacrificed for scalability. My concern is that if a file is
>> partially loaded, there might not be an easy way to clean up the already
>> loaded data before re-loading it. The lack of a unique index does not make
>> it easy to avoid duplicate data either, although duplicated data can
>> perhaps be deleted after the load.
>>
>> On Mon, Jun 13, 2011 at 7:12 PM, Martin Konicek <martin.koni...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> I think this is a problem with open source in general and sometimes it
>>> can be very frustrating. However, your question is more of a "what if"
>>> question - you're not in the position of having found a horrible bug after
>>> deploying to production, am I right?
>>>
>>> Regarding your question, I would guess that if LOAD DATA INPATH crashes
>>> while moving files into the Hive warehouse, the data which was moved will
>>> appear as legitimate loaded data. Or the files will be moved but the
>>> metadata will not be updated. In any case, you should detect the crash and
>>> redo the operation. The easiest answer might actually be to look into the
>>> source code - sometimes it can be easier to find than one would expect.
>>>
>>> Not a complete answer, but I hope this helps a bit.
>>>
>>> Martin
>>>
>>> On 14/06/2011 00:47, W S Chung wrote:
>>>
>>>> I submitted a question like this before, but somehow it was never
>>>> delivered; I cannot even find my question in Google. Since I cannot find
>>>> any admin e-mail or feedback form on the Hive website where I could ask
>>>> why the last question was not delivered, there is not much option other
>>>> than to post the question again and hope that it gets through this time.
>>>> Sorry for the double posting if you have seen my last e-mail.
>>>>
>>>> What is the behaviour if a client of Hive crashes in the middle of
>>>> running a "load data inpath" for either a local file or a file on HDFS?
>>>> Will the file be partially loaded in the db? Thanks.
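Here is the sketch I mentioned above: my reading of Guy's suggestion to do the massaging and cleaning before the data reaches its final destination. The idea would be to load into a staging table first and only copy into the real table once the load has finished, so a query against the final table never sees a half-loaded partition. The names "web_logs_staging", "web_logs", the columns and the path are made up, and I have not tried this:

    -- 1. Load the raw file into a staging table; a crash here only affects staging.
    LOAD DATA INPATH '/incoming/web_logs/2011-06-14'
    OVERWRITE INTO TABLE web_logs_staging PARTITION (load_date='2011-06-14');

    -- 2. Only after the load (and any validation queries) succeed, publish the
    --    data into the final table. A crash before this step leaves the final
    --    table untouched.
    INSERT OVERWRITE TABLE web_logs PARTITION (load_date='2011-06-14')
    SELECT ip, url, bytes_sent
    FROM web_logs_staging
    WHERE load_date='2011-06-14';

    -- 3. On failure, either re-run step 1 with OVERWRITE or drop the staging partition.
    ALTER TABLE web_logs_staging DROP PARTITION (load_date='2011-06-14');

Is that roughly what you meant?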