If that is the case, I'll just need to clean up the partially loaded HDFS file in a background job. That should do.

On Wed, Jun 15, 2011 at 3:28 PM, Guy Bayes <fatal.er...@gmail.com> wrote:

> I think if you load a file, validate it, and then *alter table add
> partition* to the final table at the end, in the event of a crash you only
> have a partially loaded ETL file that no one will be querying anyway.
>
> That should work, though I am not speaking from personal experience, at
> least not with Hive.
>
> Guy
>
> On Wed, Jun 15, 2011 at 12:11 PM, W S Chung <qp.wsch...@gmail.com> wrote:
>
>> If the failure of the load is severe enough, like the whole machine
>> crashing, there might not be an opportunity to catch the exception and
>> clean up the partition right away. The best I can think of is to clean up
>> the partition in a background job reasonably regularly. In that case,
>> before the cleanup, is there any way I can prevent queries from seeing
>> the data in the partition that should not be there?
>>
>> Or will this really happen? If the metadata is only updated after a
>> successful load, the partition may not exist unless the load runs to its
>> end.
>>
>> On Tue, Jun 14, 2011 at 12:21 PM, Guy Bayes <fatal.er...@gmail.com> wrote:
>>
>>> The easiest way to achieve a level of robustness is probably to load
>>> into a partition and then truncate the partition in the event of failure.
>>>
>>> Cleaning up after an incomplete load is a problem in many traditional
>>> RDBMSs; you cannot always rely on rollback functionality.
>>>
>>> There are no explicit deletes in Hive though, so whatever you need to do
>>> to massage and clean the data file is best done prior to inserting it
>>> into its final destination.
>>>
>>> Many of the things you bring up are more ETL best practices than
>>> properties of an RDBMS implementation though.
>>>
>>> Guy
>>>
>>> On Tue, Jun 14, 2011 at 8:57 AM, W S Chung <qp.wsch...@gmail.com> wrote:
>>>
>>>> My question is a "what if" question, not a production issue. It seems
>>>> natural, when replacing a traditional database with Hive, to ask how
>>>> much robustness is sacrificed for scalability. My concern is that if a
>>>> file is partially loaded, there might not be an easy way to clean up
>>>> the already loaded data before re-loading it. The lack of a unique
>>>> index also makes it hard to avoid duplicate data, although duplicated
>>>> data can perhaps be deleted after the load.
>>>>
>>>> On Mon, Jun 13, 2011 at 7:12 PM, Martin Konicek
>>>> <martin.koni...@gmail.com> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I think this is a problem with open source in general and sometimes
>>>>> it can be very frustrating.
>>>>> However, your question is more of a "what if" question - you're not
>>>>> in the trouble of finding a horrible bug after you deployed to
>>>>> production, am I right?
>>>>>
>>>>> Regarding your question, I would guess that if LOAD DATA INPATH
>>>>> crashes while moving files into the Hive warehouse, the data which
>>>>> was moved will appear as legitimately loaded data. Or the files will
>>>>> be moved but the metadata will not be updated. In any case, you
>>>>> should detect the crash and redo the operation. The easiest answer
>>>>> might actually be to look into the source code - sometimes it can be
>>>>> easier to find than one would expect.
>>>>>
>>>>> Not a complete answer, but hope this helps a bit.
>>>>>
>>>>> Martin
>>>>>
>>>>> On 14/06/2011 00:47, W S Chung wrote:
>>>>>
>>>>>> I submitted a question like this before, but somehow it was never
>>>>>> delivered. I can even find my question in Google. Since I cannot
>>>>>> find any admin e-mail or feedback form on the Hive website where I
>>>>>> can ask why the last question was not delivered, there is not much
>>>>>> option other than to post the question again and hope that it gets
>>>>>> through this time. Sorry for the double posting if you have seen my
>>>>>> last e-mail.
>>>>>>
>>>>>> What is the behaviour if a Hive client crashes in the middle of
>>>>>> running a "load data inpath" for either a local file or a file on
>>>>>> HDFS? Will the file be partially loaded in the db? Thanks.
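
A minimal sketch of the pattern suggested in this thread: copy the file to a staging directory, validate it, and only then register it with ALTER TABLE ... ADD PARTITION, while a background job removes staging directories that were never registered. The table name, partition column, paths, the StagedLoad type, and both helper functions are hypothetical. Nothing here was tested against a Hive cluster; the code only builds the statement and picks cleanup candidates, it does not talk to Hive or HDFS.

```python
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass(frozen=True)
class StagedLoad:
    """Hypothetical description of one file staged for loading."""
    table: str          # final table that queries read from
    partition_col: str  # partition column name
    partition_val: str  # partition value for this load
    staging_dir: str    # HDFS directory the file was copied into first


def publish_statement(load: StagedLoad) -> str:
    """Build the statement that makes a validated staging directory visible.

    Until this statement runs, the metastore has no partition pointing at
    the directory, so a crash during the copy leaves nothing queryable.
    """
    return (
        f"ALTER TABLE {load.table} ADD PARTITION "
        f"({load.partition_col}='{load.partition_val}') "
        f"LOCATION '{load.staging_dir}'"
    )


def orphaned_staging_dirs(
    staging_mtimes: Dict[str, float],
    registered_locations: Set[str],
    now: float,
    min_age_seconds: float = 3600.0,
) -> List[str]:
    """Pick staging directories a background job could delete.

    A directory qualifies if no partition is registered at its location and
    it is old enough that a load is assumed not to be still in progress.
    The one-hour default is an arbitrary assumption.
    """
    return sorted(
        path
        for path, mtime in staging_mtimes.items()
        if path not in registered_locations and now - mtime >= min_age_seconds
    )
```

The age threshold is there so the cleanup job does not delete a directory that a running load is still writing to; how the list of registered partition locations is obtained is left open.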