I think if you load a file, validate it, and then run ALTER TABLE ... ADD PARTITION against the final table at the end, then in the event of a crash you only have a partially loaded ETL file that no one will be querying anyway.
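A minimal sketch of that stage-validate-publish pattern in HiveQL (the table names, partition key, and paths are all hypothetical, and the idea of pointing the partition at the staged location is one way to do it, not the only one):

```sql
-- 1. Stage: land the raw file somewhere no production query looks.
LOAD DATA INPATH '/etl/incoming/events_2011_06_15'
INTO TABLE staging_events;

-- 2. Validate: run whatever checks you need against staging_events here
--    (row counts, malformed-record checks, etc.).

-- 3. Publish: only after validation succeeds, attach the data to the
--    final partitioned table. This is a quick metadata-only operation,
--    so a crash before this point leaves final_events untouched.
ALTER TABLE final_events
ADD PARTITION (ds = '2011-06-15')
LOCATION '/etl/staged/events_2011_06_15';
```

If the job dies anywhere before step 3, the staged files exist on HDFS but are invisible to anyone querying final_events.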
That should work, though I am not speaking from personal experience, at least not with Hive.

Guy

On Wed, Jun 15, 2011 at 12:11 PM, W S Chung <qp.wsch...@gmail.com> wrote:

> If the failure of the load is severe enough, e.g. the whole machine crashes, there might not be an opportunity to catch the exception and clean up the partition right away. The best I can think of is to clean up the partition in a background job that runs reasonably regularly. In that case, before the cleanup, is there any way I can prevent any query from seeing the data in the partition that should not be there?
>
> Or will this really happen? If the metadata is only updated after a successful load, the partition may not exist unless the load runs to the end.
>
> On Tue, Jun 14, 2011 at 12:21 PM, Guy Bayes <fatal.er...@gmail.com> wrote:
>
>> The easiest way to achieve a level of robustness is probably to load into a partition and then truncate the partition in the event of failure.
>>
>> Cleaning up after an incomplete load is a problem in many traditional RDBMSs too; you cannot always rely on rollback functionality.
>>
>> There are no explicit DELETEs in Hive, though, so whatever you need to do to massage and clean the data file is best done before inserting it into its final destination.
>>
>> Many of the things you bring up are ETL best practices rather than properties of an RDBMS implementation, though.
>>
>> Guy
>>
>> On Tue, Jun 14, 2011 at 8:57 AM, W S Chung <qp.wsch...@gmail.com> wrote:
>>
>>> My question is a "what if" question, not a production issue. It seems natural, when replacing a traditional database with Hive, to ask how much robustness is sacrificed for scalability. My concern is that if a file is partially loaded, there might not be an easy way to clean up the already loaded data before re-loading it.
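The background cleanup job described above could remove an orphaned partition with a single metadata operation. A hedged sketch (table and partition names are hypothetical, and IF EXISTS assumes a Hive version that supports it):

```sql
-- Hypothetical cleanup step: remove a partition left behind by a failed load.
-- For a managed table this drops both the partition metadata and the
-- underlying files, so queries immediately stop seeing the half-loaded data.
ALTER TABLE final_events DROP IF EXISTS PARTITION (ds = '2011-06-15');
```

Note that if the crash happened before the partition was ever added to the metastore, there is nothing to drop in Hive; the cleanup job would only need to delete the stray staged files on HDFS.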
>>> The lack of a unique index does not make it easy to avoid duplicate data either, although duplicated data can perhaps be deleted after the load.
>>>
>>> On Mon, Jun 13, 2011 at 7:12 PM, Martin Konicek <martin.koni...@gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> I think this is a problem with open source in general, and sometimes it can be very frustrating. However, your question is more of a "what if" question - you're not in the trouble of having found a horrible bug after deploying to production, am I right?
>>>>
>>>> Regarding your question, I would guess that if LOAD DATA INPATH crashes while moving files into the Hive warehouse, the data that was moved will appear as legitimately loaded data. Or the files will be moved but the metadata will not be updated. In any case, you should detect the crash and redo the operation. The easiest answer might actually be to look at the source code - sometimes it is easier to find than one would expect.
>>>>
>>>> Not a complete answer, but I hope this helps a bit.
>>>>
>>>> Martin
>>>>
>>>> On 14/06/2011 00:47, W S Chung wrote:
>>>>
>>>>> I submitted a question like this before, but somehow that question was never delivered. I cannot even find my question in Google. Since I cannot find any admin e-mail/feedback form on the Hive website where I can ask why the last question was not delivered, there is not much option other than to post the question again and hope that it gets through this time. Sorry for the double posting if you have seen my last e-mail.
>>>>>
>>>>> What is the behaviour if a client of Hive crashes in the middle of running a LOAD DATA INPATH for either a local file or a file on HDFS? Will the file be partially loaded in the db? Thanks.
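On the point above about deleting duplicated data after the load: since Hive has no DELETE, the usual workaround is to rewrite the affected partition without the duplicates. A sketch, assuming hypothetical table and column names and that duplicates are exact full-row copies:

```sql
-- Rewrite the partition in place, keeping one copy of each distinct row.
-- Hive writes the query result to a scratch directory first and only then
-- replaces the partition, so reading from the same partition is safe.
INSERT OVERWRITE TABLE final_events PARTITION (ds = '2011-06-15')
SELECT DISTINCT user_id, event_type, event_time
FROM final_events
WHERE ds = '2011-06-15';
```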