Re: load data unit of work

W S Chung Wed, 15 Jun 2011 12:48:10 -0700

If that is the case, I'll just need to clean up the partially loaded HDFS
file in a background job. That should do.
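
A minimal sketch of the approach discussed in this thread, assuming the data
is first loaded and validated in a staging directory that no query reads, and
the partition is only added to the final table afterwards. The table name,
partition column, paths and both helper functions are hypothetical; the only
Hive feature relied on is ALTER TABLE ... ADD PARTITION ... LOCATION. The
second helper picks out staging directories old enough to be treated as
leftovers of a crashed load, which is what the background job would remove.

```python
from datetime import datetime, timedelta


def add_partition_statement(table, partition_col, partition_value, location):
    """Build the HiveQL that publishes an already validated directory.

    Hypothetical helper: values are interpolated as-is, so they must come
    from trusted input (no quoting or escaping is done here).
    """
    return (
        f"ALTER TABLE {table} ADD PARTITION "
        f"({partition_col}='{partition_value}') LOCATION '{location}'"
    )


def stale_staging_dirs(staging_dirs, now, max_age):
    """Return staging paths not modified for longer than max_age.

    staging_dirs maps a path to its last modification time. How that
    listing is obtained from HDFS is left out of this sketch.
    """
    return sorted(
        path for path, mtime in staging_dirs.items() if now - mtime > max_age
    )


if __name__ == "__main__":
    # Step 1 (not shown): copy the file into the staging path and validate it.
    # Step 2: publish it by adding the partition to the final table.
    print(add_partition_statement(
        "events", "dt", "2011-06-15", "/etl/staging/events/dt=2011-06-15"))

    # Background job: anything untouched for more than a day is assumed to
    # be the remains of a failed load.
    listing = {
        "/etl/staging/events/dt=2011-06-13": datetime(2011, 6, 13, 9, 0),
        "/etl/staging/events/dt=2011-06-15": datetime(2011, 6, 15, 12, 0),
    }
    print(stale_staging_dirs(
        listing, datetime(2011, 6, 15, 13, 0), timedelta(days=1)))
```

Because the partition is added only after validation, a crash before that
point leaves files that the final table never references, so the cleanup job
does not have to race with queries.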


On Wed, Jun 15, 2011 at 3:28 PM, Guy Bayes <fatal.er...@gmail.com> wrote:

> I think if you load a file, validate it, and then *alter table add
> partition* to the final table at the end, in the event of a crash you only
> have a partially loaded ETL file that no one will be querying anyway.
>
> That should work, though I am not speaking from personal experience, at
> least not with Hive.
> Guy
>
>
> On Wed, Jun 15, 2011 at 12:11 PM, W S Chung <qp.wsch...@gmail.com> wrote:
>
>> If the failure of the load is severe enough, such as the whole machine
>> crashing, there might not be an opportunity to catch the exception and
>> clean up the partition right away. The best I can think of is to clean up
>> the partition in a background job reasonably regularly. In that case, before
>> the cleanup, is there any way I can prevent queries from seeing the data in
>> the partition that should not be there?
>>
>> Or will this really happen? If the metadata is only updated after a
>> successful load, the partition may not exist unless the load runs to its
>> end.
>>
>>
>> On Tue, Jun 14, 2011 at 12:21 PM, Guy Bayes <fatal.er...@gmail.com> wrote:
>>
>>> The easiest way to achieve a level of robustness is probably to load into
>>> a partition and then truncate the partition in the event of failure.
>>>
>>> Cleaning up after an incomplete load is a problem in many traditional
>>> RDBMSs; you cannot always rely on rollback functionality.
>>>
>>> There are no explicit deletes in Hive though, so whatever you need to do
>>> to massage and clean the data file is best done prior to inserting it into
>>> its final destination.
>>>
>>> Many of the things you bring up are more ETL best practices than
>>> properties of an RDBMS implementation though.
>>>  Guy
>>>
>>>
>>> On Tue, Jun 14, 2011 at 8:57 AM, W S Chung <qp.wsch...@gmail.com> wrote:
>>>
>>>> My question is a "what if" question, not a production issue. It seems
>>>> natural, when replacing a traditional database with Hive, to ask
>>>> how much robustness is sacrificed for scalability. My concern is that if
>>>> a file is partially loaded, there might not be an easy way to clean up the
>>>> already loaded data before re-loading the data. The lack of a unique index
>>>> also makes it hard to avoid duplicate data, although duplicated data can
>>>> perhaps be deleted after the load.
>>>>
>>>>
>>>> On Mon, Jun 13, 2011 at 7:12 PM, Martin Konicek <
>>>> martin.koni...@gmail.com> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I think this is a problem with open source in general and sometimes it
>>>>> can be very frustrating.
>>>>> However, your question is more of a "what if" question - you're not in
>>>>> the middle of tracking down a horrible bug after deploying to production,
>>>>> am I right?
>>>>>
>>>>> Regarding your question, I would guess that if LOAD DATA INPATH crashes
>>>>> while moving files into the Hive warehouse, the data which was moved will
>>>>> appear as legitimate loaded data. Or the files will be moved but the
>>>>> metadata will not be updated. In any case, you should detect the crash and
>>>>> redo the operation. The easiest answer might actually be to look into the
>>>>> source code - sometimes it can be easier to find than one would expect.
>>>>>
>>>>> Not a complete answer, but hope this helps a bit.
>>>>>
>>>>> Martin
>>>>>
>>>>>
>>>>> On 14/06/2011 00:47, W S Chung wrote:
>>>>>
>>>>>> I submitted a question like this before, but somehow that question was
>>>>>> never delivered. I can even find my question in Google. Since I cannot
>>>>>> find any admin e-mail or feedback form on the Hive website where I could
>>>>>> ask why the last question was not delivered, there is not much option
>>>>>> other than to post the question again and hope that it gets through this
>>>>>> time. Sorry for the double posting if you have seen my last e-mail.
>>>>>>
>>>>>> What is the behaviour if a Hive client crashes in the middle of
>>>>>> running a "load data inpath" for either a local file or a file on HDFS?
>>>>>> Will the file be partially loaded into the db? Thanks.
>>>>>>
>>>>>>
>>>>>>
>>>>
>>>
>>
>
