An alternative:

1. insert into new_t select tuple from t where date != <the duplicated date>
2. insert into new_t select distinct tuple from t where date = <the duplicated date>
3. drop table t
4. rename new_t to t
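
Spelled out as SQL, it would be roughly the following (just a sketch; I'm assuming the date column is called event_date and that only the 2016-04-07 load was duplicated):

create table new_t (like t);
insert into new_t select * from t where event_date <> date '2016-04-07';
insert into new_t select distinct * from t where event_date = date '2016-04-07';
drop table t;
alter table new_t rename to t;

Note that select distinct only collapses rows that are identical in every column, so this works as long as the two loads wrote identical rows.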

Hadoop does not allow updates/deletes of rows, and it's highly unlikely that
all your duplicates are in one file.

2016-04-07 17:36 GMT+02:00 Vineet Goel <[email protected]>:

> You shouldn’t hack the HDFS files for HAWQ as they can go out of sync with
> the HAWQ master catalog. Since you don’t have partitions to drop, your only
> choice is to CTAS and filter out unwanted rows. I would also recommend you
> enhance your data-load logic to check for data integrity/duplicates before
> loading, either through temp data-load tables with additional SQL logic or
> using other ETL strategies.
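>
> For example (just a sketch with made-up names: a staging table stage_t and a
> business key column id), the temp data-load table pattern could look like:
>
> create temp table stage_t (like t);
> -- load today’s batch into stage_t first (external table, COPY, etc.),
> -- then only insert rows whose key is not already in t:
> insert into t
> select s.* from stage_t s left join t on t.id = s.id
> where t.id is null;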
>
> HDFS inherently isn’t suitable for all CRUD (create, read, update, delete)
> operations on data since it does not support in-place changes to files; it
> was designed more for CRAP data (create, replicate, append, purge), no pun
> intended ;-) . However, over time, you’ll see update and delete
> capabilities added to more and more SQL frameworks in Hadoop, including
> HAWQ, as that requires compaction and other design strategies and trade-offs.
>
> Thanks
> -Vineet Goel
>
>
>
> On Apr 7, 2016, at 7:22 AM, hawqstudy <[email protected]> wrote:
>
> Well, unfortunately no, it was a simple table without partitions.
> We noticed there are many directories in HDFS for the database. Is there
> any way we can figure out which directory holds the data we inserted today?
> If we simply remove those directories, can we resolve the problem?
>
> On Apr 7, 2016, at 9:23 PM, Ruilong Huo <[email protected]> wrote:
>
> Is the table partitioned based on date to store incremental data? If yes,
> you can apply the steps I mentioned to just the partition that is messed up.
>
> Best regards,
> Ruilong Huo
>
> On Thu, Apr 7, 2016 at 9:16 PM, hawqstudy <[email protected]> wrote:
>
>> Well, there were already tons of data in the table, like a billion records.
>> Today our guy tried to insert the daily incremental into the table, but they
>> messed something up and the insertion happened twice. Each incremental was
>> about several million rows. We hope there’s some way to revert what we did
>> today (there’s a date column for each record) or just delete those rows.
>> Cheers…
>>
>> On Apr 7, 2016, at 9:02 PM, Ruilong Huo <[email protected]> wrote:
>>
>> You mentioned that there are millions of rows. I guess that is not such a
>> big number that it would take forever :)
>> What's the estimated number of rows? What's the raw data size?
>>
>> Best regards,
>> Ruilong Huo
>>
>> On Thu, Apr 7, 2016 at 6:54 PM, hawqstudy <[email protected]> wrote:
>>
>>> Uhh… there is already a ton of data in the table, so it will take forever
>>> to do that… Are there any better ways?
>>> I just want to completely get rid of the last insert batch. Can I
>>> directly delete some subdirectories in HDFS in order to do that?
>>>
>>> On Apr 7, 2016, at 6:41 PM, Ruilong Huo <[email protected]> wrote:
>>>
>>> You may try the steps below to de-duplicate the data in table t:
>>> 1. insert into new_t select distinct tuple from t
>>> 2. drop table t
>>> 3. rename new_t to t
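>>>
>>> As concrete SQL that is roughly (a sketch; a CTAS folds table creation and
>>> step 1 together, though you may want to carry over the original table’s
>>> distribution and storage options explicitly):
>>>
>>> create table new_t as select distinct * from t;
>>> drop table t;
>>> alter table new_t rename to t;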
>>>
>>> Best regards,
>>> Ruilong Huo
>>>
>>> On Thu, Apr 7, 2016 at 6:36 PM, hawqstudy <[email protected]> wrote:
>>>
>>>> I inserted millions of rows into an existing table but found there’s
>>>> duplication. I would like to delete the rows based on the date column but
>>>> found there’s no delete command in HAWQ.
>>>>
>>>> Is there any way I can remove the records without rebuilding the table
>>>> from scratch?
>>>>
>>>
>>>
>>>
>>
>>
>
>
>
