You shouldn’t hack the HDFS files behind HAWQ tables, as they can go out of sync 
with the HAWQ master catalog. Since you don’t have partitions to drop, your only 
option is to CTAS (CREATE TABLE AS SELECT) a new table and filter out the 
unwanted rows. I would also recommend enhancing your data-load logic to check 
for data integrity and duplicates before loading, either through temporary 
data-load (staging) tables with additional SQL logic or through other ETL 
strategies.
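
For example, a rough sketch of the CTAS approach, assuming the table is called t 
and the duplicated batch can be identified by a load-date column (both names are 
placeholders for your actual schema):

    -- Copy everything except the duplicated batch into a new table.
    CREATE TABLE t_clean AS
      SELECT * FROM t
      WHERE load_date <> DATE '2016-04-07';  -- date of the bad load; adjust as needed

    -- Re-load today's batch once (e.g. from a staging table), then swap the tables.
    DROP TABLE t;
    ALTER TABLE t_clean RENAME TO t;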

HDFS inherently isn’t suitable for all CRUD (create, read, update, delete) 
operations on data, since it does not support in-place changes to files; it was 
designed more for CRAP data (create, replicate, append, purge), no pun intended 
;-) . However, over time you’ll see update and delete capabilities added to more 
and more SQL frameworks on Hadoop, including HAWQ, as supporting them requires 
compaction and other design strategies and trade-offs.

Thanks
-Vineet Goel



On Apr 7, 2016, at 7:22 AM, hawqstudy <[email protected]> wrote:

Well, unfortunately no, it is a simple table without partitions.
We noticed there are many directories in HDFS for the database. Is there any way 
we can figure out which directory holds the data we inserted today? If we simply 
remove those directories, can we resolve the problem?
> On Apr 7, 2016, at 9:23 PM, Ruilong Huo <[email protected]> wrote:
> 
> Is the table partitioned by date to store the incremental data? If yes, you 
> can try the steps I mentioned on just the partition that is messed up.
> 
> Best regards,
> Ruilong Huo
> 
> On Thu, Apr 7, 2016 at 9:16 PM, hawqstudy <[email protected]> wrote:
> Well, there is already a ton of data in the table, around a billion records. 
> Today our guy tried to insert the daily incremental data into the table, 
> messed something up, and the insertion happened twice. Each incremental batch 
> is about several million rows. We hope there’s some way to revert what we did 
> today (there’s a date column on each record) or just delete those rows.
> Cheers…
> 
>> On Apr 7, 2016, at 9:02 PM, Ruilong Huo <[email protected]> wrote:
>> 
>> You mentioned that there are millions of rows. I guess that is not a big 
>> number that would make it take forever :)
>> What's the estimated number of rows? What's the actual raw data size?
>> 
>> Best regards,
>> Ruilong Huo
>> 
>> On Thu, Apr 7, 2016 at 6:54 PM, hawqstudy <[email protected]> wrote:
>> Uhh… there is already a ton of data in the table; it will take forever to do 
>> that…
>> Any better ways?
>> I just want to completely get rid of the last insert batch. Can I directly 
>> delete some subdirectories in HDFS in order to do that?
>> 
>>> On Apr 7, 2016, at 6:41 PM, Ruilong Huo <[email protected]> wrote:
>>> 
>>> You may try the steps below to de-duplicate the data in table t (a sketch follows the list):
>>> 1. insert into new_t select distinct tuple from t
>>> 2. drop table t
>>> 3. rename new_t to t
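>>> 
>>> A minimal SQL sketch of those steps, assuming every column together 
>>> identifies a duplicate row:
>>> 
>>>     -- Build a de-duplicated copy, then swap it in for the original.
>>>     CREATE TABLE new_t AS SELECT DISTINCT * FROM t;
>>>     DROP TABLE t;
>>>     ALTER TABLE new_t RENAME TO t;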
>>> 
>>> Best regards,
>>> Ruilong Huo
>>> 
>>> On Thu, Apr 7, 2016 at 6:36 PM, hawqstudy <[email protected]> wrote:
>>> I inserted millions of rows into an existing table but found there’s 
>>> duplication. I would like to delete the rows based on the date column, but 
>>> there’s no DELETE command in HAWQ.
>>> 
>>> Is there any way I can remove the records without rebuilding the table from 
>>> scratch?
>>> 
>> 
>> 
> 
> 

