An alternative (a rough SQL sketch follows the steps):
1. insert into new_t select tuple from t where date != the duplicated date
2. insert into new_t select distinct tuple from t where date = the duplicated date
3. drop table t
4. rename new_t to t
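A minimal sketch of those four steps, assuming a hypothetical date column date_col and a duplicated load date of '2016-04-07' (neither name is from the original post), and assuming the duplicated rows are exact copies of each other:

-- create an empty copy of t's schema (repeat the DISTRIBUTED BY clause and
-- storage options here if the original table was created with them)
CREATE TABLE new_t (LIKE t);

-- step 1: keep every row outside the duplicated date as-is
INSERT INTO new_t SELECT * FROM t WHERE date_col <> DATE '2016-04-07';

-- step 2: keep a single copy of each row from the duplicated date
INSERT INTO new_t SELECT DISTINCT * FROM t WHERE date_col = DATE '2016-04-07';

-- steps 3 and 4: swap the tables
DROP TABLE t;
ALTER TABLE new_t RENAME TO t;

Compared with running SELECT DISTINCT over the whole table, this only de-duplicates the rows from the affected date, so the bulk of the data is copied straight through.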
Hadoop does not allow updates/deletes of rows, and it's highly unlikely that
all your duplicates are in one file.

2016-04-07 17:36 GMT+02:00 Vineet Goel <[email protected]>:

> You shouldn't hack the HDFS files for HAWQ as they can go out of sync with
> the HAWQ master catalog. Since you don't have partitions to drop, your only
> choice is to CTAS and filter out unwanted rows. I would also recommend you
> enhance your data-load logic to check for data integrity/duplicates before
> loading, either through temp data-load tables with additional SQL logic or
> using other ETL strategies.
>
> HDFS inherently isn't suitable for all CRUD (create, read, update, delete)
> operations on data since it does not support in-place changes to files; it
> was designed more for CRAP data (create, replicate, append, purge), no pun
> intended ;-) . However, over time, you'll see update and delete
> capabilities added to more and more SQL frameworks in Hadoop, including
> HAWQ, as that requires compaction and other design strategies and tradeoffs.
>
> Thanks
> -Vineet Goel
>
>
> On Apr 7, 2016, at 7:22 AM, hawqstudy <[email protected]> wrote:
>
> Well, unfortunately no, it was a simple table without partitions.
> We noticed there are many directories in HDFS for the database. Is there
> any way we can figure out which directory holds the data we inserted
> today? If we simply remove those directories, can we resolve the problem?
>
> On Apr 7, 2016, at 9:23 PM, Ruilong Huo <[email protected]> wrote:
>
> Is the table partitioned based on date to store incremental data? If yes,
> you can try the steps I mentioned only on the partition that is messed up.
>
> Best regards,
> Ruilong Huo
>
> On Thu, Apr 7, 2016 at 9:16 PM, hawqstudy <[email protected]> wrote:
>
>> Well, there were already tons of data in the table, like a billion
>> records. Today our guy tried to insert the daily incremental into the
>> table, messed something up, and made the insertion happen twice. Each
>> incremental was about several million rows. We hope there's some way to
>> revert what we did today (there's a date column for each record) or just
>> delete those rows.
>> Cheers…
>>
>> On Apr 7, 2016, at 9:02 PM, Ruilong Huo <[email protected]> wrote:
>>
>> You mentioned that there are millions of rows. I guess that is not a big
>> number that would make it take forever :)
>> What's the estimated number of rows? What's the raw data size?
>>
>> Best regards,
>> Ruilong Huo
>>
>> On Thu, Apr 7, 2016 at 6:54 PM, hawqstudy <[email protected]> wrote:
>>
>>> Uhh… there are already tons of data in the table, it will take forever
>>> to do that…
>>> Any better ways?
>>> I just want to completely get rid of the last insert batch. Can I
>>> directly delete some subdirectories in HDFS in order to do that?
>>>
>>> On Apr 7, 2016, at 6:41 PM, Ruilong Huo <[email protected]> wrote:
>>>
>>> You may try the steps below to de-duplicate the data in table t:
>>> 1. insert into new_t select distinct tuple from t
>>> 2. drop table t
>>> 3. rename new_t to t
>>>
>>> Best regards,
>>> Ruilong Huo
>>>
>>> On Thu, Apr 7, 2016 at 6:36 PM, hawqstudy <[email protected]> wrote:
>>>
>>>> I inserted millions of rows into an existing table but found there's
>>>> duplication. I would like to delete the rows based on the date column,
>>>> but found there's no DELETE command in HAWQ.
>>>>
>>>> Is there any way I can remove the records without rebuilding the table
>>>> from scratch?
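To illustrate the load-time duplicate check Vineet recommends above, here is a rough sketch using a temporary staging table. The names (stage_t, id, date_col) and the idea that (id, date_col) identifies a row are assumptions for illustration, not part of the original schema:

-- land the daily increment in a scratch table with t's layout
CREATE TEMPORARY TABLE stage_t (LIKE t);
-- ... load today's increment into stage_t (COPY, external table, etc.) ...

-- move over only rows that are not already present in t
INSERT INTO t
SELECT DISTINCT s.*
FROM stage_t s
WHERE NOT EXISTS (
    SELECT 1
    FROM t
    WHERE t.id = s.id
      AND t.date_col = s.date_col
);

DROP TABLE stage_t;

With a check like this, accidentally running the same load twice becomes harmless: rows already in t are filtered out instead of being inserted a second time.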
