Well, unfortunately no; it was a simple table without partitions.
We noticed there are many directories in HDFS for the database. Is there any 
way we can figure out which directory holds the data we inserted today? If we 
simply remove those directories, can we resolve the problem?
> On Apr 7, 2016, at 9:23 PM, Ruilong Huo <[email protected]> wrote:
> 
> Is the table partitioned based on date to store incremental data? If yes, you 
> can try the steps I mentioned only in the partition that is messed up.
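> 
> For a range-partitioned table, a minimal sketch of that (the names t, 
> fixed_part, the partition key dt, and the date value are all assumptions, 
> and the EXCHANGE PARTITION syntax should be verified against your HAWQ 
> version):
> 
>     -- Build a de-duplicated copy of just the bad partition's data.
>     CREATE TABLE fixed_part AS
>       SELECT DISTINCT * FROM t WHERE dt = DATE '2016-04-07';
>     -- Swap the clean copy in for the messed-up partition.
>     ALTER TABLE t EXCHANGE PARTITION FOR (DATE '2016-04-07')
>       WITH TABLE fixed_part;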
> 
> Best regards,
> Ruilong Huo
> 
> On Thu, Apr 7, 2016 at 9:16 PM, hawqstudy <[email protected]> wrote:
> Well, there were already tons of data in the table, like a billion records. 
> Today our guy tried to insert the daily increment into the table; they messed 
> something up and the insertion happened twice. Each increment was about 
> several million rows. We hope there's some way to revert what we did today 
> (there's a date column for each record) or just delete those rows.
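> For reference, something like the following rebuild is what we have in mind 
> (the names t, new_t, and the date column dt are placeholders here):
> 
>     -- Keep everything except today's double-inserted batch.
>     CREATE TABLE new_t AS SELECT * FROM t WHERE dt <> DATE '2016-04-07';
>     DROP TABLE t;
>     ALTER TABLE new_t RENAME TO t;
>     -- ...then re-run today's load exactly once.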
> Cheers…
> 
>> On Apr 7, 2016, at 9:02 PM, Ruilong Huo <[email protected]> wrote:
>> 
>> You mentioned that there are millions of rows. I guess that is not a big 
>> number that would make it take forever :)
>> What's the estimated number of rows? What's the raw data size?
>> 
>> Best regards,
>> Ruilong Huo
>> 
>> On Thu, Apr 7, 2016 at 6:54 PM, hawqstudy <[email protected]> wrote:
>> Uhh… there are already tons of data in the table; it will take forever to do 
>> that…
>> Any better ways?
>> I just want to completely get rid of the last inserted batch. Can I directly 
>> delete some subdirectories in HDFS in order to do that?
>> 
>>> On Apr 7, 2016, at 6:41 PM, Ruilong Huo <[email protected]> wrote:
>>> 
>>> You may try the steps below to de-duplicate the data in table t (a runnable 
>>> sketch follows the list):
>>> 1. insert into new_t select distinct tuple from t
>>> 2. drop table t
>>> 3. rename new_t to t
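>>> 
>>> In SQL, a minimal sketch of those steps (DISTINCT here compares whole rows, 
>>> so list specific columns if only some of them define a duplicate, and 
>>> recreate any storage options t was originally built with):
>>> 
>>>     -- 1. Build a de-duplicated copy of the table.
>>>     CREATE TABLE new_t AS SELECT DISTINCT * FROM t;
>>>     -- 2. Drop the original table containing the duplicates.
>>>     DROP TABLE t;
>>>     -- 3. Rename the copy back to the original name.
>>>     ALTER TABLE new_t RENAME TO t;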
>>> 
>>> Best regards,
>>> Ruilong Huo
>>> 
>>> On Thu, Apr 7, 2016 at 6:36 PM, hawqstudy <[email protected]> wrote:
>>> I inserted millions of rows into an existing table but found there's 
>>> duplication. I would like to delete the rows based on the date column, but 
>>> found there's no DELETE command in HAWQ.
>>> 
>>> Is there any way I can remove the records without rebuilding the table from 
>>> scratch?
>>> 
>> 
>> 
> 
> 
