Re: How to remove entire row at the server side?

Terry P. Wed, 06 Nov 2013 14:52:00 -0800

Hi Billie,
Many thanks for your help.  I added those two methods, but had to remove
the @Override as the RowFilter class I'm extending from doesn't implement
them.  Even with these methods in place, I still get the same error trying
to add the iterator in the shell.
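
A note on the likely cause, as a sketch rather than a confirmed diagnosis: the error says the class cannot be loaded "as type OptionDescriber", and having to drop @Override suggests the class declaration does not yet say "implements OptionDescriber". Adding methods with matching signatures is not enough in Java; the type has to be declared. The example below uses stand-in names (Describer, MethodsOnly, DeclaresInterface, loadsAs are all hypothetical, not Accumulo classes) to show the difference.

```java
import java.util.Map;

// Hypothetical stand-in for an interface such as OptionDescriber.
interface Describer {
    String describeOptions();
    boolean validateOptions(Map<String, String> options);
}

// Has the right method signatures but does NOT declare the interface,
// so @Override on these methods would not compile.
class MethodsOnly {
    public String describeOptions() { return "expTs"; }
    public boolean validateOptions(Map<String, String> options) { return true; }
}

// Same methods, but the class declares "implements Describer".
class DeclaresInterface implements Describer {
    @Override
    public String describeOptions() { return "expTs"; }

    @Override
    public boolean validateOptions(Map<String, String> options) { return true; }
}

class TypeCheckDemo {
    // Models a "load this class as type X" check: only the declared type counts.
    static boolean loadsAs(Class<?> candidate, Class<?> requiredType) {
        return requiredType.isAssignableFrom(candidate);
    }

    public static void main(String[] args) {
        System.out.println(loadsAs(MethodsOnly.class, Describer.class));       // prints false
        System.out.println(loadsAs(DeclaresInterface.class, Describer.class)); // prints true
    }
}
```

If that is what is happening here, changing the declaration to "extends RowFilter implements OptionDescriber" should let the @Override annotations go back in.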


I notice that the RowFilter class extends WrappingIterator, which also
doesn't appear to have the describeOptions and validateOptions methods ...
should I try extending from just the Filter class?  I didn't understand the
benefits William listed of extending from the RowFilter class.  I just know
that once I identify a RowKey should be purged based on its expTs ColFam
Value, I want to remove all entries for that RowKey.
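
To make the purge rule concrete, here is a minimal sketch of the per-row decision using only the JDK. It models a row as a list of {columnFamily, columnQualifier, value} arrays, which is an assumption standing in for Accumulo's Key/Value pairs; RowPurgeSketch, parseMillis and the sample timestamps used as "now" are hypothetical. It walks every entry in the row because expTs sorts after creTs and data, and it keeps rows whose expTs cannot be parsed, matching the posted code.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Arrays;
import java.util.List;

class RowPurgeSketch {
    static final String EXP_TS_COL_FAM = "expTs";
    static final String EXP_TS_DATE_FORMAT = "yyyyMMddHHmmssS";

    // Parses a timestamp in the table's format, using the JVM default time zone.
    static long parseMillis(String ts) {
        try {
            return new SimpleDateFormat(EXP_TS_DATE_FORMAT).parse(ts).getTime();
        } catch (ParseException e) {
            throw new IllegalArgumentException("Unparseable timestamp: " + ts, e);
        }
    }

    // Returns false when the row should be dropped, true when it should be kept.
    static boolean acceptRow(List<String[]> rowEntries, long now) {
        for (String[] entry : rowEntries) {
            if (!EXP_TS_COL_FAM.equals(entry[0])) {
                continue; // not the expiration column, keep looking
            }
            try {
                if (parseMillis(entry[2]) < now) {
                    return false; // expired: reject the whole row
                }
            } catch (IllegalArgumentException e) {
                // Assumption: a malformed expTs value keeps the row.
            }
        }
        return true; // no expTs entry, or not yet expired
    }

    public static void main(String[] args) {
        List<String[]> row = Arrays.asList(
            new String[] {"creTs", "", "20130804171412445"},
            new String[] {"data", "title", "My fantastic data"},
            new String[] {"expTs", "", "20131104171412445"});
        System.out.println(acceptRow(row, parseMillis("20131101000000000"))); // prints true
        System.out.println(acceptRow(row, parseMillis("20131201000000000"))); // prints false
    }
}
```

Passing "now" in as a parameter keeps the check easy to test; in the real iterator it would come from the value captured in init.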


On Wed, Nov 6, 2013 at 3:29 PM, Billie Rinaldi <[email protected]> wrote:

> To use setiter in the shell, your iterator must implement
> OptionDescriber.  It has two methods, and something like the following
> should work for your iterator.  If you implement passing options to the
> iterator, you'll want to change the null parameters to the constructor of
> IteratorOptions below, and probably also to do some validation in
> validateOptions.
>
>   @Override
>   public IteratorOptions describeOptions() {
>     return new IteratorOptions("expTs",
>         "Removes rows based on the column designated as the expiration timestamp column family",
>         null, null);
>   }
>
>   @Override
>   public boolean validateOptions(Map<String,String> options) {
>     return true;
>   }
>
>
>
> On Wed, Nov 6, 2013 at 12:49 PM, Terry P. <[email protected]> wrote:
>
>> Eyes of an eagle Billie!  com is correct, but after viewing
>> "org.apache.accumulo" so many times, my brain was stuck on org and I goofed
>> in my setiter syntax.
>>
>> With THAT corrected, here is the new error:
>>
>> root@meta> setiter -class
>> com.esa.accumulo.iterators.ExpirationTimestampPurgeFilter -n expTsFilter -p
>> 20 -scan -t itertest
>> 2013-11-06 14:46:28,280 [shell.Shell] ERROR:
>> org.apache.accumulo.core.util.shell.ShellCommandException: Command could
>> not be initialized (Unable to load
>> com.esa.accumulo.iterators.ExpirationTimestampPurgeFilter as type
>> org.apache.accumulo.core.iterators.OptionDescriber; configure with 'config'
>> instead)
>>
>>
>>
>>
>>
>> On Wed, Nov 6, 2013 at 2:43 PM, Billie Rinaldi <[email protected]> wrote:
>>
>>> Is there a typo in the package name?  One place says "com" and the other
>>> "org".
>>>
>>>
>>> On Wed, Nov 6, 2013 at 12:37 PM, Terry P. <[email protected]> wrote:
>>>
>>>> Hi William, many thanks for the explanation of scan time versus
>>>> compaction time. I'll look through the classes again and note where the
>>>> remove versus suppress wordings are used and open a ticket.
>>>>
>>>> As mentioned, I only dabble in Java, but regardless of that fact at
>>>> this point I'm the one that has to get this done. I've cobbled together my
>>>> first attempt, but I get the following error when I try to add it as a
>>>> scan iterator for testing:
>>>>
>>>> root@meta> setiter -class
>>>> org.esa.accumulo.iterators.ExpirationTimestampPurgeFilter -n expTsFilter -p
>>>> 20 -scan -t itertest
>>>> 2013-11-06 14:06:34,914 [shell.Shell] ERROR:
>>>> org.apache.accumulo.core.util.shell.ShellCommandException: Command could
>>>> not be initialized (Servers are unable to load
>>>> org.esa.accumulo.iterators.ExpirationTimestampPurgeFilter as type
>>>> org.apache.accumulo.core.iterators.SortedKeyValueIterator)
>>>>
>>>> Here's my source.  Note that the value stored in the expTs ColFam is in
>>>> the format "yyyyMMddHHmmssS", which I convert to a long for a direct
>>>> comparison to System.currentTimeMillis(). I only overrode the init and
>>>> acceptRow methods, hoping the others would work as-is from the base class.
>>>>
>>>> One clarification: turns out expTs is the ColumnFamily, and the ingest
>>>> app does not assign a ColumnQualifier for expTs. So to amend my prior table
>>>> layout (including the datetime format):
>>>>
>>>>
>>>> Format: Key:CF:CQ:Value
>>>> abc:data:title:"My fantastic data"
>>>> abc:data:content:<bytedata>
>>>> abc:creTs::20130804171412445
>>>> abc:*expTs*::20131104171412445
>>>> ... 6-8 more columns of data per row ...
>>>>
>>>> where *expTs* is the ColumnFamily to determine if the entire row
>>>> should be removed based on whether its value is <= NOW.  If a row has not
>>>> yet been assigned an expiration date, expTs will not be set and the
>>>> ColumnFamily will not yet be present.  Seems like an odd choice to use
>>>> distinct Column Families, without Column Qualifiers, but that's how the
>>>> ingest app was done.
>>>>
>>>> I greatly appreciate any advice you can provide.
>>>>
>>>> package com.esa.accumulo.iterators;
>>>>
>>>> import java.io.IOException;
>>>> import java.text.ParseException;
>>>> import java.text.SimpleDateFormat;
>>>> import java.util.Date;
>>>> import java.util.Map;
>>>>
>>>> import org.apache.accumulo.core.data.Key;
>>>> import org.apache.accumulo.core.data.Value;
>>>> import org.apache.accumulo.core.iterators.IteratorEnvironment;
>>>> import org.apache.accumulo.core.iterators.SortedKeyValueIterator;
>>>> import org.apache.accumulo.core.iterators.user.RowFilter;
>>>>
>>>> /**
>>>>  * A filter that removes rows based on the column designated as the
>>>>  * "expiration timestamp" column family.
>>>>  *
>>>>  * It removes the row if the value in the expirationTimestamp column is
>>>>  * less than currentTime.
>>>>  *
>>>>  * TODO: The designation of the expirationTimestamp ColumnFamily and its
>>>>  * DateFormat is set in the iterator options when the iterator is applied
>>>>  * to the table. (For now it is hardcoded to match the format used in the
>>>>  * Solr-Accumulo plugin)
>>>>  */
>>>> public class ExpirationTimestampPurgeFilter extends RowFilter {
>>>>   private long currentTime;
>>>>   // TODO: make accumuloDateFormat settable via Iterator Options
>>>>   // Date Format for Expiration Timestamp ColumnFamily stored in Accumulo
>>>>   private String expTsDateFormat = "yyyyMMddHHmmssS";
>>>>   SimpleDateFormat df = new SimpleDateFormat(expTsDateFormat);
>>>>
>>>>   // TODO: make expTs settable via Iterator Options
>>>>   // ColumnFamily containing Expiration Timestamp value (note ingest app
>>>>   // did NOT assign a ColumnQualifier, only a ColumnFamily)
>>>>   private String expTsColFam = "expTs";
>>>>
>>>>   @Override
>>>>   public boolean acceptRow(SortedKeyValueIterator<Key, Value> rowIterator)
>>>>       throws IOException {
>>>>     // Walk every entry in the row; expTs is not necessarily the first column.
>>>>     while (rowIterator.hasTop()) {
>>>>       if (rowIterator.getTopKey().getColumnFamily().toString().equals(expTsColFam)) {
>>>>         try {
>>>>           Date expTsDate = df.parse(rowIterator.getTopValue().toString());
>>>>           // Reject the whole row once its expiration time has passed.
>>>>           if (expTsDate.getTime() < currentTime)
>>>>             return false;
>>>>         } catch (ParseException e) {
>>>>           // TODO Auto-generated catch block
>>>>           e.printStackTrace();
>>>>         }
>>>>       }
>>>>       rowIterator.next();
>>>>     }
>>>>     return true;
>>>>   }
>>>>
>>>>   @Override
>>>>   public void init(SortedKeyValueIterator<Key, Value> source,
>>>>       Map<String, String> options, IteratorEnvironment env) throws IOException {
>>>>     super.init(source, options, env);
>>>>     currentTime = System.currentTimeMillis();
>>>>   }
>>>>
>>>> }
>>>>
>>>>
>>>>
>>>> On Tue, Nov 5, 2013 at 8:48 PM, William Slacum <[email protected]> wrote:
>>>>
>>>>> If an iterator is only set at scan time, then its logic will only be
>>>>> applied when a client scans the table. The data will persist through major
>>>>> and minor compaction and be visible if you scanned the RFile(s) backing
>>>>> the table. "Suppress" is the better word in this case. Would you please
>>>>> open a ticket pointing us where to update the documentation?
>>>>>
>>>>> It looks like you'd want to implement a RowFilter for your use case.
>>>>> It has the necessary hooks to avoid reading a whole row into memory and
>>>>> handling the logic of determining whether or not to write keys that occur
>>>>> before the column you're filtering on (at the cost of reading those keys
>>>>> twice).
>>>>>
>>>>>
>>>>> On Tue, Nov 5, 2013 at 6:20 PM, Terry P. <[email protected]> wrote:
>>>>>
>>>>>> Greetings everyone,
>>>>>> I'm looking at the AgeOffFilter as a base from which to write a
>>>>>> server-side filter / iterator to purge rows when they have aged off based
>>>>>> on the value of a specific column in the row (expiry datetime <= now). So
>>>>>> this differs from the AgeOffFilter in that the criterion for removal is
>>>>>> from the same column in every row (not the Accumulo timestamp for an
>>>>>> individual entry), and we need to remove the entire row, not just
>>>>>> individual entries. For example:
>>>>>>
>>>>>> Format: Key:CF:CQ:Value
>>>>>> abc:data:title:"My fantastic data"
>>>>>> abc:data:content:<bytedata>
>>>>>> abc:data:creTs:2013-08-04T17:14:12Z
>>>>>> abc:data:*expTs*:2013-11-04T17:14:12Z
>>>>>> ... 6-8 more columns of data per row ...
>>>>>>
>>>>>> where *expTs* is the column to determine if the entire row should be
>>>>>> removed based on whether its value is <= NOW.
>>>>>>
>>>>>> This task seemed easy enough as a client program (and it is really),
>>>>>> but a server-side iterator would be far more efficient than sending
>>>>>> millions of rowkeys across the network just to delete them (we'll be
>>>>>> deleting more than a million every hour).  But I'm struggling to get
>>>>>> there.
>>>>>>
>>>>>> In looking at AgeOffFilter.java, is the "magic" in the AgeOffFilter
>>>>>> class that removes (deletes) an entry from a table the fact that the
>>>>>> accept method returns false, combined with the fact that the iterator
>>>>>> would be set to run at -majc or -minc time and it is the compaction code
>>>>>> that actually deletes the entry?  If set to run only at scan time, would
>>>>>> AgeOffFilter simply not return the rows during the scan, but not delete
>>>>>> them?  The wording in the iterator classes varies, some saying "remove"
>>>>>> and others "suppress", so it's not clear to me.
>>>>>>
>>>>>> If that's the case, then I think I know where to implement the logic.
>>>>>> The question is, how can I remove all the entries for the row once the
>>>>>> accept method has determined it meets the criteria?
>>>>>>
>>>>>> Or as Mike Drob mentioned in a prior post, will basing my class on
>>>>>> the RowFilter class instead of just Filter make things easier?  Or the
>>>>>> WholeRowIterator?  Just trying to find the simplest solution.
>>>>>>
>>>>>> Sorry for what may be obvious questions but I'm more of a DB
>>>>>> Architect that does some coding, and not a Java programmer by trade. With
>>>>>> all of the amazing things Accumulo does, honestly I was surprised when I
>>>>>> couldn't find a way to delete rows in the shell by criteria other than
>>>>>> the rowkey!  I'm more used to having a shell to 'delete from *table*
>>>>>> where *column* <= *value*'.
>>>>>>
>>>>>> But looking at it now, everyone's criteria for deletion will likely
>>>>>> be different given the flexibility of a key=>value store.  If our rowkey
>>>>>> had the date/timestamp as a prefix, I know an easy deletemany command in
>>>>>> the shell would do the trick -- but the nature of the data is such that
>>>>>> initially no expiration timestamp is set, and there is no means to update
>>>>>> the key from the client app when expiration timestamp finally gets set
>>>>>> (too much rework on that common tool I'm afraid).
>>>>>>
>>>>>> Thanks in advance.
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>
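
Following up on the suggestion in this thread to do some validation in validateOptions once the column family and date format become iterator options: the sketch below shows one way that check could look using only the JDK. The option names expTsColFam and expTsDateFormat, and the class name OptionValidationSketch, are hypothetical; nothing in the thread fixes them. An absent option is treated as "use the hardcoded default", which is also an assumption.

```java
import java.text.SimpleDateFormat;
import java.util.HashMap;
import java.util.Map;

class OptionValidationSketch {
    // Hypothetical option names; pick whatever the iterator actually documents.
    static final String COL_FAM_OPT = "expTsColFam";
    static final String DATE_FORMAT_OPT = "expTsDateFormat";

    // Returns true only if every supplied option is usable.
    static boolean validateOptions(Map<String, String> options) {
        String colFam = options.get(COL_FAM_OPT);
        if (colFam != null && colFam.isEmpty()) {
            return false; // an empty column family name cannot be matched
        }
        String pattern = options.get(DATE_FORMAT_OPT);
        if (pattern != null) {
            try {
                // The constructor rejects illegal pattern letters.
                new SimpleDateFormat(pattern);
            } catch (IllegalArgumentException e) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        Map<String, String> good = new HashMap<>();
        good.put(COL_FAM_OPT, "expTs");
        good.put(DATE_FORMAT_OPT, "yyyyMMddHHmmssS");
        System.out.println(validateOptions(good)); // prints true

        Map<String, String> bad = new HashMap<>();
        bad.put(DATE_FORMAT_OPT, "bogus"); // 'b' is not a SimpleDateFormat letter
        System.out.println(validateOptions(bad)); // prints false
    }
}
```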
