Re: How to remove entire row at the server side?

David Medinets Wed, 06 Nov 2013 18:06:54 -0800

Just in case you didn't know, there is a 'classpath' command in the Accumulo
shell which should list your custom jar. It's handy to verify that it was
loaded. I think there might also be a log entry if you have access to the logs.
I've also found it useful to run 'jar tf <filename>' on the Accumulo nodes
to verify the jar file contents. Sometimes I've deployed the wrong version
of a jar file.
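
A minimal sketch of the same jar-contents check done from Java instead of
'jar tf', assuming only the JDK's java.util.jar.JarFile API. The class name
JarEntryCheck and its two helper methods are hypothetical, made up for
illustration; the jar path is whatever file was actually deployed.

```java
import java.io.IOException;
import java.util.jar.JarFile;

public class JarEntryCheck {
  /** Converts a fully qualified class name to its entry path inside a jar. */
  public static String entryName(String className) {
    return className.replace('.', '/') + ".class";
  }

  /** Returns true if the jar at jarPath contains the compiled class. */
  public static boolean containsClass(String jarPath, String className)
      throws IOException {
    try (JarFile jar = new JarFile(jarPath)) {
      return jar.getEntry(entryName(className)) != null;
    }
  }

  public static void main(String[] args) throws IOException {
    if (args.length < 2) {
      System.out.println("usage: JarEntryCheck <jar-path> <class-name>");
      return;
    }
    // args[0]: path to the jar; args[1]: fully qualified class name
    System.out.println(containsClass(args[0], args[1]));
  }
}
```

Running it on each node against the deployed jar and the iterator's class
name would show whether the expected package path made it into that jar.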



On Wed, Nov 6, 2013 at 7:56 PM, Billie Rinaldi <[email protected]> wrote:

> Making your class "extends RowFilter implements OptionDescriber" should be
> fine.  One reason it might have been complaining about the @Override
> annotations is if the Java compiler is set to 1.5 compatibility rather than
> 1.6.
>
> Regarding getting the same error, did you replace all the jars containing
> your iterator on all the nodes?  If you did, perhaps it's not reloading the
> jars properly.  You could restart accumulo to make sure it's using the
> fresh jar, or you could try renaming your class and dropping it in with a
> different jar name to ensure the new code is being picked up.
>
>
> On Wed, Nov 6, 2013 at 2:50 PM, Terry P. <[email protected]> wrote:
>
>> Hi Billie,
>> Many thanks for your help.  I added those two methods, but had to remove
>> the @Override as the RowFilter class I'm extending from doesn't implement
>> them.  Even with these methods in place, I still get the same error trying
>> to add the iterator in the shell.
>>
>> I notice that the RowFilter class extends WrappingIterator, which also
>> doesn't appear to have the describeOptions and validateOptions methods ...
>> should I try extending from just the Filter class?  I didn't understand the
>> benefits William listed of extending from the RowFilter class.  I just know
>> that once I identify a RowKey should be purged based on its expTs ColFam
>> Value, I want to remove all entries for that RowKey.
>>
>>
>> On Wed, Nov 6, 2013 at 3:29 PM, Billie Rinaldi <[email protected]> wrote:
>>
>>> To use setiter in the shell, your iterator must implement
>>> OptionDescriber.  It has two methods, and something like the following
>>> should work for your iterator.  If you implement passing options to the
>>> iterator, you'll want to change the null parameters to the constructor of
>>> IteratorOptions below, and probably also to do some validation in
>>> validateOptions.
>>>
>>>   @Override
>>>   public IteratorOptions describeOptions() {
>>>     return new IteratorOptions("expTs",
>>>         "Removes rows based on the column designated as the expiration timestamp column family",
>>>         null, null);
>>>   }
>>>
>>>   @Override
>>>   public boolean validateOptions(Map<String,String> options) {
>>>     return true;
>>>   }
>>>
>>>
>>>
>>> On Wed, Nov 6, 2013 at 12:49 PM, Terry P. <[email protected]> wrote:
>>>
>>>> Eyes of an eagle Billie!  com is correct, but after viewing
>>>> "org.apache.accumulo" so many times, my brain was stuck on org and I goofed
>>>> in my setiter syntax.
>>>>
>>>> With THAT corrected, here is the new error:
>>>>
>>>> root@meta> setiter -class
>>>> com.esa.accumulo.iterators.ExpirationTimestampPurgeFilter -n expTsFilter -p
>>>> 20 -scan -t itertest
>>>> 2013-11-06 14:46:28,280 [shell.Shell] ERROR:
>>>> org.apache.accumulo.core.util.shell.ShellCommandException: Command could
>>>> not be initialized (Unable to load
>>>> com.esa.accumulo.iterators.ExpirationTimestampPurgeFilter as type
>>>> org.apache.accumulo.core.iterators.OptionDescriber; configure with 'config'
>>>> instead)
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On Wed, Nov 6, 2013 at 2:43 PM, Billie Rinaldi <
>>>> [email protected]> wrote:
>>>>
>>>>> Is there a typo in the package name?  One place says "com" and the
>>>>> other "org".
>>>>>
>>>>>
>>>>> On Wed, Nov 6, 2013 at 12:37 PM, Terry P. <[email protected]> wrote:
>>>>>
>>>>>> Hi William, many thanks for the explanation of scan time versus
>>>>>> compaction time. I'll look through the classes again and note where the
>>>>>> remove versus suppress wordings are used and open a ticket.
>>>>>>
>>>>>> As mentioned, I only dabble in Java, but regardless of that fact, at
>>>>>> this point I'm the one that has to get this done. I've cobbled together
>>>>>> my first attempt, but I get the following error when I try to add it as
>>>>>> a scan iterator for testing:
>>>>>>
>>>>>> root@meta> setiter -class
>>>>>> org.esa.accumulo.iterators.ExpirationTimestampPurgeFilter -n expTsFilter
>>>>>> -p 20 -scan -t itertest
>>>>>> 2013-11-06 14:06:34,914 [shell.Shell] ERROR:
>>>>>> org.apache.accumulo.core.util.shell.ShellCommandException: Command could
>>>>>> not be initialized (Servers are unable to load
>>>>>> org.esa.accumulo.iterators.ExpirationTimestampPurgeFilter as type
>>>>>> org.apache.accumulo.core.iterators.SortedKeyValueIterator)
>>>>>>
>>>>>> Here's my source.  Note that the value stored in the expTs ColFam is
>>>>>> in the format "yyyyMMddHHmmssS", which I convert to a long for a direct
>>>>>> comparison to System.currentTimeMillis(). I only overrode the init and
>>>>>> acceptRow methods, hoping the others would work as-is from the base 
>>>>>> class.
>>>>>>
>>>>>> One clarification: turns out expTs is the ColumnFamily, and the
>>>>>> ingest app does not assign a ColumnQualifier for expTs. So to amend my
>>>>>> prior table layout (including the datetime format):
>>>>>>
>>>>>>
>>>>>> Format: Key:CF:CQ:Value
>>>>>> abc:data:title:"My fantastic data"
>>>>>> abc:data:content:<bytedata>
>>>>>> abc:creTs::20130804171412445
>>>>>> abc:*expTs*::20131104171412445
>>>>>> ... 6-8 more columns of data per row ...
>>>>>>
>>>>>> where *expTs* is the ColumnFamily to determine if the entire row
>>>>>> should be removed based on whether its value is <= NOW.  If a row has not
>>>>>> yet been assigned an expiration date, expTs will not be set and the
>>>>>> ColumnFamily will not yet be present.  Seems like an odd choice to use
>>>>>> distinct Column Families, without Column Qualifiers, but that's how the
>>>>>> ingest app was done.
>>>>>>
>>>>>> I greatly appreciate any advice you can provide.
>>>>>>
>>>>>> package com.esa.accumulo.iterators;
>>>>>>
>>>>>> import java.io.IOException;
>>>>>> import java.text.ParseException;
>>>>>> import java.text.SimpleDateFormat;
>>>>>> import java.util.Date;
>>>>>> import java.util.Map;
>>>>>>
>>>>>> import org.apache.accumulo.core.data.Key;
>>>>>> import org.apache.accumulo.core.data.Value;
>>>>>> import org.apache.accumulo.core.iterators.IteratorEnvironment;
>>>>>> import org.apache.accumulo.core.iterators.SortedKeyValueIterator;
>>>>>> import org.apache.accumulo.core.iterators.user.RowFilter;
>>>>>>
>>>>>> /**
>>>>>>  * A filter that removes rows based on the column designated as the
>>>>>>  * "expiration timestamp" column family.
>>>>>>  *
>>>>>>  * It removes the row if the value in the expirationTimestamp column
>>>>>>  * is less than currentTime.
>>>>>>  *
>>>>>>  * TODO: The designation of the expirationTimestamp ColumnFamily and
>>>>>>  * its DateFormat is set in the iterator options when the iterator is
>>>>>>  * applied to the table. (For now it is hardcoded to match the format
>>>>>>  * used in the Solr-Accumulo plugin)
>>>>>>  */
>>>>>> public class ExpirationTimestampPurgeFilter extends RowFilter {
>>>>>>   private long currentTime;
>>>>>>   // TODO: make accumuloDateFormat settable via Iterator Options
>>>>>>   // Date Format for Expiration Timestamp ColumnFamily stored in Accumulo
>>>>>>   private String expTsDateFormat = "yyyyMMddHHmmssS";
>>>>>>   SimpleDateFormat df = new SimpleDateFormat(expTsDateFormat);
>>>>>>
>>>>>>   // TODO: make expTs settable via Iterator Options
>>>>>>   // ColumnFamily containing Expiration Timestamp value (note ingest app
>>>>>>   // did NOT assign a ColumnQualifier, only a ColumnFamily)
>>>>>>   private String expTsColFam = "expTs";
>>>>>>
>>>>>>   @Override
>>>>>>   public boolean acceptRow(SortedKeyValueIterator<Key, Value> rowIterator)
>>>>>>       throws IOException {
>>>>>>     // Walk every entry in the row: expTs is not necessarily the first
>>>>>>     // key, so checking only the initial top key would miss it.
>>>>>>     while (rowIterator.hasTop()) {
>>>>>>       String colFam = rowIterator.getTopKey().getColumnFamily().toString();
>>>>>>       if (colFam.equals(expTsColFam)) {
>>>>>>         try {
>>>>>>           Date expTsDate = df.parse(rowIterator.getTopValue().toString());
>>>>>>           if (expTsDate.getTime() < currentTime)
>>>>>>             return false;
>>>>>>         } catch (ParseException e) {
>>>>>>           // Unparseable expiration value: log it and keep the row.
>>>>>>           e.printStackTrace();
>>>>>>         }
>>>>>>       }
>>>>>>       rowIterator.next();
>>>>>>     }
>>>>>>     return true;
>>>>>>   }
>>>>>>
>>>>>>   @Override
>>>>>>   public void init(SortedKeyValueIterator<Key, Value> source,
>>>>>>       Map<String, String> options, IteratorEnvironment env)
>>>>>>       throws IOException {
>>>>>>     super.init(source, options, env);
>>>>>>     currentTime = System.currentTimeMillis();
>>>>>>   }
>>>>>>
>>>>>> }
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Tue, Nov 5, 2013 at 8:48 PM, William Slacum <
>>>>>> [email protected]> wrote:
>>>>>>
>>>>>>> If an iterator is only set at scan time, then its logic will only be
>>>>>>> applied when a client scans the table. The data will persist through
>>>>>>> major and minor compaction and be visible if you scanned the RFile(s)
>>>>>>> backing the table. "Suppress" is the better word in this case. Would you
>>>>>>> please open a ticket pointing us where to update the documentation?
>>>>>>>
>>>>>>> It looks like you'd want to implement a RowFilter for your use case.
>>>>>>> It has the necessary hooks to avoid reading a whole row into memory, and
>>>>>>> it handles the logic of determining whether or not to write keys that
>>>>>>> occur before the column you're filtering on (at the cost of reading
>>>>>>> those keys twice).
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Nov 5, 2013 at 6:20 PM, Terry P. <[email protected]> wrote:
>>>>>>>
>>>>>>>> Greetings everyone,
>>>>>>>> I'm looking at the AgeOffFilter as a base from which to write a
>>>>>>>> server-side filter / iterator to purge rows when they have aged off
>>>>>>>> based on the value of a specific column in the row (expiry datetime <=
>>>>>>>> now). So this differs from the AgeOffFilter in that the criterion for
>>>>>>>> removal is from the same column in every row (not the Accumulo timestamp
>>>>>>>> for an individual entry), and we need to remove the entire row, not just
>>>>>>>> individual entries. For example:
>>>>>>>>
>>>>>>>> Format: Key:CF:CQ:Value
>>>>>>>> abc:data:title:"My fantastic data"
>>>>>>>> abc:data:content:<bytedata>
>>>>>>>> abc:data:creTs:2013-08-04T17:14:12Z
>>>>>>>> abc:data:*expTs*:2013-11-04T17:14:12Z
>>>>>>>> ... 6-8 more columns of data per row ...
>>>>>>>>
>>>>>>>> where *expTs* is the column to determine if the entire row should
>>>>>>>> be removed based on whether its value is <= NOW.
>>>>>>>>
>>>>>>>> This task seemed easy enough as a client program (and it is
>>>>>>>> really), but a server-side iterator would be far more efficient than
>>>>>>>> sending millions of rowkeys across the network just to delete them
>>>>>>>> (we'll be deleting more than a million every hour).  But I'm struggling
>>>>>>>> to get there.
>>>>>>>>
>>>>>>>> In looking at AgeOffFilter.java, is the "magic" in the AgeOffFilter
>>>>>>>> class that removes (deletes) an entry from a table the fact that the
>>>>>>>> accept method returns false, combined with the fact that the iterator
>>>>>>>> would be set to run at -majc or -minc time and it is the compaction code
>>>>>>>> that actually deletes the entry?  If set to run only at scan time, would
>>>>>>>> AgeOffFilter simply not return the rows during the scan, but not delete
>>>>>>>> them?  The wording in the iterator classes varies, some saying "remove",
>>>>>>>> others saying "suppress", so it's not clear to me.
>>>>>>>>
>>>>>>>> If that's the case, then I think I know where to implement the
>>>>>>>> logic. The question is, how can I remove all the entries for the row
>>>>>>>> once the accept method has determined it meets the criteria?
>>>>>>>>
>>>>>>>> Or as Mike Drob mentioned in a prior post, will basing my class on
>>>>>>>> the RowFilter class instead of just Filter make things easier?  Or the
>>>>>>>> WholeRowIterator?  Just trying to find the simplest solution.
>>>>>>>>
>>>>>>>> Sorry for what may be obvious questions but I'm more of a DB
>>>>>>>> Architect that does some coding, and not a Java programmer by trade.
>>>>>>>> With all of the amazing things Accumulo does, honestly I was surprised
>>>>>>>> when I couldn't find a way to delete rows in the shell by criteria other
>>>>>>>> than the rowkey!  I'm more used to having a shell to 'delete from
>>>>>>>> *table* where *column* <= *value*'.
>>>>>>>>
>>>>>>>> But looking at it now, everyone's criteria for deletion will likely
>>>>>>>> be different given the flexibility of a key=>value store.  If our
>>>>>>>> rowkey had the date/timestamp as a prefix, I know an easy deletemany
>>>>>>>> command in the shell would do the trick -- but the nature of the data is
>>>>>>>> such that initially no expiration timestamp is set, and there is no
>>>>>>>> means to update the key from the client app when the expiration
>>>>>>>> timestamp finally gets set (too much rework on that common tool I'm
>>>>>>>> afraid).
>>>>>>>>
>>>>>>>> Thanks in advance.
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
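
For reference, the row-level decision discussed in this thread (walk every
entry of a row, find the expTs column family, and drop the whole row once its
value is <= now) can be sketched in plain Java with no Accumulo classes on the
classpath. Everything structural here is an assumption for illustration: the
RowExpirySketch and Entry names are hypothetical stand-ins, a List models the
row rather than a SortedKeyValueIterator, and rows with a missing or
unparseable expTs are kept. Only the date pattern and the "expTs" name are
taken from the source posted above.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Arrays;
import java.util.List;

public class RowExpirySketch {
  /** One entry of a row: column family plus value, both as strings. */
  public static final class Entry {
    final String columnFamily;
    final String value;

    public Entry(String columnFamily, String value) {
      this.columnFamily = columnFamily;
      this.value = value;
    }
  }

  static final String EXP_TS_COL_FAM = "expTs";
  static final String EXP_TS_FORMAT = "yyyyMMddHHmmssS";

  /**
   * Returns false (drop the whole row) if an expTs entry holds a timestamp
   * at or before nowMillis; returns true (keep the row) otherwise.
   */
  public static boolean acceptRow(List<Entry> row, long nowMillis) {
    SimpleDateFormat df = new SimpleDateFormat(EXP_TS_FORMAT);
    for (Entry e : row) {
      if (!EXP_TS_COL_FAM.equals(e.columnFamily)) {
        continue; // not the expiration column, keep scanning the row
      }
      try {
        if (df.parse(e.value).getTime() <= nowMillis) {
          return false; // expired: reject every entry of this row
        }
      } catch (ParseException ex) {
        return true; // assumption: unparseable expiry means keep the row
      }
    }
    return true; // no expTs set yet, so the row has not expired
  }

  public static void main(String[] args) {
    List<Entry> row = Arrays.asList(
        new Entry("creTs", "20130804171412445"),
        new Entry("data", "My fantastic data"),
        new Entry("expTs", "20131104171412445"));
    System.out.println(acceptRow(row, System.currentTimeMillis()));
  }
}
```

The point of the loop is that the decision depends on an entry that may sit
anywhere in the row, so the first entry alone is not enough to decide.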
