Re: How to remove entire row at the server side?

Terry P. Wed, 06 Nov 2013 18:31:53 -0800

Thanks David, good to know.  After adding the implements OptionDescriber
the setiter command worked and it shows up right at the top.



On Wed, Nov 6, 2013 at 8:06 PM, David Medinets <[email protected]>wrote:

> Just in case you didn't know there is a 'classpath' command in the
> Accumulo shell which should list your custom jar. It's handy to verify that
> it was loaded. I think there might also be a log entry if you have access
> to them. I've also found it useful to use 'jar tf <filename>' on the
> Accumulo nodes to verify the jar file contents. Sometimes I've deployed the
> wrong version of a jar file.
>
>
> On Wed, Nov 6, 2013 at 7:56 PM, Billie Rinaldi 
> <[email protected]>wrote:
>
>> Making your class "extends RowFilter implements OptionDescriber" should
>> be fine.  One reason it might have been complaining about the @Override
>> annotations is if the Java compiler is set to 1.5 compatibility rather than
>> 1.6.
>>
>> Regarding getting the same error, did you replace all the jars containing
>> your iterator on all the nodes?  If you did, perhaps it's not reloading the
>> jars properly.  You could restart accumulo to make sure it's using the
>> fresh jar, or you could try renaming your class and dropping it in with a
>> different jar name to ensure the new code is being picked up.
>>
>>
>> On Wed, Nov 6, 2013 at 2:50 PM, Terry P. <[email protected]> wrote:
>>
>>> Hi Billie,
>>> Many thanks for your help.  I added those two methods, but had to remove
>>> the @Override as the RowFilter class I'm extending from doesn't implement
>>> them.  Even with these methods in place, I still get the same error trying
>>> to add the iterator in the shell.
>>>
>>> I notice that the RowFilter class extends WrappingIterator, which also
>>> doesn't appear to have the describeOptions and validateOptions methods ...
>>> should I try extending from just the Filter class?  I didn't understand the
>>> benefits William listed of extending from the RowFilter class.  I just know
>>> that once I identify a RowKey should be purged based on its expTs ColFam
>>> Value, I want to remove all entries for that RowKey.
>>>
>>>
>>> On Wed, Nov 6, 2013 at 3:29 PM, Billie Rinaldi <[email protected]
>>> > wrote:
>>>
>>>> To use setiter in the shell, your iterator must implement
>>>> OptionDescriber.  It has two methods, and something like the following
>>>> should work for your iterator.  If you implement passing options to the
>>>> iterator, you'll want to change the null parameters to the constructor of
>>>> IteratorOptions below, and probably also to do some validation in
>>>> validateOptions.
>>>>
>>>>   @Override
>>>>   public IteratorOptions describeOptions() {
>>>>     return new IteratorOptions("expTs",
>>>>         "Removes rows based on the column designated as the "
>>>>             + "expiration timestamp column family",
>>>>         null, null);
>>>>   }
>>>>
>>>>   @Override
>>>>   public boolean validateOptions(Map<String,String> options) {
>>>>     return true;
>>>>   }
>>>>
>>>>
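A minimal sketch of the kind of check validateOptions could perform once the iterator accepts options. It is written as a standalone class so it runs without Accumulo on the classpath. The option names expTsColFam and expTsDateFormat are hypothetical, not part of any Accumulo API. In a real iterator this logic would sit inside the validateOptions method shown above.

```java
import java.text.SimpleDateFormat;
import java.util.HashMap;
import java.util.Map;

final class ExpiryOptionsSketch {
    // Hypothetical option names, chosen only for this illustration.
    static final String COL_FAM_OPT = "expTsColFam";
    static final String DATE_FORMAT_OPT = "expTsDateFormat";

    /** Returns true when every supplied option is usable; absent options are allowed. */
    static boolean validateOptions(Map<String, String> options) {
        String colFam = options.get(COL_FAM_OPT);
        if (colFam != null && colFam.isEmpty()) {
            return false; // an empty column family name cannot match anything useful
        }
        String pattern = options.get(DATE_FORMAT_OPT);
        if (pattern != null) {
            try {
                new SimpleDateFormat(pattern); // throws on illegal pattern letters
            } catch (IllegalArgumentException e) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        Map<String, String> options = new HashMap<>();
        options.put(COL_FAM_OPT, "expTs");
        options.put(DATE_FORMAT_OPT, "yyyyMMddHHmmssSSS");
        System.out.println(validateOptions(options));
    }
}
```

Rejecting a bad pattern here means the problem surfaces when the iterator is configured rather than later, when a scan or compaction first tries to parse a value.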
>>>>
>>>> On Wed, Nov 6, 2013 at 12:49 PM, Terry P. <[email protected]> wrote:
>>>>
>>>>> Eyes of an eagle Billie!  com is correct, but after viewing
>>>>> "org.apache.accumulo" so many times, my brain was stuck on org and I
>>>>> goofed in my setiter syntax.
>>>>>
>>>>> With THAT corrected, here is the new error:
>>>>>
>>>>> root@meta> setiter -class
>>>>> com.esa.accumulo.iterators.ExpirationTimestampPurgeFilter -n expTsFilter
>>>>> -p 20 -scan -t itertest
>>>>> 2013-11-06 14:46:28,280 [shell.Shell] ERROR:
>>>>> org.apache.accumulo.core.util.shell.ShellCommandException: Command could
>>>>> not be initialized (Unable to load
>>>>> com.esa.accumulo.iterators.ExpirationTimestampPurgeFilter as type
>>>>> org.apache.accumulo.core.iterators.OptionDescriber; configure with
>>>>> 'config' instead)
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Wed, Nov 6, 2013 at 2:43 PM, Billie Rinaldi <
>>>>> [email protected]> wrote:
>>>>>
>>>>>> Is there a typo in the package name?  One place says "com" and the
>>>>>> other "org".
>>>>>>
>>>>>>
>>>>>> On Wed, Nov 6, 2013 at 12:37 PM, Terry P. <[email protected]> wrote:
>>>>>>
>>>>>>> Hi William, many thanks for the explanation of scan time versus
>>>>>>> compaction time. I'll look through the classes again and note where the
>>>>>>> remove versus suppress wordings are used and open a ticket.
>>>>>>>
>>>>>>> As mentioned, I only dabble in Java, but regardless of that fact at
>>>>>>> this point I'm the one that has to get this done. I've cobbled together
>>>>>>> my first attempt, but I get the following error when I try to add it as
>>>>>>> a scan iterator for testing:
>>>>>>>
>>>>>>> root@meta> setiter -class
>>>>>>> org.esa.accumulo.iterators.ExpirationTimestampPurgeFilter -n expTsFilter
>>>>>>> -p 20 -scan -t itertest
>>>>>>> 2013-11-06 14:06:34,914 [shell.Shell] ERROR:
>>>>>>> org.apache.accumulo.core.util.shell.ShellCommandException: Command could
>>>>>>> not be initialized (Servers are unable to load
>>>>>>> org.esa.accumulo.iterators.ExpirationTimestampPurgeFilter as type
>>>>>>> org.apache.accumulo.core.iterators.SortedKeyValueIterator)
>>>>>>>
>>>>>>> Here's my source.  Note that the value stored in the expTs ColFam is
>>>>>>> in the format "yyyyMMddHHmmssS", which I convert to a long for a direct
>>>>>>> comparison to System.currentTimeMillis(). I only overrode the init and
>>>>>>> acceptRow methods, hoping the others would work as-is from the base
>>>>>>> class.
>>>>>>>
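A minimal sketch of the conversion described above, written without any Accumulo dependency. It assumes the stored values are 17-digit strings such as 20131104171412445 and that they are in UTC; the thread does not state the time zone, so that part is a guess. The class and method names are made up for illustration. The comparison uses <= to match the "value is <= NOW" rule stated in the message, whereas the posted source uses a strict <.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.TimeZone;

final class ExpiryTimestampSketch {
    // Matches values such as 20131104171412445 (millisecond precision).
    static final String PATTERN = "yyyyMMddHHmmssSSS";

    /** Parses a stored timestamp into epoch milliseconds. ASSUMPTION: values are UTC. */
    static long toEpochMillis(String stored) {
        // SimpleDateFormat is not thread-safe, so this sketch builds one per call.
        SimpleDateFormat df = new SimpleDateFormat(PATTERN);
        df.setTimeZone(TimeZone.getTimeZone("UTC"));
        df.setLenient(false); // reject impossible dates such as month 13
        try {
            return df.parse(stored.trim()).getTime();
        } catch (ParseException e) {
            throw new IllegalArgumentException("Bad expiration timestamp: " + stored, e);
        }
    }

    /** True when the stored expiration time is at or before nowMillis. */
    static boolean isExpired(String stored, long nowMillis) {
        return toEpochMillis(stored) <= nowMillis;
    }

    public static void main(String[] args) {
        System.out.println(isExpired("20131104171412445", System.currentTimeMillis()));
    }
}
```

Setting the time zone explicitly matters because SimpleDateFormat otherwise uses the default zone of whichever server runs the iterator, which may differ from the zone the ingest app used.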
>>>>>>> One clarification: turns out expTs is the ColumnFamily, and the
>>>>>>> ingest app does not assign a ColumnQualifier for expTs. So to amend my
>>>>>>> prior table layout (including the datetime format):
>>>>>>>
>>>>>>>
>>>>>>> Format: Key:CF:CQ:Value
>>>>>>> abc:data:title:"My fantastic data"
>>>>>>> abc:data:content:<bytedata>
>>>>>>> abc:creTs::20130804171412445
>>>>>>> abc:*expTs*::20131104171412445
>>>>>>> ... 6-8 more columns of data per row ...
>>>>>>>
>>>>>>> where *expTs* is the ColumnFamily to determine if the entire row
>>>>>>> should be removed based on whether its value is <= NOW.  If a row has
>>>>>>> not yet been assigned an expiration date, expTs will not be set and the
>>>>>>> ColumnFamily will not yet be present.  Seems like an odd choice to use
>>>>>>> distinct Column Families, without Column Qualifiers, but that's how the
>>>>>>> ingest app was done.
>>>>>>>
>>>>>>> I greatly appreciate any advice you can provide.
>>>>>>>
>>>>>>> package com.esa.accumulo.iterators;
>>>>>>>
>>>>>>> import java.io.IOException;
>>>>>>> import java.text.ParseException;
>>>>>>> import java.text.SimpleDateFormat;
>>>>>>> import java.util.Date;
>>>>>>> import java.util.Map;
>>>>>>>
>>>>>>> import org.apache.accumulo.core.data.Key;
>>>>>>> import org.apache.accumulo.core.data.Value;
>>>>>>> import org.apache.accumulo.core.iterators.IteratorEnvironment;
>>>>>>> import org.apache.accumulo.core.iterators.SortedKeyValueIterator;
>>>>>>> import org.apache.accumulo.core.iterators.user.RowFilter;
>>>>>>>
>>>>>>> /**
>>>>>>>  * A filter that removes rows based on the column designated as the
>>>>>>>  * "expiration timestamp" column family.
>>>>>>>  *
>>>>>>>  * It removes the row if the value in the expirationTimestamp column
>>>>>>>  * is less than currentTime.
>>>>>>>  *
>>>>>>>  * TODO: The designation of the expirationTimestamp ColumnFamily and
>>>>>>>  * its DateFormat is set in the iterator options when the iterator is
>>>>>>>  * applied to the table. (For now it is hardcoded to match the format
>>>>>>>  * used in the Solr-Accumulo plugin)
>>>>>>>  */
>>>>>>> public class ExpirationTimestampPurgeFilter extends RowFilter {
>>>>>>>   private long currentTime;
>>>>>>>   // TODO: make accumuloDateFormat settable via Iterator Options
>>>>>>>   // Date Format for Expiration Timestamp ColumnFamily stored in
>>>>>>>   // Accumulo
>>>>>>>   private String expTsDateFormat = "yyyyMMddHHmmssS";
>>>>>>>   SimpleDateFormat df = new SimpleDateFormat(expTsDateFormat);
>>>>>>>
>>>>>>>   // TODO: make expTs settable via Iterator Options
>>>>>>>   // ColumnFamily containing Expiration Timestamp value (note ingest
>>>>>>>   // app did NOT assign a ColumnQualifier, only a ColumnFamily)
>>>>>>>   private String expTsColFam = "expTs";
>>>>>>>
>>>>>>>   @Override
>>>>>>>   public boolean acceptRow(SortedKeyValueIterator<Key, Value> rowIterator)
>>>>>>>       throws IOException {
>>>>>>>
>>>>>>>     if (rowIterator.getTopKey().getColumnFamily().toString()
>>>>>>>         .equals(expTsColFam)) {
>>>>>>>       Date expTsDate = null;
>>>>>>>       try {
>>>>>>>         expTsDate = df.parse(rowIterator.getTopValue().toString());
>>>>>>>         if (expTsDate.getTime() < currentTime)
>>>>>>>           return false;
>>>>>>>       } catch (ParseException e) {
>>>>>>>         // TODO Auto-generated catch block
>>>>>>>         e.printStackTrace();
>>>>>>>       }
>>>>>>>     }
>>>>>>>     return true;
>>>>>>>   }
>>>>>>>
>>>>>>>   @Override
>>>>>>>   public void init(SortedKeyValueIterator<Key, Value> source,
>>>>>>>       Map<String, String> options, IteratorEnvironment env)
>>>>>>>       throws IOException {
>>>>>>>     super.init(source, options, env);
>>>>>>>     currentTime = System.currentTimeMillis();
>>>>>>>   }
>>>>>>>
>>>>>>> }
>>>>>>>
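One thing worth noting about the source above: acceptRow inspects only the top key of the row iterator. With the layout described in the message, creTs and data sort before expTs, so the first key would never be the expiration column. A sketch of the row-wide check follows, using a plain list as a stand-in for the row iterator so it runs without Accumulo. The Entry class and all names here are hypothetical. In a real RowFilter the loop would presumably advance the iterator with hasTop(), getTopKey(), getTopValue() and next(). The sketch also assumes UTC values in the 17-digit layout.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Arrays;
import java.util.List;
import java.util.TimeZone;

final class RowExpirySketch {

    /** Hypothetical stand-in for one key/value pair of a row (column family and value only). */
    static final class Entry {
        final String columnFamily;
        final String value;

        Entry(String columnFamily, String value) {
            this.columnFamily = columnFamily;
            this.value = value;
        }
    }

    /**
     * Returns false to suppress the row and true to keep it, following the
     * acceptRow convention. Every entry is examined, not only the first.
     */
    static boolean acceptRow(List<Entry> row, String expTsColFam, long nowMillis) {
        for (Entry entry : row) {
            if (!expTsColFam.equals(entry.columnFamily)) {
                continue; // not the expiration column; keep looking
            }
            try {
                // ASSUMPTION: values are UTC and use the 17-digit layout from the thread.
                SimpleDateFormat df = new SimpleDateFormat("yyyyMMddHHmmssSSS");
                df.setTimeZone(TimeZone.getTimeZone("UTC"));
                return df.parse(entry.value).getTime() > nowMillis;
            } catch (ParseException e) {
                return true; // unparseable expiry: keep the row rather than drop data
            }
        }
        return true; // no expTs entry yet, so the row has no expiration date
    }

    public static void main(String[] args) {
        List<Entry> row = Arrays.asList(
            new Entry("creTs", "20130804171412445"),
            new Entry("data", "My fantastic data"),
            new Entry("expTs", "20131104171412445"));
        // expTs is last in sort order, so checking only the first key would miss it.
        System.out.println(acceptRow(row, "expTs", System.currentTimeMillis()));
    }
}
```

Keeping rows whose expiration value cannot be parsed is a conservative choice for a filter that may run at compaction time, where a wrong "suppress" decision deletes data permanently.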
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Nov 5, 2013 at 8:48 PM, William Slacum <
>>>>>>> [email protected]> wrote:
>>>>>>>
>>>>>>>> If an iterator is only set at scan time, then its logic will only
>>>>>>>> be applied when a client scans the table. The data will persist through
>>>>>>>> major and minor compaction and be visible if you scanned the RFile(s)
>>>>>>>> backing the table. "Suppress" is the better word in this case. Would
>>>>>>>> you please open a ticket pointing us where to update the documentation?
>>>>>>>>
>>>>>>>> It looks like you'd want to implement a RowFilter for your use
>>>>>>>> case. It has the necessary hooks to avoid reading a whole row into
>>>>>>>> memory, and it handles the logic of determining whether or not to
>>>>>>>> write keys that occur before the column you're filtering on (at the
>>>>>>>> cost of reading those keys twice).
>>>>>>>>
>>>>>>>>
>>>>>>>> On Tue, Nov 5, 2013 at 6:20 PM, Terry P. <[email protected]>wrote:
>>>>>>>>
>>>>>>>>> Greetings everyone,
>>>>>>>>> I'm looking at the AgeOffFilter as a base from which to write a
>>>>>>>>> server-side filter / iterator to purge rows when they have aged off
>>>>>>>>> based on the value of a specific column in the row (expiry datetime
>>>>>>>>> <= now). So this differs from the AgeOffFilter in that the criterion
>>>>>>>>> for removal is from the same column in every row (not the Accumulo
>>>>>>>>> timestamp for an individual entry), and we need to remove the entire
>>>>>>>>> row, not just individual entries. For example:
>>>>>>>>>
>>>>>>>>> Format: Key:CF:CQ:Value
>>>>>>>>> abc:data:title:"My fantastic data"
>>>>>>>>> abc:data:content:<bytedata>
>>>>>>>>> abc:data:creTs:2013-08-04T17:14:12Z
>>>>>>>>> abc:data:*expTs*:2013-11-04T17:14:12Z
>>>>>>>>> ... 6-8 more columns of data per row ...
>>>>>>>>>
>>>>>>>>> where *expTs* is the column to determine if the entire row should
>>>>>>>>> be removed based on whether its value is <= NOW.
>>>>>>>>>
>>>>>>>>> This task seemed easy enough as a client program (and it is
>>>>>>>>> really), but a server-side iterator would be far more efficient than
>>>>>>>>> sending millions of rowkeys across the network just to delete them
>>>>>>>>> (we'll be deleting more than a million every hour).  But I'm
>>>>>>>>> struggling to get there.
>>>>>>>>>
>>>>>>>>> In looking at AgeOffFilter.java, is the "magic" in the
>>>>>>>>> AgeOffFilter class that removes (deletes) an entry from a table the
>>>>>>>>> fact that the accept method returns false, combined with the fact
>>>>>>>>> that the iterator would be set to run at -majc or -minc time and it
>>>>>>>>> is the compaction code that actually deletes the entry?  If set to
>>>>>>>>> run only at scan time, would AgeOffFilter simply not return the rows
>>>>>>>>> during the scan, but not delete them?  The wording in the iterator
>>>>>>>>> classes varies, some saying "remove" and others "suppress", so it's
>>>>>>>>> not clear to me.
>>>>>>>>>
>>>>>>>>> If that's the case, then I think I know where to implement the
>>>>>>>>> logic. The question is, how can I remove all the entries for the row
>>>>>>>>> once the accept method has determined it meets the criteria?
>>>>>>>>>
>>>>>>>>> Or as Mike Drob mentioned in a prior post, will basing my class on
>>>>>>>>> the RowFilter class instead of just Filter make things easier?  Or the
>>>>>>>>> WholeRowIterator?  Just trying to find the simplest solution.
>>>>>>>>>
>>>>>>>>> Sorry for what may be obvious questions but I'm more of a DB
>>>>>>>>> Architect that does some coding, and not a Java programmer by trade.
>>>>>>>>> With all of the amazing things Accumulo does, honestly I was
>>>>>>>>> surprised when I couldn't find a way to delete rows in the shell by
>>>>>>>>> criteria other than the rowkey!  I'm more used to having a shell to
>>>>>>>>> 'delete from *table* where *column* <= *value*'.
>>>>>>>>>
>>>>>>>>> But looking at it now, everyone's criteria for deletion will
>>>>>>>>> likely be different given the flexibility of a key=>value store.  If
>>>>>>>>> our rowkey had the date/timestamp as a prefix, I know an easy
>>>>>>>>> deletemany command in the shell would do the trick -- but the nature
>>>>>>>>> of the data is such that initially no expiration timestamp is set,
>>>>>>>>> and there is no means to update the key from the client app when the
>>>>>>>>> expiration timestamp finally gets set (too much rework on that common
>>>>>>>>> tool I'm afraid).
>>>>>>>>>
>>>>>>>>> Thanks in advance.
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
