Is there a typo in the package name? One place says "com" and the other "org".
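
If it helps, here is a minimal, self-contained sketch of why that mismatch would matter. It does not use Accumulo at all: ClassNameCheck and isLoadable are hypothetical names of mine, and I am assuming the servers resolve the -class argument by its fully qualified name, roughly the way Class.forName does. A class declared under one package prefix is not found under another:

```java
// Hypothetical helper, not part of Accumulo: checks whether a fully
// qualified class name can be resolved on the current classpath.
class ClassNameCheck {

    /** Returns true if a class with exactly this fully qualified name can be loaded. */
    public static boolean isLoadable(String fullyQualifiedName) {
        try {
            Class.forName(fullyQualifiedName);
            return true;
        } catch (ClassNotFoundException e) {
            // The name (package prefix included) does not match any class.
            return false;
        }
    }

    public static void main(String[] args) {
        // HashMap is declared in package java.util, so this lookup succeeds.
        System.out.println(isLoadable("java.util.HashMap")); // prints "true"
        // Same simple name, different package prefix: the lookup fails.
        System.out.println(isLoadable("org.util.HashMap"));  // prints "false"
    }
}
```

If that is what is happening here, either the source should declare package org.esa.accumulo.iterators, or the setiter command should name com.esa.accumulo.iterators.ExpirationTimestampPurgeFilter, with the jar deployed to the servers matching whichever you pick.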

On Wed, Nov 6, 2013 at 12:37 PM, Terry P. wrote:

> Hi William, many thanks for the explanation of scan time versus compaction
> time. I'll look through the classes again and note where the remove versus
> suppress wordings are used and open a ticket.
>
> As mentioned, I only dabble in Java, but regardless of that fact at this
> point I'm the one that has to get this done. I've cobbled together my first
> attempt, but I get the following error when I try to add it as a scan
> iterator for testing:
>
> root@meta> setiter -class org.esa.accumulo.iterators.ExpirationTimestampPurgeFilter -n expTsFilter -p 20 -scan -t itertest
> 2013-11-06 14:06:34,914 [shell.Shell] ERROR:
> org.apache.accumulo.core.util.shell.ShellCommandException: Command could
> not be initialized (Servers are unable to load
> org.esa.accumulo.iterators.ExpirationTimestampPurgeFilter as type
> org.apache.accumulo.core.iterators.SortedKeyValueIterator)
>
> Here's my source. Note that the value stored in the expTs ColFam is in
> the format "yyyyMMddHHmmssS", which I convert to a long for a direct
> comparison to System.currentTimeMillis(). I only overrode the init and
> acceptRow methods, hoping the others would work as-is from the base class.
>
> One clarification: turns out expTs is the ColumnFamily, and the ingest app
> does not assign a ColumnQualifier for expTs. So to amend my prior table
> layout (including the datetime format):
>
> Format: Key:CF:CQ:Value
> abc:data:title:"My fantastic data"
> abc:data:content:<bytedata>
> abc:creTs::20130804171412445
> abc:*expTs*::20131104171412445
> ... 6-8 more columns of data per row ...
>
> where *expTs* is the ColumnFamily to determine if the entire row should
> be removed based on whether its value is <= NOW. If a row has not yet been
> assigned an expiration date, expTs will not be set and the ColumnFamily
> will not yet be present. Seems like an odd choice to use distinct Column
> Families, without Column Qualifiers, but that's how the ingest app was done.
>
> I greatly appreciate any advice you can provide.
>
> package com.esa.accumulo.iterators;
>
> import java.io.IOException;
> import java.text.ParseException;
> import java.text.SimpleDateFormat;
> import java.util.Date;
> import java.util.Map;
>
> import org.apache.accumulo.core.data.Key;
> import org.apache.accumulo.core.data.Value;
> import org.apache.accumulo.core.iterators.IteratorEnvironment;
> import org.apache.accumulo.core.iterators.SortedKeyValueIterator;
> import org.apache.accumulo.core.iterators.user.RowFilter;
>
> /**
>  * A filter that removes rows based on the column designated as the
>  * "expiration timestamp" column family.
>  *
>  * It removes the row if the value in the expirationTimestamp column is
>  * less than currentTime.
>  *
>  * TODO: The designation of the expirationTimestamp ColumnFamily and its
>  * DateFormat is set in the iterator options when the iterator is applied
>  * to the table. (For now it is hardcoded to match the format used in the
>  * Solr-Accumulo plugin)
>  */
> public class ExpirationTimestampPurgeFilter extends RowFilter {
>   private long currentTime;
>   // TODO: make accumuloDateFormat settable via Iterator Options
>   // Date Format for Expiration Timestamp ColumnFamily stored in Accumulo
>   private String expTsDateFormat = "yyyyMMddHHmmssS";
>   SimpleDateFormat df = new SimpleDateFormat(expTsDateFormat);
>
>   // TODO: make expTs settable via Iterator Options
>   // ColumnFamily containing Expiration Timestamp value (note ingest app
>   // did NOT assign a ColumnQualifier, only a ColumnFamily)
>   private String expTsColFam = "expTs";
>
>   @Override
>   public boolean acceptRow(SortedKeyValueIterator<Key, Value> rowIterator)
>       throws IOException {
>
>     if (rowIterator.getTopKey().getColumnFamily().toString().equals(expTsColFam)) {
>       Date expTsDate = null;
>       try {
>         expTsDate = df.parse(rowIterator.getTopValue().toString());
>         if (expTsDate.getTime() < currentTime)
>           return false;
>       } catch (ParseException e) {
>         // TODO Auto-generated catch block
>         e.printStackTrace();
>       }
>     }
>     return true;
>   }
>
>   @Override
>   public void init(SortedKeyValueIterator<Key, Value> source,
>       Map<String, String> options, IteratorEnvironment env) throws IOException {
>     super.init(source, options, env);
>     currentTime = System.currentTimeMillis();
>   }
>
> }
>
> On Tue, Nov 5, 2013 at 8:48 PM, William Slacum wrote:
>
>> If an iterator is only set at scan time, then its logic will only be
>> applied when a client scans the table. The data will persist through major
>> and minor compaction and be visible if you scanned the RFile(s) backing the
>> table. "Suppress" is the better word in this case. Would you please open a
>> ticket pointing us where to update the documentation?
>>
>> It looks like you'd want to implement a RowFilter for your use case. It
>> has the necessary hooks to avoid reading a whole row into memory and
>> handling the logic of determining whether or not to write keys that occur
>> before the column you're filtering on (at the cost of reading those keys
>> twice).
>>
>> On Tue, Nov 5, 2013 at 6:20 PM, Terry P. wrote:
>>
>>> Greetings everyone,
>>> I'm looking at the AgeOffFilter as a base from which to write a
>>> server-side filter / iterator to purge rows when they have aged off based
>>> on the value of a specific column in the row (expiry datetime <= now). So
>>> this differs from the AgeOffFilter in that the criterion for removal is
>>> from the same column in every row (not the Accumulo timestamp for an
>>> individual entry), and we need to remove the entire row, not just individual
>>> entries. For example:
>>>
>>> Format: Key:CF:CQ:Value
>>> abc:data:title:"My fantastic data"
>>> abc:data:content:<bytedata>
>>> abc:data:creTs:2013-08-04T17:14:12Z
>>> abc:data:*expTs*:2013-11-04T17:14:12Z
>>> ... 6-8 more columns of data per row ...
>>>
>>> where *expTs* is the column to determine if the entire row should be
>>> removed based on whether its value is <= NOW.
>>>
>>> This task seemed easy enough as a client program (and it is really), but
>>> a server-side iterator would be far more efficient than sending millions of
>>> rowkeys across the network just to delete them (we'll be deleting more than
>>> a million every hour). But I'm struggling to get there.
>>>
>>> In looking at AgeOffFilter.java, is the "magic" in the AgeOffFilter
>>> class that removes (deletes) an entry from a table the fact that the accept
>>> method returns false, combined with the fact that the iterator would be set
>>> to run at -majc or -minc time and it is the compaction code that actually
>>> deletes the entry? If set to run only at scan time, would AgeOffFilter
>>> simply not return the rows during the scan, but not delete them? The
>>> wording in the iterator classes varies, some saying "remove", others saying
>>> "suppress", so it's not clear to me.
>>>
>>> If that's the case, then I think I know where to implement the logic.
>>> The question is, how can I remove all the entries for the row once the
>>> accept method has determined it meets the criteria?
>>>
>>> Or as Mike Drob mentioned in a prior post, will basing my class on the
>>> RowFilter class instead of just Filter make things easier? Or the
>>> WholeRowIterator? Just trying to find the simplest solution.
>>>
>>> Sorry for what may be obvious questions but I'm more of a DB Architect
>>> that does some coding, and not a Java programmer by trade. With all of the
>>> amazing things Accumulo does, honestly I was surprised when I couldn't find
>>> a way to delete rows in the shell by criteria other than the rowkey! I'm
>>> more used to having a shell to 'delete from *table* where *column* <=
>>> *value*'.
>>>
>>> But looking at it now, everyone's criteria for deletion will likely be
>>> different given the flexibility of a key=>value store. If our rowkey had
>>> the date/timestamp as a prefix, I know an easy deletemany command in the
>>> shell would do the trick -- but the nature of the data is such that
>>> initially no expiration timestamp is set, and there is no means to update
>>> the key from the client app when the expiration timestamp finally gets set
>>> (too much rework on that common tool I'm afraid).
>>>
>>> Thanks in advance.
