On Fri, Nov 8, 2013 at 12:26 AM, Terry P. <[email protected]> wrote:
> Hi Keith,
> Given that what I need to filter on is only the expTs column family, would
> it be faster to seek? I don't know how to seek, but I also can't figure
> out how to iterate inside the acceptRow method -- there's no scanner as I
> normally use when reading and iterating over key/values.
>
I think in your case you should iterate, because you are only advancing a
few keys. Some of the iterators in Accumulo iterate for a few keys first
and, if the target is not found within 10 keys, fall back to a seek.
You could do something like the following.
ByteSequence expTsColFam = new ArrayByteSequence("expTs");

@Override
public boolean acceptRow(SortedKeyValueIterator<Key, Value> rowIterator)
    throws IOException {
  while (rowIterator.hasTop()) {
    // this is faster because it compares against the byte arrays in the
    // key directly w/o converting to a string
    int cmp = rowIterator.getTopKey().getColumnFamilyData().compareTo(expTsColFam);
    if (cmp == 0) {
      ....
    } else if (cmp > 0) {
      // went past the column family
      // can only do this w/ col fam since it is the next field in sort
      // order after row
      break;
    }
    rowIterator.next();
  }
  return true;
}
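
To make the iterate-then-seek idea concrete without a running Accumulo instance, here is a minimal, self-contained Java sketch. It is a simulation under stated assumptions, not Accumulo API: a TreeMap of column family to value stands in for the sorted keys of one row (column qualifiers are ignored), ceilingEntry stands in for a seek, and the names ExpTsRowCheck, findExpTs and MAX_NEXT are made up for illustration. The threshold of 10 keys, the "expTs" family and the date format are taken from this thread.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical simulation: a TreeMap (column family -> value) stands in for
// the sorted keys of a single row. None of this is Accumulo API.
class ExpTsRowCheck {
    static final String EXP_TS = "expTs";
    static final int MAX_NEXT = 10; // keys to step over before "seeking"

    // Returns the expTs value for the row, or null if the family is absent.
    static String findExpTs(TreeMap<String, String> row) {
        int steps = 0;
        for (Map.Entry<String, String> e : row.entrySet()) {
            int cmp = e.getKey().compareTo(EXP_TS);
            if (cmp == 0) {
                return e.getValue();
            }
            if (cmp > 0) {
                return null; // sorted order: already past where expTs would be
            }
            if (++steps >= MAX_NEXT) {
                // Simulated seek: jump to the first family >= expTs.
                Map.Entry<String, String> hit = row.ceilingEntry(EXP_TS);
                return (hit != null && hit.getKey().equals(EXP_TS)) ? hit.getValue() : null;
            }
        }
        return null;
    }

    // Keeps the row unless its expTs value parses to a time before 'now'.
    static boolean acceptRow(TreeMap<String, String> row, long now) {
        String value = findExpTs(row);
        if (value == null) {
            return true; // no expiration assigned yet
        }
        try {
            // SimpleDateFormat is not thread-safe, so create one per call here.
            SimpleDateFormat df = new SimpleDateFormat("yyyyMMddHHmmssS");
            return df.parse(value).getTime() >= now;
        } catch (ParseException e) {
            return true; // unparseable value: keep the row
        }
    }

    public static void main(String[] args) {
        TreeMap<String, String> row = new TreeMap<>();
        row.put("creTs", "20130804171412445");
        row.put("data", "My fantastic data");
        row.put("expTs", "20131104171412445");
        System.out.println(acceptRow(row, System.currentTimeMillis()));
    }
}
```

String.compareTo agrees with Accumulo's byte ordering only for plain ASCII family names like the ones here; a real iterator should keep comparing ByteSequence values as in the snippet above.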
>
> I read the notes on the seek method and it does seem like using it would
> be more efficient since the only criteria for this filter is the expTs
> column family and thus only those RFiles would be opened, but I just can't
> figure out where to start and my Googling hasn't yielded any examples yet.
>
>
>
> On Thu, Nov 7, 2013 at 3:16 PM, Keith Turner <[email protected]> wrote:
>
>>
>>
>>
>> On Thu, Nov 7, 2013 at 3:49 PM, Terry P. <[email protected]> wrote:
>>
>>> Hi Keith,
>>> No, expTs won't be the first actually -- that'll teach me to try things
>>> with overly simplistic data!
>>>
>>
>>> There will be 10-12 column families for each row. I take it my simple
>>> check for column family name isn't enough?
>>>
>>
>> You can iterate until you see the column or seek to it. If you expect
>> there will always be a small amount of data before the column occurs, then
>> iterate.
>>
>>
>>>
>>>
>>> On Thursday, November 7, 2013, Keith Turner wrote:
>>>
>>>> Your acceptRow function assumes that expTs will be the first column in
>>>> the row. Is this always the case?
>>>>
>>>>
>>>> On Wed, Nov 6, 2013 at 3:37 PM, Terry P. <[email protected]> wrote:
>>>>
>>>> Hi William, many thanks for the explanation of scan time versus
>>>> compaction time. I'll look through the classes again and note where the
>>>> remove versus suppress wordings are used and open a ticket.
>>>>
>>>> As mentioned, I only dabble in Java, but regardless of that fact at
>>>> this point I'm the one that has to get this done. I've cobbled together my
>>>> first attempt, but I get the following error when I try to add it as a
>>>> scan iterator for testing:
>>>>
>>>> root@meta> setiter -class
>>>> org.esa.accumulo.iterators.ExpirationTimestampPurgeFilter -n expTsFilter -p
>>>> 20 -scan -t itertest
>>>> 2013-11-06 14:06:34,914 [shell.Shell] ERROR:
>>>> org.apache.accumulo.core.util.shell.ShellCommandException: Command could
>>>> not be initialized (Servers are unable to load
>>>> org.esa.accumulo.iterators.ExpirationTimestampPurgeFilter as type
>>>> org.apache.accumulo.core.iterators.SortedKeyValueIterator)
>>>>
>>>> Here's my source. Note that the value stored in the expTs ColFam is in
>>>> the format "yyyyMMddHHmmssS", which I convert to a long for a direct
>>>> comparison to System.currentTimeMillis(). I only overrode the init and
>>>> acceptRow methods, hoping the others would work as-is from the base class.
>>>>
>>>> One clarification: turns out expTs is the ColumnFamily, and the ingest
>>>> app does not assign a ColumnQualifier for expTs. So to amend my prior table
>>>> layout (including the datetime format):
>>>>
>>>>
>>>> Format: Key:CF:CQ:Value
>>>> abc:data:title:"My fantastic data"
>>>> abc:data:content:<bytedata>
>>>> abc:creTs::20130804171412445
>>>> abc:*expTs*::20131104171412445
>>>> ... 6-8 more columns of data per row ...
>>>>
>>>> where *expTs* is the ColumnFamily to determine if the entire row
>>>> should be removed based on whether its value is <= NOW. If a row has not
>>>> yet been assigned an expiration date, expTs will not be set and the
>>>> ColumnFamily will not yet be present. Seems like an odd choice to use
>>>> distinct Column Families, without Column Qualifiers, but that's how the
>>>> ingest app was done.
>>>>
>>>> I greatly appreciate any advice you can provide.
>>>>
>>>> package com.esa.accumulo.iterators;
>>>>
>>>> import java.io.IOException;
>>>> import java.text.ParseException;
>>>> import java.text.SimpleDateFormat;
>>>> import java.util.Date;
>>>> import java.util.Map;
>>>>
>>>> import org.apache.accumulo.core.data.Key;
>>>> import org.apache.accumulo.core.data.Value;
>>>> import org.apache.accumulo.core.iterators.IteratorEnvironment;
>>>> import org.apache.accumulo.core.iterators.SortedKeyValueIterator;
>>>> import org.apache.accumulo.core.iterators.user.RowFilter;
>>>>
>>>> /**
>>>>  * A filter that removes rows based on the column designated as the
>>>>  * "expiration timestamp" column family.
>>>>  *
>>>>  * It removes the row if the value in the expirationTimestamp column is
>>>>  * less than currentTime.
>>>>  *
>>>>  * TODO: The designation of the expirationTimestamp ColumnFamily and its
>>>>  * DateFormat is set in the iterator options when the iterator is applied
>>>>  * to the table. (For now it is hardcoded to match the format used in the
>>>>  * Solr-Accumulo plugin)
>>>>  */
>>>> public class ExpirationTimestampPurgeFilter extends RowFilter {
>>>>   private long currentTime;
>>>>   // TODO: make accumuloDateFormat settable via Iterator Options
>>>>   // Date Format for Expiration Timestamp ColumnFamily stored in Accumulo
>>>>   private String expTsDateFormat = "yyyyMMddHHmmssS";
>>>>   SimpleDateFormat df = new SimpleDateFormat(expTsDateFormat);
>>>>
>>>>   // TODO: make expTs settable via Iterator Options
>>>>   // ColumnFamily containing Expiration Timestamp value (note ingest app
>>>>   // did NOT assign a ColumnQualifier, only a ColumnFamily)
>>>>   private String expTsColFam = "expTs";
>>>>
>>>>   @Override
>>>>   public boolean acceptRow(SortedKeyValueIterator<Key, Value> rowIterator)
>>>>       throws IOException {
>>>>
>>>>     if (rowIterator.getTopKey().getColumnFamily().toString().equals(expTsColFam)) {
>>>>       Date expTsDate = null;
>>>>       try {
>>>>         expTsDate = df.parse(rowIterator.getTopValue().toString());
>>>>         if (expTsDate.getTime() < currentTime)
>>>>           return false;
>>>>       } catch (ParseException e) {
>>>>         // TODO Auto-generated catch block
>>>>         e.printStackTrace();
>>>>       }
>>>>     }
>>>>     return true;
>>>>   }
>>>>
>>>>   @Override
>>>>   public void init(SortedKeyValueIterator<Key, Value> source,
>>>>       Map<String, Str
>>>>
>>>>
>>
>