Re: How to remove entire row at the server side?

Keith Turner Fri, 08 Nov 2013 06:50:47 -0800

On Fri, Nov 8, 2013 at 12:26 AM, Terry P. <[email protected]> wrote:


> Hi Keith,
> Given that what I need to filter on is only the expTs column family, would
> it be faster to seek?  I don't know how to seek, but I also can't figure
> out how to iterate inside the acceptRow method -- there's no scanner as I
> normally use when reading and iterating over key/values.
>

I think in your case you should iterate, because you are only advancing a
few keys. Some of the iterators in Accumulo try to iterate for a few
keys, and then seek if the target is not found within 10 keys.
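
To make the iterate-then-seek idea concrete, here is a toy model of it. This is
not Accumulo code: SortedCursor, advanceTo and the call counters are
hypothetical names, and a sorted String array with unique keys stands in for
the sorted key/value stream. It only illustrates stepping a bounded number of
times before falling back to one seek.

```java
import java.util.Arrays;

/** Toy cursor over a sorted array; a stand-in for a sorted key/value iterator. */
final class SortedCursor {
  private final String[] keys; // assumption: sorted ascending, no duplicates
  private int pos = 0;
  int nextCalls = 0; // how many times we stepped
  int seekCalls = 0; // how many times we jumped

  SortedCursor(String... sortedKeys) {
    this.keys = sortedKeys;
  }

  boolean hasTop() {
    return pos < keys.length;
  }

  String top() {
    return keys[pos];
  }

  void next() {
    pos++;
    nextCalls++;
  }

  /** Jump to the first key >= target using binary search. */
  void seek(String target) {
    int i = Arrays.binarySearch(keys, target);
    pos = i >= 0 ? i : -(i + 1);
    seekCalls++;
  }

  /**
   * Advance to the first key >= target: step with next() at most maxSteps
   * times, then fall back to a single seek().
   */
  void advanceTo(String target, int maxSteps) {
    int steps = 0;
    while (hasTop() && top().compareTo(target) < 0) {
      if (steps == maxSteps) {
        seek(target);
        return;
      }
      next();
      steps++;
    }
  }
}
```

With a limit of 10, a target three keys ahead is reached with three next()
calls and no seek, while a target fifty keys ahead costs ten next() calls plus
one seek. With only 10-12 column families per row, plain iteration should
rarely hit the limit.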

Could do something like the following.


  // Needs imports for org.apache.accumulo.core.data.ByteSequence and
  // org.apache.accumulo.core.data.ArrayByteSequence.
  ByteSequence expTsColFam = new ArrayByteSequence("expTs");

  @Override
  public boolean acceptRow(SortedKeyValueIterator<Key, Value> rowIterator)
      throws IOException {
    while (rowIterator.hasTop()) {
      // This is faster because it compares against the byte arrays in the
      // key directly, without converting to a String.
      int cmp = rowIterator.getTopKey().getColumnFamilyData().compareTo(expTsColFam);
      if (cmp == 0) {
        ....
      } else if (cmp > 0) {
        // Went past the column family. This early exit only works for the
        // column family, since it is the next field in the sort order after
        // the row.
        break;
      }
      rowIterator.next();
    }

    return true;
  }
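
For the cmp == 0 branch, the check could look roughly like the sketch below. It
is deliberately free of Accumulo classes so it can be run on its own: the row is
modeled as {columnFamily, value} String pairs in sorted column-family order, and
ExpTsRowCheck, parseExpTs and the list-based acceptRow are hypothetical names.
It also assumes the stored timestamps are UTC, that a row with an unparseable
timestamp is kept, and that the row is dropped when expTs <= now, as described
in the original question. Adjust those to your data.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.List;
import java.util.TimeZone;

final class ExpTsRowCheck {
  static final String EXP_TS_COL_FAM = "expTs";

  /** Parse a value such as "20131104171412445" into epoch milliseconds. */
  static long parseExpTs(String value) {
    // A new formatter per call, since SimpleDateFormat is not thread-safe.
    SimpleDateFormat df = new SimpleDateFormat("yyyyMMddHHmmssS");
    df.setTimeZone(TimeZone.getTimeZone("UTC")); // assumption: values are UTC
    try {
      return df.parse(value).getTime();
    } catch (ParseException e) {
      throw new IllegalArgumentException("Bad expTs value: " + value, e);
    }
  }

  /**
   * Decide whether to keep a row. Each entry is {columnFamily, value}, and
   * entries must be in sorted column-family order. Returns false when the
   * row has an expTs value that is <= now.
   */
  static boolean acceptRow(List<String[]> row, long now) {
    for (String[] entry : row) {
      int cmp = entry[0].compareTo(EXP_TS_COL_FAM);
      if (cmp == 0) {
        try {
          return parseExpTs(entry[1]) > now;
        } catch (IllegalArgumentException e) {
          return true; // assumption: keep rows with unparseable timestamps
        }
      } else if (cmp > 0) {
        break; // went past where expTs would sort; the row has no expTs
      }
    }
    return true; // no expiration timestamp assigned yet
  }
}
```

In the real filter the same comparison would sit inside the while loop above,
reading the value from rowIterator.getTopValue() instead of a String pair.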


>
> I read the notes on the seek method and it does seem like using it would
> be more efficient since the only criteria for this filter is the expTs
> column family and thus only those RFiles would be opened, but I just can't
> figure out where to start and my Googling hasn't yielded any examples yet.
>
>
>
> On Thu, Nov 7, 2013 at 3:16 PM, Keith Turner <[email protected]> wrote:
>
>>
>>
>>
>> On Thu, Nov 7, 2013 at 3:49 PM, Terry P. <[email protected]> wrote:
>>
>>> Hi Keith,
>>> No, expTs won't be the first actually -- that'll teach me to try things
>>> with overly simplistic data!
>>>
>>
>>>  There will be 10-12 column families for each row. I take it my simple
>>> check for column family name isn't enough?
>>>
>>
>> You can iterate until you see the column or seek to it.   If you expect
>> there will always be a small amount of data before the column occurs, then iterate.
>>
>>
>>>
>>>
>>> On Thursday, November 7, 2013, Keith Turner wrote:
>>>
>>>> Your accept row function assumes that expTs will be the first column in
>>>> the row, is this always the case?
>>>>
>>>>
>>>> On Wed, Nov 6, 2013 at 3:37 PM, Terry P. <[email protected]> wrote:
>>>>
>>>> Hi William, many thanks for the explanation of scan time versus
>>>> compaction time. I'll look through the classes again and note where the
>>>> remove versus suppress wordings are used and open a ticket.
>>>>
>>>> As mentioned, I only dabble in java, but regardless of that fact at
>>>> this point I'm the one that has to get this done. I've cobbled together my
>>>> first attempt, but I get the following error where I try to add it as a
>>>> scan iterator for testing:
>>>>
>>>> root@meta> setiter -class
>>>> org.esa.accumulo.iterators.ExpirationTimestampPurgeFilter -n expTsFilter -p
>>>> 20 -scan -t itertest
>>>> 2013-11-06 14:06:34,914 [shell.Shell] ERROR:
>>>> org.apache.accumulo.core.util.shell.ShellCommandException: Command could
>>>> not be initialized (Servers are unable to load
>>>> org.esa.accumulo.iterators.ExpirationTimestampPurgeFilter as type
>>>> org.apache.accumulo.core.iterators.SortedKeyValueIterator)
>>>>
>>>> Here's my source.  Note that the value stored in the expTs ColFam is in
>>>> the format "yyyyMMddHHmmssS", which I convert to a long for a direct
>>>> comparison to System.currentTimeMillis(). I only overrode the init and
>>>> acceptRow methods, hoping the others would work as-is from the base class.
>>>>
>>>> One clarification: turns out expTs is the ColumnFamily, and the ingest
>>>> app does not assign a ColumnQualifier for expTs. So to amend my prior table
>>>> layout (including the datetime format):
>>>>
>>>>
>>>> Format: Key:CF:CQ:Value
>>>> abc:data:title:"My fantastic data"
>>>> abc:data:content:<bytedata>
>>>> abc:creTs::20130804171412445
>>>> abc:*expTs*::20131104171412445
>>>> ... 6-8 more columns of data per row ...
>>>>
>>>> where *expTs* is the ColumnFamily to determine if the entire row
>>>> should be removed based on whether its value is <= NOW.  If a row has not
>>>> yet been assigned an expiration date, expTs will not be set and the
>>>> ColumnFamily will not yet be present.  Seems like an odd choice to use
>>>> distinct Column Families, without Column Qualifiers, but that's how the
>>>> ingest app was done.
>>>>
>>>> I greatly appreciate any advice you can provide.
>>>>
>>>> package com.esa.accumulo.iterators;
>>>>
>>>> import java.io.IOException;
>>>> import java.text.ParseException;
>>>> import java.text.SimpleDateFormat;
>>>> import java.util.Date;
>>>> import java.util.Map;
>>>>
>>>> import org.apache.accumulo.core.data.Key;
>>>> import org.apache.accumulo.core.data.Value;
>>>> import org.apache.accumulo.core.iterators.IteratorEnvironment;
>>>> import org.apache.accumulo.core.iterators.SortedKeyValueIterator;
>>>> import org.apache.accumulo.core.iterators.user.RowFilter;
>>>>
>>>> /**
>>>>  * A filter that removes rows based on the column designated as the
>>>>  * "expiration timestamp" column family.
>>>>  *
>>>>  * It removes the row if the value in the expirationTimestamp column is
>>>>  * less than currentTime.
>>>>  *
>>>>  * TODO: The designation of the expirationTimestamp ColumnFamily and its
>>>>  * DateFormat is set in the iterator options when the iterator is applied
>>>>  * to the table. (For now it is hardcoded to match the format used in the
>>>>  * Solr-Accumulo plugin)
>>>>  */
>>>> public class ExpirationTimestampPurgeFilter extends RowFilter {
>>>>   private long currentTime;
>>>>   // TODO: make accumuloDateFormat settable via Iterator Options
>>>>   // Date Format for Expiration Timestamp ColumnFamily stored in Accumulo
>>>>   private String expTsDateFormat = "yyyyMMddHHmmssS";
>>>>   SimpleDateFormat df = new SimpleDateFormat(expTsDateFormat);
>>>>
>>>>   // TODO: make expTs settable via Iterator Options
>>>>   // ColumnFamily containing Expiration Timestamp value (note ingest app
>>>>   // did NOT assign a ColumnQualifier, only a ColumnFamily)
>>>>   private String expTsColFam = "expTs";
>>>>
>>>>   @Override
>>>>   public boolean acceptRow(SortedKeyValueIterator<Key, Value> rowIterator)
>>>>     throws IOException {
>>>>
>>>>     if (rowIterator.getTopKey().getColumnFamily().toString().equals(expTsColFam)) {
>>>>        Date expTsDate = null;
>>>>        try {
>>>>          expTsDate = df.parse(rowIterator.getTopValue().toString());
>>>>            if (expTsDate.getTime() < currentTime)
>>>>              return false;
>>>>        } catch (ParseException e) {
>>>>          // TODO Auto-generated catch block
>>>>          e.printStackTrace();
>>>>        }
>>>>     }
>>>>     return true;
>>>>   }
>>>>
>>>>   @Override
>>>>   public void init(SortedKeyValueIterator<Key, Value> source,
>>>>       Map<String, Str
>>>>
>>>>
>>
>
