Re: Scan vs Filter performance.

Keith Turner Thu, 01 Oct 2015 08:09:47 -0700

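I think you are missing the last one because next() calls super.next() at
the end AND your hasTop() calls super.hasTop().  Once next() has copied the
last entry into emitKey/emitValue and advanced the source, the source has no
top, so hasTop() returns false while that last entry is still buffered.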
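
The next()/hasTop() interaction described above can be reproduced without Accumulo. What follows is a minimal sketch, not the Accumulo API: ListSource, SimpleIter, BuggyBuffering, FixedBuffering and BufferingDemo are all hypothetical names made up for illustration, and the range check from the original iterator is left out. It shows one possible fix, which is to answer hasTop from the buffered entry rather than from the source.

```scala
// Hypothetical stand-in for a sorted source; this is NOT the Accumulo API.
class ListSource(entries: Vector[(String, String)]) {
  private var pos = 0
  def hasTop: Boolean = pos < entries.length
  def top: (String, String) = entries(pos)
  def next(): Unit = pos += 1
}

// Minimal iterator contract used by both variants below (hypothetical).
trait SimpleIter {
  def seek(): Unit
  def hasTop: Boolean
  def top: (String, String)
  def next(): Unit
}

// Mirrors the pattern in the question: next() buffers the current entry and
// advances the source, but hasTop is answered by the source.
class BuggyBuffering(source: ListSource) extends SimpleIter {
  private var emit: (String, String) = null
  def seek(): Unit = if (source.hasTop) next()
  def hasTop: Boolean = source.hasTop // false while the last entry is still buffered
  def top: (String, String) = emit
  def next(): Unit =
    if (source.hasTop) {
      emit = source.top
      source.next()
    }
}

// One possible fix: hasTop reflects whether an entry is buffered.
class FixedBuffering(source: ListSource) extends SimpleIter {
  private var emit: Option[(String, String)] = None
  def seek(): Unit = next()
  def hasTop: Boolean = emit.isDefined
  def top: (String, String) = emit.get
  def next(): Unit =
    if (source.hasTop) {
      emit = Some(source.top)
      source.next()
    } else {
      emit = None // source exhausted: nothing left to return
    }
}

object BufferingDemo {
  // Drives an iterator the way a scan would: seek, then hasTop/top/next.
  def drain(it: SimpleIter): Vector[String] = {
    it.seek()
    val rows = Vector.newBuilder[String]
    while (it.hasTop) {
      rows += it.top._1
      it.next()
    }
    rows.result()
  }
}
```

With three entries, draining BuggyBuffering stops one row early, while FixedBuffering returns all three. In the real iterator the equivalent change would presumably be to have hasTop() report whether emitKey is set, and to clear emitKey when the source is exhausted or the row falls outside the range; that is an assumption about the elided parts of the code, not something taken from it.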


On Tue, Sep 29, 2015 at 3:45 PM, Moises Baly <[email protected]> wrote:

> Hi there,
>
> I'm writing a custom iterator, which essentially is obtaining a range of
> values using a slightly different way to compare the rows (for keeping in
> range). In one test, it should return every row in Accumulo, but it's
> missing the last one. The most important parts of the code would look like
> this:
>
> class CIterator extends WrappingIterator() {
>   private var emitKey: Key = _
>   private var emitValue: Value = _
>
>   override def deepCopy(env: IteratorEnvironment): SortedKeyValueIterator[Key, Value] = {
>     new CIterator(this, env)
>   }
>
>
>   def this(_this: CIterator, env: IteratorEnvironment) = {
>     this()
>     setSource(_this.getSource.deepCopy(env))
>   }
>
>   override def init(source: SortedKeyValueIterator[Key, Value], options: util.Map[String, String], env: IteratorEnvironment) = {
>     super.init(source, options, env)
>   }
>
>   override def getTopKey(): Key = {
>     emitKey
>   }
>
>   override def getTopValue(): Value = {
>     emitValue
>   }
>
>   override def hasTop(): Boolean = {
>     super.hasTop
>   }
>
>   override def seek(range: Range, columnFamilies: util.Collection[ByteSequence], inclusive: Boolean): Unit = {
>
>     ...
>
>     val seekRange = new Range(partialKeyStart.toString, true, partialKeyEnd.toString, true)
>
>     super.seek(seekRange, columnFamilies, inclusive);
>
>     if (super.hasTop()) {
>       next();
>     }
>   }
>
>   override def next(): Unit = {
>     ...
>     val lowerBoundCheck = rangeStart.compareTo(nextKey.getRow.toString)
>     val upperBoundCheck = rangeEnd.compareTo(nextKey.getRow.toString)
>     if (lowerBoundCheck <= 0 && upperBoundCheck >= 0){
>       emitKey = new Key(nextKey)
>       emitValue = new Value(nextValue)
>       if (super.hasTop()){
>         super.next()
>       }
>
>     }
>   }
> }
>
>
> So that code, if I have a range that comprises every row, returns every one 
> of them but the last one. A high level call list would look like this:
>
> Seek ->
> Next ->
> hasTop ->
> Top key ->
> Top key ->
> Top value ->
> hasTop ->
> Top key ->
> Top value ->
> Next ->
> hasTop ->
> Top key ->
> Top key ->
> Top value ->
> (print row - value 1) -> hasTop ->
> Top key ->
> Top value ->
> Next ->
> hasTop ->
> Top key ->
> Top key ->
> Top value -> (print row - value 2) -> hasTop ->
> Top key ->
> Top value ->
> Next ->
> hasTop -> (print row - value 3) ->  hasTop ->
>
> I think I'm missing something on the call tree:
>
> 1- Is it normal to have many subsequent topKey() calls after next()?
>
> 2- This is supposed to give me every row (the condition put in place for the 
> range is working), but as you can see, it stops after the last next() call, 
> for some reason (maybe something to do with the interfaces hierarchy?)
>
> 3- In general, what would be a correct approach (execution path) for building 
> a custom iterator? I'm still unsure how the iterator functions (next, 
> seek, getTop...) interact with each other, especially in the way we give back 
> results to clients.
>
> Thank you for your time,
>
>
> Moises
>
>
>
> On Tue, Sep 29, 2015 at 11:16 AM, Keith Turner <[email protected]> wrote:
>
>>
>>
>> On Tue, Sep 29, 2015 at 12:59 AM, mohit.kaushik <[email protected]> wrote:
>>
>>> Hi Keith,
>>>
>>> When we fetch a column or column family, it seems it does not seek and
>>> only scans by filtering the key/value pairs. But as you said, if I design a
>>> custom iterator to fetch a column family, it may work faster.
>>>
>>
>> When column families are fetched, Accumulo will seek[1].  It tries to
>> read 10 cells and then seeks.
>>
>> When fetching family and qualifier, two iterators are used.  The
>> ColumnFamilySkippingIterator and ColumnQualifierFilter.  The
>> ColumnQualifierFilter does a scan of all qualifiers within a family [2].
>> The system configures the qualifier filter to have the family skipping iter
>> as a source[3], so it could still seek between families.
>>
>>
>>>
>>> But I want to know what the scenario would be if I define a locality
>>> group for the column family and run the same custom iterator on it, which
>>> both scans and seeks. What would be the impact on performance (gain or loss)?
>>>
>>
>> Like Josh said, it really depends on your situation. It's hard to offer an
>> opinion w/o knowing more about the schema and the queries.
>>
>> Below I expanded on what Josh mentioned.
>>
>> If you have a locality group, it can really help in the case where you
>> have many rows that have a few families.  For example if you have 10^7 rows
>> in a tablet and only 10^3 have a certain column family that's in a locality
>> group, it can make it very fast to find those 1000 rows.  W/o a locality
>> group even w/ seeking, you would still be seeking to each row.
>>
>> Conversely, suppose you have 10^2 rows in a tablet, each having many
>> families.  If the column family you are interested in only exists in 10
>> rows, you will still need to seek for each row to find it, but ~100 seeks
>> is not so bad.
>>
>>
>>
>> [1]:
>> https://github.com/apache/accumulo/blob/1.6.3/core/src/main/java/org/apache/accumulo/core/iterators/system/ColumnFamilySkippingIterator.java#L65
>> [2]:
>> https://github.com/apache/accumulo/blob/1.6.3/core/src/main/java/org/apache/accumulo/core/iterators/system/ColumnQualifierFilter.java#L54
>> [3]:
>> https://github.com/apache/accumulo/blob/1.6.3/server/tserver/src/main/java/org/apache/accumulo/tserver/Tablet.java#L2005
>>
>>
>>>
>>> Thanks
>>> Mohit Kaushik
>>>
>>>
>>> On 09/28/2015 10:49 PM, Moises Baly wrote:
>>>
>>> Hi Keith,
>>>
>>> No I wasn't aware of that. So I'll move forward with the custom
>>> iterator.
>>>
>>> Thank you for your time,
>>>
>>> Moises
>>>
>>> On Mon, Sep 28, 2015 at 12:35 PM, Keith Turner <[email protected]> wrote:
>>>
>>>> On Mon, Sep 28, 2015 at 12:19 PM, Moises Baly <[email protected]>
>>>> wrote:
>>>>
>>>>> Hi all:
>>>>>
>>>>> I would like to perform a range scan on a table, tweaking the
>>>>> definition of what goes into a particular key range. One way I can
>>>>> think of is writing a filter on the key, and that would work fine. But
>>>>> I think it would be slow compared to a scan / seek custom iterator. How
>>>>> does the underlying logic work? Does Filter go through all records, or,
>>>>> since the data is sorted, does it follow the same underlying logic as a
>>>>> scan? Would a custom iterator perform better?
>>>>>
>>>>
>>>> Yes, filter will read all data.  Custom iterator that seeks may be
>>>> faster.
>>>>
>>>> Are you aware of the following?
>>>>
>>>> https://issues.apache.org/jira/browse/ACCUMULO-3961
>>>> https://github.com/apache/accumulo/pull/42
>>>>
>>>>
>>>>>
>>>>> Thank you for your time,
>>>>>
>>>>> Moises
>>>>>
>>>>
>>>>
>>>
>>>
>>>
>>
>>
>
