Re: Scan vs Get

Jean-Marc Spaggiari Wed, 20 May 2015 06:03:51 -0700

Ok. I found a clean way to improve that a lot without going with the
filter. I will open a JIRA and push a fix.


The idea is to cap the scanner caching at LIMIT, so we don't read the
entire table before returning to the shell. Also, we have to change
where we test the limit.
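
To illustrate why the placement of the limit test matters, here is a minimal, self-contained Ruby sketch. It is not the code from table.rb: FakeScanner, scan_check_after_fetch and scan_check_before_fetch are hypothetical names invented for this example, and the caching change is not modeled. It only assumes what the thread below describes, namely a table where a single row carries the requested qualifier, so asking for a second row forces the scanner to walk to the end of the table.

```ruby
# Hypothetical stand-in for a server-side scanner (not an HBase API).
# It walks the rows in order, counts how many rows it had to examine,
# and only returns rows that contain the requested column qualifier.
class FakeScanner
  attr_reader :rows_examined

  def initialize(rows, qualifier)
    @rows = rows
    @qualifier = qualifier
    @pos = 0
    @rows_examined = 0
  end

  # Returns the next matching row, or nil at the end of the table.
  def next_row
    while @pos < @rows.length
      row = @rows[@pos]
      @pos += 1
      @rows_examined += 1
      return row if row[:columns].include?(@qualifier)
    end
    nil
  end
end

# Limit tested only after another row has already been requested.
def scan_check_after_fetch(scanner, limit)
  results = []
  while (row = scanner.next_row)
    break if limit > 0 && results.length >= limit
    results << row
  end
  results
end

# Limit tested before asking the scanner for another row.
def scan_check_before_fetch(scanner, limit)
  results = []
  loop do
    break if limit > 0 && results.length >= limit
    row = scanner.next_row
    break if row.nil?
    results << row
  end
  results
end

# 65,000 rows; only the first one carries the qualifier 'v:q'.
table = Array.new(65_000) do |i|
  { key: format('%05d', i), columns: i.zero? ? ['v:q'] : ['v:other'] }
end

slow = FakeScanner.new(table, 'v:q')
slow_rows = scan_check_after_fetch(slow, 1)

fast = FakeScanner.new(table, 'v:q')
fast_rows = scan_check_before_fetch(fast, 1)

puts "check after fetch:  #{slow.rows_examined} rows examined"
puts "check before fetch: #{fast.rows_examined} rows examined"
```

Both variants return the same single row. The first one still examines the whole table, because it requests a second matching row before it tests the limit.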

Anyway, JIRA 13721 is open; I will push something there today.

Thanks,

JM

2015-05-19 23:51 GMT-04:00 Ted Yu <[email protected]>:

> For PageFilter :
>
>  * Implementation of Filter interface that limits results to a specific page
>  * size. It terminates scanning once the number of filter-passed rows is >
>  * the given page size.
>
> In your case, what should be the page size - 0 ?
>
> Cheers
>
> On Tue, May 19, 2015 at 8:30 PM, Jean-Marc Spaggiari <
> [email protected]> wrote:
>
> > Oh, I see! So basically we do a full table scan because it never returns a
> > 2nd row, so we never reach that break and we exit only when we reach the
> > end of the table. Hence the same performance with or without the LIMIT
> > parameter...
> >
> > Should we then try to add a filter like PageFilter to the scan if we have a
> > LIMIT? At least that might avoid a full scan?
> >
> > 2015-05-19 23:14 GMT-04:00 Matteo Bertozzi <[email protected]>:
> >
> > > Take a look at table.rb _scan_internal()
> > > LIMIT is not passed to the server, so you fetch more rows
> > >
> > > https://github.com/apache/hbase/blob/master/hbase-shell/src/main/ruby/hbase/table.rb#L495
> > >
> > > Matteo
> > >
> > >
> > > On Tue, May 19, 2015 at 8:11 PM, Jean-Marc Spaggiari <
> > > [email protected]> wrote:
> > >
> > > > > I tried to run scan/get/scan/get many times, and always got the same
> > > > > pattern. You can remove the "LIMIT => 1" parameter and you will get the
> > > > > same performance.
> > > > >
> > > > > Scan and get without the QC return in very similar times: 191ms for
> > > > > one, 194ms for the other.
> > > >
> > > > 2015-05-19 23:02 GMT-04:00 Ted Yu <[email protected]>:
> > > >
> > > > > J-M:
> > > > > How many times did you try the pair of queries ?
> > > > >
> > > > > > Since scan was run first, this would give the get query some
> > > > > > advantage, right ?
> > > > >
> > > > > Cheers
> > > > >
> > > > > On Tue, May 19, 2015 at 7:34 PM, Jean-Marc Spaggiari <
> > > > > [email protected]> wrote:
> > > > >
> > > > > > > Aren't Scans and Gets supposed to be almost as fast?
> > > > > > >
> > > > > > > I have a pretty small table with 65K lines, few columns (hundred?),
> > > > > > > and I am trying to do a get and a scan.
> > > > > >
> > > > > > > hbase(main):009:0> scan 'sensors', { COLUMNS => ['v:f92acb5b-079a-42bc-913a-657f270a3dc1'], STARTROW => '000a', LIMIT => 1 }
> > > > > > > ROW    COLUMN+CELL
> > > > > > >  000a  column=v:f92acb5b-079a-42bc-913a-657f270a3dc1, timestamp=1432088038576, value=\x08000aHf92acb5b-079a-42bc-913a-657f270a3dc1\x0EFAILURE\x0CNE-858\x140-0000-000\x02\x96\x01SXOAXTPSIUFPPNUCIEVQGCIZHCEJBKGWINHKIHFRHWHNATAHAHQBFRAYLOAMQEGKLNZIFM
> > > > > > > 000a
> > > > > > > 1 row(s) in 12.6720 seconds
> > > > > >
> > > > > > > hbase(main):010:0> get 'sensors', '000a', {COLUMN => 'v:f92acb5b-079a-42bc-913a-657f270a3dc1'}
> > > > > > > COLUMN    CELL
> > > > > > >  v:f92acb5b-079a-42bc-913a-657f270a3dc1  timestamp=1432088038576, value=\x08000aHf92acb5b-079a-42bc-913a-657f270a3dc1\x0EFAILURE\x0CNE-858\x140-0000-000\x02\x96\x01SXOAXTPSIUFPPNUCIEVQGCIZHCEJBKGWINHKIHFRHWHNATAHAHQBFRAYLOAMQEGKLNZIFM
> > > > > > > 000a
> > > > > > >
> > > > > > > 1 row(s) in 0.0280 seconds
> > > > > >
> > > > > >
> > > > > > > They both return the same result. However, the get returns in 28ms
> > > > > > > while the scan returns in 12672ms.
> > > > > > >
> > > > > > > How can the scan be that slow? Is it normal? If I remove the QC from
> > > > > > > the scan, then it takes only 250ms to return all the columns. I think
> > > > > > > something is not correct.
> > > > > > >
> > > > > > > I'm running on 1.0.0-cdh5.4.0 so I guess it's the same for 1.0.x...
> > > > > >
> > > > > > JM
> > > > > >
> > > > >
> > > >
> > >
> >
>
