In your previous example:
scan 'table1', {FILTER => "ValueFilter(=, 'binary:5')"}
there was no constraint on the timestamp. See the following javadoc from
Scan.java:
* To only retrieve columns within a specific range of version timestamps,
* execute {@link #setTimeRange(long, long) setTimeRange}.
* <p>
* To only retrieve columns with a specific timestamp, execute
* {@link #setTimeStamp(long) setTimestamp}.
You can use one of the above methods to make your scan more selective.
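For example, here is a minimal sketch against the 0.94-era Java client. The
table name 'table1' comes from your shell command; the one-hour window is only
an illustrative assumption, so adjust it to your data:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.BinaryComparator;
import org.apache.hadoop.hbase.filter.CompareFilter;
import org.apache.hadoop.hbase.filter.ValueFilter;
import org.apache.hadoop.hbase.util.Bytes;

public class TimeRangeScan {
  public static void main(String[] args) throws IOException {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "table1");
    try {
      Scan scan = new Scan();
      long now = System.currentTimeMillis();
      // Only consider cells written within the last hour (illustrative window).
      scan.setTimeRange(now - 3600 * 1000L, now);
      // Same value filter as in the shell example.
      scan.setFilter(new ValueFilter(CompareFilter.CompareOp.EQUAL,
          new BinaryComparator(Bytes.toBytes("5"))));
      ResultScanner scanner = table.getScanner(scan);
      try {
        for (Result result : scanner) {
          System.out.println(result);
        }
      } finally {
        scanner.close();
      }
    } finally {
      table.close();
    }
  }
}

The shell exposes the same restriction through the TIMERANGE option, e.g.
scan 'table1', {TIMERANGE => [t1, t2], FILTER => "ValueFilter(=, 'binary:5')"}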
ValueFilter#filterKeyValue(Cell) doesn't utilize the more advanced ReturnCode
values. You can refer to:
https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/filter/Filter.ReturnCode.html
You can take a look at SingleColumnValueFilter#filterKeyValue() for an example
of how the various ReturnCodes are used to speed up a scan.
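To make the ReturnCode idea concrete, here is a rough sketch against the 0.94
filter API. The class name and the "check only the newest version" behaviour
are my own for this example (it is not how ValueFilter itself behaves), and a
custom filter has to be deployed on the region server classpath. Because the
versions of a column are sorted newest-first, filterKeyValue() sees the newest
cell of each column first and can return NEXT_COL to jump over all the older
versions instead of examining them one by one:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.util.Arrays;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.filter.FilterBase;
import org.apache.hadoop.hbase.util.Bytes;

// Illustrative only: keeps a column if its newest version equals the
// expected value, and never looks at the older versions.
public class NewestVersionValueFilter extends FilterBase {
  private byte[] expected;

  // No-arg constructor needed for Writable deserialization on the server.
  public NewestVersionValueFilter() {
  }

  public NewestVersionValueFilter(byte[] expected) {
    this.expected = expected;
  }

  @Override
  public ReturnCode filterKeyValue(KeyValue kv) {
    if (Arrays.equals(kv.getValue(), expected)) {
      // Keep the newest matching cell, then skip its older versions.
      return ReturnCode.INCLUDE_AND_NEXT_COL;
    }
    // Newest version doesn't match: jump straight to the next column
    // instead of reading every older version of this one.
    return ReturnCode.NEXT_COL;
  }

  @Override
  public void write(DataOutput out) throws IOException {
    Bytes.writeByteArray(out, expected);
  }

  @Override
  public void readFields(DataInput in) throws IOException {
    expected = Bytes.readByteArray(in);
  }
}

If I remember right, this is essentially what SingleColumnValueFilter does
with NEXT_ROW once its latestVersionOnly flag is set.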
Cheers
On Fri, Apr 11, 2014 at 8:40 AM, Guillermo Ortiz <[email protected]> wrote:
> I read something interesting about it in HBase TDG.
>
> Page 344:
> The StoreScanner class combines the store files and memstore that the
> Store instance
> contains. It is also where the exclusion happens, based on the Bloom
> filter, or the timestamp. If you are asking for versions that are not more
> than 30 minutes old, for example, you can skip all storage files that are
> older than one hour: they will not contain anything of interest. See "Key
> Design" on page 357 for details on the exclusion, and how to make use of
> it.
>
> So, I guess it doesn't have to read all the HFiles? But I don't know
> whether HBase really uses the timestamp of each row or the date of the file.
> I guess that when I execute the scan it reads everything, but I don't know
> why. I think there's something I'm not seeing about how this is supposed to
> work.
>
>
> 2014-04-11 13:05 GMT+02:00 gortiz <[email protected]>:
>
> > Sorry, I didn't get why it should read all the timestamps and not just
> > the newest if they're sorted and you didn't specify any timestamp in
> > your filter.
> >
> >
> >
> > On 11/04/14 12:13, Anoop John wrote:
> >
> >> In the storage layer (HFiles in HDFS) all versions of a particular cell
> >> are stored together. (Yes, the KVs have to be lexicographically ordered.)
> >> So during a scan we will have to read all the version data; at this
> >> storage layer it doesn't know anything about version semantics.
> >>
> >> -Anoop-
> >>
> >> On Fri, Apr 11, 2014 at 3:33 PM, gortiz <[email protected]> wrote:
> >>
> >>> Yes, I have tried two different values for the max versions setting:
> >>> 1000 and the maximum integer value.
> >>>
> >>> But I want to keep those versions; I don't want to keep just 3.
> >>> Imagine that I want to record a new version each minute and store a
> >>> day's worth: that's 1440 versions.
> >>>
> >>> Why does HBase read all the versions? I thought that if you don't
> >>> specify any versions it just reads the newest and skips the rest. It
> >>> doesn't make much sense to read all of them if the data is sorted and
> >>> the newest version is stored at the top.
> >>>
> >>>
> >>>
> >>> On 11/04/14 11:54, Anoop John wrote:
> >>>
> >>>> What is the max versions setting you have used for your table CF?
> >>>> When you set a value, HBase has to keep all those versions, and
> >>>> during a scan it will read all of them. In the 0.94 version the
> >>>> default for max versions is 3. I guess you have set a bigger value;
> >>>> if you have not, would you mind testing after a major compaction?
> >>>>
> >>>> -Anoop-
> >>>>
> >>>> On Fri, Apr 11, 2014 at 1:01 PM, gortiz <[email protected]> wrote:
> >>>>
> >>>>> The last test I have done is to reduce the number of versions to 100.
> >>>>> So, right now, I have 100 rows with 100 versions each.
> >>>>> Times are (I got the same times for block sizes of 64Kb and 1Mb):
> >>>>> 100 rows - 1000 versions + blockcache -> 80s.
> >>>>> 100 rows - 1000 versions + no blockcache -> 70s.
> >>>>>
> >>>>> 100 rows - *100* versions + blockcache -> 7.3s.
> >>>>> 100 rows - *100* versions + no blockcache -> 6.1s.
> >>>>>
> >>>>> What's the reason for this? I guessed HBase was smart enough not to
> >>>>> consider old versions, so it would just check the newest. But I
> >>>>> reduced the size (in versions) by 10x and got a 10x improvement in
> >>>>> performance.
> >>>>>
> >>>>> The filter is scan 'filters', {FILTER => "ValueFilter(=,
> >>>>> 'binary:5')", STARTROW => '1010000000000000000000000000000000000101',
> >>>>> STOPROW => '6010000000000000000000000000000000000201'}
> >>>>>
> >>>>>
> >>>>>
> >>>>> On 11/04/14 09:04, gortiz wrote:
> >>>>>
> >>>>>> Well, I guessed that, but it doesn't make much sense because it's
> >>>>>> so slow. Right now I only have 100 rows with 1000 versions each.
> >>>>>> I have checked the size of the dataset and each row is about 700Kbytes
> >>>>>> (around 7Gb, 100 rows x 1000 versions). So, it should only check
> >>>>>> 100 rows x 700Kbytes = 70Mb, since it just checks the newest version.
> >>>>>> How can it spend so much time checking this quantity of data?
> >>>>>>
> >>>>>> I'm regenerating the dataset with a bigger block size (previously it
> >>>>>> was 64Kb; now it's going to be 1Mb). I could try tuning the scanner
> >>>>>> caching and batching parameters, but I don't think they're going to
> >>>>>> make much difference.
> >>>>>>
> >>>>>> Another test I want to do is to generate the same dataset with just
> >>>>>> 100 versions. It should take around the same time, right? Or am I
> >>>>>> wrong?
> >>>>>>
> >>>>>> On 10/04/14 18:08, Ted Yu wrote:
> >>>>>>
> >>>>>> It should be the newest version of each value.
> >>>>>>
> >>>>>>> Cheers
> >>>>>>>
> >>>>>>>
> >>>>>>> On Thu, Apr 10, 2014 at 9:55 AM, gortiz <[email protected]> wrote:
> >>>>>>>
> >>>>>>>> Another little question: with the filter I'm using, do I check
> >>>>>>>> all the versions or just the newest? Because I'm wondering
> >>>>>>>> whether, when I do a scan over the whole table, I look for the
> >>>>>>>> value "5" in the whole dataset or just in the newest version of
> >>>>>>>> each value.
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On 10/04/14 16:52, gortiz wrote:
> >>>>>>>>
> >>>>>>>>> I was trying to check the behaviour of HBase. The cluster is a
> >>>>>>>>> group of old computers: one master and five slaves, each one with
> >>>>>>>>> 2Gb, so 12Gb in total.
> >>>>>>>>> The table has a column family with 1000 columns and each column
> >>>>>>>>> with 100 versions.
> >>>>>>>>> There's another column family with four columns and one image of
> >>>>>>>>> 100kb. (I've tried without this column family as well.)
> >>>>>>>>> The table is partitioned manually across all the slaves, so data
> >>>>>>>>> is balanced in the cluster.
> >>>>>>>>>
> >>>>>>>>> I'm executing this command: *scan 'table1', {FILTER =>
> >>>>>>>>> "ValueFilter(=, 'binary:5')"}* in HBase 0.94.6
> >>>>>>>>> My lease and RPC timeouts are three minutes.
> >>>>>>>>> Since it's a full scan of the table, I have been playing with the
> >>>>>>>>> BLOCKCACHE as well (just disabling and enabling it, not changing
> >>>>>>>>> its size). I thought that it was going to cause too many calls to
> >>>>>>>>> the GC. I'm not sure about this point.
> >>>>>>>>>
> >>>>>>>>> I know that it's not the best way to use HBase; it's just a test.
> >>>>>>>>> I think it's not working well because the hardware isn't enough,
> >>>>>>>>> although I would like to try some kind of tuning to improve it.
> >>>>>>>>>
> >>>>>>>>> On 10/04/14 14:21, Ted Yu wrote:
> >>>>>>>>>
> >>>>>>>>>> Can you give us a bit more information:
> >>>>>>>>>>
> >>>>>>>>>> HBase release you're running
> >>>>>>>>>> What filters are used for the scan
> >>>>>>>>>>
> >>>>>>>>>> Thanks
> >>>>>>>>>>
> >>>>>>>>>> On Apr 10, 2014, at 2:36 AM, gortiz <[email protected]> wrote:
> >>>>>>>>>>
> >>>>>>>>>>> I got this error when I executed a full scan with filters on a
> >>>>>>>>>>> table.
> >>>>>>>>>>
> >>>>>>>>>>> Caused by: java.lang.RuntimeException:
> >>>>>>>>>>> org.apache.hadoop.hbase.regionserver.LeaseException:
> >>>>>>>>>>> org.apache.hadoop.hbase.regionserver.LeaseException: lease
> >>>>>>>>>>> '-4165751462641113359' does not exist
> >>>>>>>>>>>     at org.apache.hadoop.hbase.regionserver.Leases.removeLease(Leases.java:231)
> >>>>>>>>>>>     at org.apache.hadoop.hbase.regionserver.HRegionServer.next(HRegionServer.java:2482)
> >>>>>>>>>>>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> >>>>>>>>>>>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> >>>>>>>>>>>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> >>>>>>>>>>>     at java.lang.reflect.Method.invoke(Method.java:597)
> >>>>>>>>>>>     at org.apache.hadoop.hbase.ipc.WritableRpcEngine$Server.call(WritableRpcEngine.java:320)
> >>>>>>>>>>>     at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1428)
> >>>>>>>>>>>
> >>>>>>>>>>> I have read about increasing the lease time and RPC timeout,
> >>>>>>>>>>> but it's not working. What else could I try? The table isn't
> >>>>>>>>>>> too big. I have been checking the GC, HMaster and some
> >>>>>>>>>>> RegionServer logs and I didn't see anything weird. I also tried
> >>>>>>>>>>> a couple of different caching values.
> >>>>>>>>>>>
> >>>>>>>>>>>
> >
> > --
> > *Guillermo Ortiz*
> > /Big Data Developer/
> >
> > Telf.: +34 917 680 490
> > Fax: +34 913 833 301
> > C/ Manuel Tovar, 49-53 - 28034 Madrid - Spain
> >
> > _http://www.bidoop.es_
> >
> >
>