I don't want to be argumentative here, but by definition it's not an internal feature, because it's part of the public API. We use versioning in a way that makes me somewhat uncomfortable, but it's been quite useful. I'd like to see a clear explanation of why it exists and what use cases it was intended to support.
Brian,

> Since you asked…
>
> Simplest answer: your schema should not rely upon internal features of the
> system. Since you are tracking your data along the lines of a temporal
> attribute, it should be part of the schema. In terms of good design, by
> making it part of the schema, you're declaring that the data has a temporal
> property/attribute.
>
> Cell versioning is an internal feature of HBase. It's there for a reason.
> Perhaps one of the committers should expand on why it's there. (When I asked
> this earlier, I never got an answer.)
>
> Longer answer: review how HBase stores the rows, including the versions of
> the cell. You're putting unnecessary stress on the system.
>
> It's just not Zen… ;-)
>
> The reason I'm a bit short on this topic is that it's an issue that keeps
> coming up, over and over again, because some idiot keeps looking to take a
> shortcut without understanding the implications of their decision. Just like
> salting the key. (Note: prepending a truncated hash isn't the same as using
> a salt. Salting has a specific meaning, and the salt is orthogonal to the
> underlying key. Any relationship between the salt and the key is pure
> random luck.)
>
> Does that help?
> (BTW, this should be part of any schema-design talk… yet somehow I think it's
> not covered…)
>
> -Mike
>
> PS. It's not weird that the cell versions are checked. It makes perfect sense.
>
> On Apr 12, 2014, at 2:55 PM, Guillermo Ortiz <konstt2...@gmail.com> wrote:
>
>> Well, it was just an example of why I might keep a thousand versions of a
>> cell. I didn't know that HBase was checking each version when I do a scan;
>> it's a little weird when the data is sorted.
>>
>> You got my attention with your comment that it's better to store data over
>> time with new columns than with versions. Why is it better?
>> Versions look very convenient for that use case. So, does a rowkey with
>> 3600 columns work better than a rowkey with a column with 3600 versions?
>> What's the reason for avoiding massive use of versions?
>>
>> 2014-04-12 15:07 GMT+02:00 Michael Segel <michael_se...@hotmail.com>:
>>
>>> Silly question...
>>>
>>> Why does the idea of using versioning to capture temporal changes to data
>>> keep being propagated?
>>>
>>> Seriously, this issue keeps popping up...
>>>
>>> If you want to capture data over time... use a timestamp as part of the
>>> column name. Don't abuse the cell's versions.
>>>
>>> On Apr 11, 2014, at 11:03 AM, gortiz <gor...@pragsis.com> wrote:
>>>
>>>> Yes, I have tried two different values for the number of versions:
>>>> 1000, and the maximum value for integers.
>>>>
>>>> But I want to keep those versions; I don't want to keep just 3 versions.
>>>> Imagine that I want to record a new version each minute and store a
>>>> whole day: that's 1440 versions.
>>>>
>>>> Why is HBase going to read all the versions? I thought that if you don't
>>>> indicate any versions, it just reads the newest and skips the rest. It
>>>> doesn't make much sense to read all of them if the data is sorted, plus
>>>> the newest version is stored at the top.
>>>>
>>>> On 11/04/14 11:54, Anoop John wrote:
>>>>> What max versions setting have you used for your table's CF? When you
>>>>> set such a value, HBase has to keep all those versions, and during a
>>>>> scan it will read all of them. In 0.94 the default value for max
>>>>> versions is 3. I guess you have set some bigger value; if you have not,
>>>>> would you mind testing after a major compaction?
>>>>>
>>>>> -Anoop-
>>>>>
>>>>> On Fri, Apr 11, 2014 at 1:01 PM, gortiz <gor...@pragsis.com> wrote:
>>>>>
>>>>>> The last test I did was to reduce the number of versions to 100.
>>>>>> So, right now, I have 100 rows with 100 versions each.
>>>>>> The times are (I got the same times for block sizes of 64KB and 1MB):
>>>>>> 100 rows, 1000 versions + block cache -> 80s
>>>>>> 100 rows, 1000 versions + no block cache -> 70s
>>>>>>
>>>>>> 100 rows, *100* versions + block cache -> 7.3s
>>>>>> 100 rows, *100* versions + no block cache -> 6.1s
>>>>>>
>>>>>> What's the reason for this? I assumed HBase was smart enough not to
>>>>>> consider old versions and to just check the newest, but I reduced the
>>>>>> size (in versions) by 10x and got a 10x improvement in performance.
>>>>>>
>>>>>> The filter is: scan 'filters', {FILTER => "ValueFilter(=,
>>>>>> 'binary:5')", STARTROW => '1010000000000000000000000000000000000101',
>>>>>> STOPROW => '6010000000000000000000000000000000000201'}
>>>>>>
>>>>>> On 11/04/14 09:04, gortiz wrote:
>>>>>>
>>>>>>> Well, I guessed that, but it doesn't make much sense because it's so
>>>>>>> slow. Right now I only have 100 rows with 1000 versions each.
>>>>>>> I have checked the size of the dataset: each row is about 700KB
>>>>>>> (around 7GB in total, 100 rows x 1000 versions). So it should only
>>>>>>> check 100 rows x 700KB = 70MB, since it just checks the newest
>>>>>>> version. How can it spend so much time checking that quantity of
>>>>>>> data?
>>>>>>>
>>>>>>> I'm generating the dataset again with a bigger block size (previously
>>>>>>> it was 64KB; now it's going to be 1MB). I could try tuning the
>>>>>>> scanning and batching parameters, but I don't think they're going to
>>>>>>> have much effect.
>>>>>>>
>>>>>>> Another test I want to do is to generate the same dataset with just
>>>>>>> 100 versions. It should take around the same time, right? Or am I
>>>>>>> wrong?
>>>>>>>
>>>>>>> On 10/04/14 18:08, Ted Yu wrote:
>>>>>>>
>>>>>>>> It should be the newest version of each value.
>>>>>>>>
>>>>>>>> Cheers
>>>>>>>>
>>>>>>>> On Thu, Apr 10, 2014 at 9:55 AM, gortiz <gor...@pragsis.com> wrote:
>>>>>>>>
>>>>>>>>> Another little question: with the filter I'm using, do I check all
>>>>>>>>> the versions, or just the newest?
>>>>>>>>> Because I'm wondering whether, when I do a scan over the whole
>>>>>>>>> table, I'm looking for the value "5" in the whole dataset, or just
>>>>>>>>> in the newest version of each value.
>>>>>>>>>
>>>>>>>>> On 10/04/14 16:52, gortiz wrote:
>>>>>>>>>
>>>>>>>>>> I was trying to check the behaviour of HBase. The cluster is a
>>>>>>>>>> group of old computers: one master and five slaves, each one with
>>>>>>>>>> 2GB, so 12GB in total.
>>>>>>>>>> The table has a column family with 1000 columns, and each column
>>>>>>>>>> has 100 versions.
>>>>>>>>>> There's another column family with four columns and one image of
>>>>>>>>>> 100KB. (I've tried without this column family as well.)
>>>>>>>>>> The table is partitioned manually across all the slaves, so the
>>>>>>>>>> data is balanced in the cluster.
>>>>>>>>>>
>>>>>>>>>> I'm executing this statement in HBase 0.94.6: *scan 'table1',
>>>>>>>>>> {FILTER => "ValueFilter(=, 'binary:5')"}*
>>>>>>>>>> My timeout for the lease and RPC is three minutes.
>>>>>>>>>> Since it's a full scan of the table, I have been playing with the
>>>>>>>>>> BLOCKCACHE as well (just disabling and enabling it, not changing
>>>>>>>>>> its size). I thought it was going to cause too many GC calls; I'm
>>>>>>>>>> not sure about this point.
>>>>>>>>>>
>>>>>>>>>> I know that it's not the best way to use HBase; it's just a test.
>>>>>>>>>> I think that it's not working well because the hardware isn't
>>>>>>>>>> enough, although I would like to try some kind of tuning to
>>>>>>>>>> improve it.
>>>>>>>>>>
>>>>>>>>>> On 10/04/14 14:21, Ted Yu wrote:
>>>>>>>>>>
>>>>>>>>>>> Can you give us a bit more information:
>>>>>>>>>>> the HBase release you're running
>>>>>>>>>>> what filters are used for the scan
>>>>>>>>>>>
>>>>>>>>>>> Thanks
>>>>>>>>>>>
>>>>>>>>>>> On Apr 10, 2014, at 2:36 AM, gortiz <gor...@pragsis.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> I got this error when I executed a full scan with filters over a
>>>>>>>>>>>> table:
>>>>>>>>>>>>
>>>>>>>>>>>> Caused by: java.lang.RuntimeException:
>>>>>>>>>>>> org.apache.hadoop.hbase.regionserver.LeaseException:
>>>>>>>>>>>> org.apache.hadoop.hbase.regionserver.LeaseException: lease
>>>>>>>>>>>> '-4165751462641113359' does not exist
>>>>>>>>>>>>   at org.apache.hadoop.hbase.regionserver.Leases.removeLease(Leases.java:231)
>>>>>>>>>>>>   at org.apache.hadoop.hbase.regionserver.HRegionServer.next(HRegionServer.java:2482)
>>>>>>>>>>>>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>>>>>>>>>>   at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>>>>>>>>>>>>   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>>>>>>>>>>>   at java.lang.reflect.Method.invoke(Method.java:597)
>>>>>>>>>>>>   at org.apache.hadoop.hbase.ipc.WritableRpcEngine$Server.call(WritableRpcEngine.java:320)
>>>>>>>>>>>>   at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1428)
>>>>>>>>>>>>
>>>>>>>>>>>> I have read about increasing the lease time and the RPC time,
>>>>>>>>>>>> but it's not working... what else could I try? The table isn't
>>>>>>>>>>>> too big. I have been checking the logs from the GC, the HMaster,
>>>>>>>>>>>> and some RegionServers, and I didn't see anything weird. I also
>>>>>>>>>>>> tried a couple of caching values.
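[Editor's note: the lease and RPC timeouts raised in the message above live in hbase-site.xml. A sketch for the 0.94-era property names, with the three-minute values cited in the thread; values are illustrative, not recommendations:]

```xml
<!-- hbase-site.xml: scanner lease and client RPC timeout (0.94-era names) -->
<property>
  <name>hbase.regionserver.lease.period</name>
  <value>180000</value> <!-- scanner lease, in ms (three minutes) -->
</property>
<property>
  <name>hbase.rpc.timeout</name>
  <value>180000</value> <!-- client RPC timeout, in ms -->
</property>
```

Lowering scanner caching can also help with this exception, since each next() call then returns sooner and renews the lease before it expires.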
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> *Guillermo Ortiz*
>>>>>>>>> /Big Data Developer/
>>>>>>>>>
>>>>>>>>> Telf.: +34 917 680 490
>>>>>>>>> Fax: +34 913 833 301
>>>>>>>>> C/ Manuel Tovar, 49-53 - 28034 Madrid - Spain
>>>>>>>>>
>>>>>>>>> _http://www.bidoop.es_
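[Editor's note: to make the schema advice in the thread concrete, here is a minimal sketch, in plain Python rather than HBase code, of the two designs for one day of per-minute readings. Cells are modeled as (row, qualifier, timestamp, value) tuples; all names are illustrative.]

```python
# Two ways to model one day of per-minute readings for a single row.
MINUTES = 1440

def versions_design(row, values):
    # One qualifier, 1440 cell versions: the temporal attribute is hidden
    # in HBase's internal versioning, and VERSIONS must be raised to 1440.
    return [(row, "d:reading", ts, v) for ts, v in enumerate(values)]

def wide_column_design(row, values):
    # 1440 qualifiers, one version each: the timestamp is part of the
    # schema (encoded in the column name), so the default VERSIONS suffices.
    return [(row, "d:reading_%04d" % ts, ts, v) for ts, v in enumerate(values)]

values = [str(m % 60) for m in range(MINUTES)]
a = versions_design("row1", values)
b = wide_column_design("row1", values)

# Both designs store the same 1440 values...
assert len(a) == len(b) == MINUTES
# ...but the versions design packs them all under a single column name,
assert len({qual for _, qual, _, _ in a}) == 1
# while the wide design makes the time dimension explicit in the schema.
assert len({qual for _, qual, _, _ in b}) == MINUTES
```

This is the distinction Mike draws: in the wide design the temporal property is a declared part of the schema rather than a side effect of an internal storage feature.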
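[Editor's note: the 10x timing numbers reported in the thread (80s vs 7.3s) are consistent with the scanner touching every stored cell, not just the newest version of each. A toy model, illustrative Python rather than actual HBase internals:]

```python
# Toy model of a store-file scan: cells are sorted by (row, qualifier,
# timestamp descending), so the newest version comes first, but the
# scanner still has to read or skip every cell that is physically stored.

def scan_newest(cells):
    cells_read = 0
    newest = {}
    for row, qual, ts, val in cells:   # cells arrive sorted, newest first
        cells_read += 1                # every stored cell is touched
        if (row, qual) not in newest:  # only the first (newest) is kept
            newest[(row, qual)] = val
    return newest, cells_read

def make_table(rows, versions):
    # One column per row, `versions` versions, timestamps descending.
    return [(r, "d:c", ts, "v")
            for r in range(rows)
            for ts in range(versions - 1, -1, -1)]

_, read_1000 = scan_newest(make_table(100, 1000))
_, read_100 = scan_newest(make_table(100, 100))

# 10x more versions means 10x more cells read, matching the observed times,
assert read_1000 == 10 * read_100
# even though the result set is still one value per row/column.
newest, _ = scan_newest(make_table(100, 1000))
assert len(newest) == 100
```

In the same spirit as Anoop's suggestion, a major compaction shrinks the work by physically dropping versions beyond the CF's max-versions setting, which is why the setting, and not just the query, determines scan cost.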