I don't want to be argumentative here, but by definition it's not an internal
feature, because it's part of the public API. We use versioning in a way that
makes me somewhat uncomfortable, but it's been quite useful. I'd like to see a
clear explanation of why it exists and what use cases it was intended to
support.

Brian

> Since you asked… 
> 
> Simplest answer… your schema should not rely upon internal features of the 
> system. Since you are tracking your data along the lines of a temporal 
> attribute, the attribute should be part of the schema. In terms of good 
> design, by making it part of the schema, you're declaring that the data has 
> a temporal property/attribute. 
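> 
> As a quick sketch in the shell (the table, family, and qualifier names are 
> just illustrative), one reading per minute becomes one column per minute 
> instead of 1440 versions of a single cell: 
> 
>     put 'metrics', 'sensor42', 'd:t201404121030', '17' 
>     put 'metrics', 'sensor42', 'd:t201404121031', '18' 
> 
> and reading back a time window is then a plain column-range scan: 
> 
>     scan 'metrics', {STARTROW => 'sensor42', STOPROW => 'sensor43', 
>       FILTER => "ColumnRangeFilter('t201404121030', true, 't201404121100', false)"} 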
> 
> Cell versioning is an internal feature of HBase. It's there for a reason; 
> perhaps one of the committers should expand on why it's there. (When I asked 
> this earlier, I never got an answer.) 
> 
> 
> Longer answer… review how HBase stores the rows, including the versions of 
> the cell. You're putting unnecessary stress on the system. 
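> 
> Every version is stored as its own KeyValue, sorted by (row, family, 
> qualifier, timestamp descending), so a scan has to step over every stored 
> version of a cell. A raw scan makes that visible (a sketch; the names here 
> are illustrative): 
> 
>     scan 'metrics', {RAW => true, VERSIONS => 10} 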
> 
> It's just not Zen… ;-) 
> 
> The reason I'm a bit short on this topic is that it's an issue that keeps 
> coming up, over and over again, because some idiot keeps looking to take a 
> shortcut without understanding the implications of the decision. Just like 
> salting the key. (Note: prepending a truncated hash isn't the same as using 
> a salt. Salting has a specific meaning, and the salt is orthogonal to the 
> underlying key. Any relationship between the salt and the key is pure 
> chance.) 
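> 
> A tiny JRuby sketch of the difference (it runs in the HBase shell; the 
> names and the bucket count are illustrative): 
> 
>     key    = 'sensor42' 
>     bucket = key.sum % 16   # truncated hash: a pure function of the key 
>     salt   = rand(16)       # real salt: random, orthogonal to the key; it 
>                             # must be stored somewhere or the row is lost 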
> 
> Does that help? 
> (BTW, this should be part of any schema design talk… yet somehow I think it's 
> not covered…) 
> 
> -Mike
> 
> PS. It's not weird that the cell versions are checked. It makes perfect sense. 
> 
> On Apr 12, 2014, at 2:55 PM, Guillermo Ortiz <konstt2...@gmail.com> wrote:
> 
>> Well, it was just an example of why I might keep a thousand versions of a
>> cell. I didn't know that HBase checks each version when I do a scan; that
>> seems a little odd when the data is sorted.
>> 
>> You got my attention with your comment that it's better to store data over
>> time with new columns than with versions. Why is it better? Versions look
>> very convenient for that use case. So, does a row key with 3600 columns work
>> better than a row key with one column with 3600 versions? What's the reason
>> for avoiding massive use of versions?
>> 
>> 
>> 2014-04-12 15:07 GMT+02:00 Michael Segel <michael_se...@hotmail.com>:
>> 
>>> Silly question...
>>> 
>>> Why does the idea of using versioning to capture temporal changes to data
>>> keep being propagated?
>>> 
>>> Seriously, this issue keeps popping up...
>>> 
>>> If you want to capture data over time, use a timestamp as part of the
>>> column name. Don't abuse the cell's versions.
>>> 
>>> 
>>> 
>>> On Apr 11, 2014, at 11:03 AM, gortiz <gor...@pragsis.com> wrote:
>>> 
>>>> Yes, I have tried two different values for max versions: 1000 and the
>>>> maximum integer value.
>>>> 
>>>> But I want to keep those versions; I don't want to keep just three.
>>>> Imagine that I want to record a new version each minute and store a whole
>>>> day: that's 1440 versions.
>>>> 
>>>> Why is HBase going to read all the versions? I thought that if you don't
>>>> ask for any versions, it just reads the newest and skips the rest. It
>>>> doesn't make much sense to read all of them if the data is sorted, plus
>>>> the newest version is stored at the top.
>>>> 
>>>> 
>>>> On 11/04/14 11:54, Anoop John wrote:
>>>>> What max versions setting have you configured for your table's CF? When
>>>>> you set such a value, HBase has to keep all those versions, and during a
>>>>> scan it will read all of them. In the 0.94 release the default for max
>>>>> versions is 3; I guess you have set some bigger value. If you have not,
>>>>> would you mind testing after a major compaction?
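>>>>> 
>>>>> Something like this in the shell, as a sketch (I am assuming a family
>>>>> named 'cf1', adjust to yours; note that in 0.94 the table must be
>>>>> disabled for the alter unless online schema updates are enabled):
>>>>> 
>>>>>     describe 'filters'        # shows VERSIONS for each family
>>>>>     disable 'filters'
>>>>>     alter 'filters', {NAME => 'cf1', VERSIONS => 3}
>>>>>     enable 'filters'
>>>>>     major_compact 'filters'   # rewrites store files, dropping excess versions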
>>>>> 
>>>>> -Anoop-
>>>>> 
>>>>> On Fri, Apr 11, 2014 at 1:01 PM, gortiz <gor...@pragsis.com> wrote:
>>>>> 
>>>>>> The last test I have done is to reduce the number of versions to 100.
>>>>>> So, right now, I have 100 rows with 100 versions each.
>>>>>> The times are (I got the same times for block sizes of 64KB and 1MB):
>>>>>> 100 rows, 1000 versions + block cache -> 80s.
>>>>>> 100 rows, 1000 versions + no block cache -> 70s.
>>>>>> 
>>>>>> 100 rows, *100* versions + block cache -> 7.3s.
>>>>>> 100 rows, *100* versions + no block cache -> 6.1s.
>>>>>> 
>>>>>> What's the reason for this? I guessed HBase was smart enough not to
>>>>>> consider old versions and to check just the newest one. But I reduced
>>>>>> the size (in versions) by 10x and I got a 10x gain in performance.
>>>>>> 
>>>>>> The filter is:
>>>>>> scan 'filters', {FILTER => "ValueFilter(=, 'binary:5')",
>>>>>>   STARTROW => '1010000000000000000000000000000000000101',
>>>>>>   STOPROW => '6010000000000000000000000000000000000201'}
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> On 11/04/14 09:04, gortiz wrote:
>>>>>> 
>>>>>>> Well, I guessed that, but it doesn't make much sense because it's so
>>>>>>> slow. Right now I only have 100 rows with 1000 versions each.
>>>>>>> I have checked the size of the dataset and each row is about 700KB
>>>>>>> (around 7GB for 100 rows x 1000 versions). So it should only check
>>>>>>> 100 rows x 700KB = 70MB, since it just checks the newest version. How
>>>>>>> can it spend so much time checking that quantity of data?
>>>>>>> 
>>>>>>> I'm generating the dataset again with a bigger block size (previously
>>>>>>> it was 64KB; now it's going to be 1MB). I could try tuning the scanner
>>>>>>> caching and batching parameters, but I don't think they're going to
>>>>>>> have much effect.
>>>>>>> 
>>>>>>> Another test I want to do is to generate the same dataset with just
>>>>>>> 100 versions. It should take around the same time, right? Or am I
>>>>>>> wrong?
>>>>>>> 
>>>>>>> On 10/04/14 18:08, Ted Yu wrote:
>>>>>>> 
>>>>>>>> It should be the newest version of each value.
>>>>>>>> 
>>>>>>>> Cheers
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On Thu, Apr 10, 2014 at 9:55 AM, gortiz <gor...@pragsis.com> wrote:
>>>>>>>> 
>>>>>>>>> Another little question: with the filter I'm using, do I check all
>>>>>>>>> the versions, or just the newest? I'm wondering whether, when I scan
>>>>>>>>> the whole table, I'm looking for the value "5" in the whole dataset
>>>>>>>>> or just in the newest version of each value.
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> On 10/04/14 16:52, gortiz wrote:
>>>>>>>>> 
>>>>>>>>>> I was trying to check the behaviour of HBase. The cluster is a
>>>>>>>>>> group of old computers: one master and five slaves, each one with
>>>>>>>>>> 2GB of RAM, so 12GB in total.
>>>>>>>>>> The table has a column family with 1000 columns, each column with
>>>>>>>>>> 100 versions.
>>>>>>>>>> There's another column family with four columns and one image of
>>>>>>>>>> 100KB. (I've tried without this column family as well.)
>>>>>>>>>> The table is partitioned manually across all the slaves, so data
>>>>>>>>>> is balanced in the cluster.
>>>>>>>>>> 
>>>>>>>>>> I'm executing this statement in HBase 0.94.6:
>>>>>>>>>> scan 'table1', {FILTER => "ValueFilter(=, 'binary:5')"}
>>>>>>>>>> My lease and RPC timeouts are set to three minutes.
>>>>>>>>>> Since it's a full scan of the table, I have been playing with the
>>>>>>>>>> BLOCKCACHE as well (just disabling and enabling it, not changing
>>>>>>>>>> its size). I thought it was going to cause too many GC calls; I'm
>>>>>>>>>> not sure about that point.
>>>>>>>>>> 
>>>>>>>>>> I know this isn't the best way to use HBase; it's just a test. I
>>>>>>>>>> think it's not working because the hardware isn't enough, although
>>>>>>>>>> I would like to try some tuning to improve it.
>>>>>>>>>> 
>>>>>>>>>> On 10/04/14 14:21, Ted Yu wrote:
>>>>>>>>>> 
>>>>>>>>>> Can you give us a bit more information:
>>>>>>>>>>> HBase release you're running
>>>>>>>>>>> What filters are used for the scan
>>>>>>>>>>> 
>>>>>>>>>>> Thanks
>>>>>>>>>>> 
>>>>>>>>>>> On Apr 10, 2014, at 2:36 AM, gortiz <gor...@pragsis.com> wrote:
>>>>>>>>>>> 
>>>>>>>>>>>> I got this error when I execute a full scan with filters over a
>>>>>>>>>>>> table:
>>>>>>>>>>>> 
>>>>>>>>>>>> Caused by: java.lang.RuntimeException: org.apache.hadoop.hbase.regionserver.LeaseException:
>>>>>>>>>>>> org.apache.hadoop.hbase.regionserver.LeaseException: lease '-4165751462641113359' does not exist
>>>>>>>>>>>>     at org.apache.hadoop.hbase.regionserver.Leases.removeLease(Leases.java:231)
>>>>>>>>>>>>     at org.apache.hadoop.hbase.regionserver.HRegionServer.next(HRegionServer.java:2482)
>>>>>>>>>>>>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>>>>>>>>>>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>>>>>>>>>>>>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>>>>>>>>>>>     at java.lang.reflect.Method.invoke(Method.java:597)
>>>>>>>>>>>>     at org.apache.hadoop.hbase.ipc.WritableRpcEngine$Server.call(WritableRpcEngine.java:320)
>>>>>>>>>>>>     at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1428)
>>>>>>>>>>>> 
>>>>>>>>>>>> I have read about increasing the lease period and the RPC
>>>>>>>>>>>> timeout, but it's not working. What else could I try? The table
>>>>>>>>>>>> isn't too big. I have been checking the GC, HMaster, and some
>>>>>>>>>>>> RegionServer logs and I didn't see anything weird. I also tried a
>>>>>>>>>>>> couple of scanner caching values.
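>>>>>>>>>>>> 
>>>>>>>>>>>> These are the hbase-site.xml properties I changed (a sketch; the
>>>>>>>>>>>> three-minute values are just what I picked, and the lease period
>>>>>>>>>>>> is read by the region servers, so they need a restart):
>>>>>>>>>>>> 
>>>>>>>>>>>>     <!-- scanner lease period in ms (default 60000 in 0.94) -->
>>>>>>>>>>>>     <property>
>>>>>>>>>>>>       <name>hbase.regionserver.lease.period</name>
>>>>>>>>>>>>       <value>180000</value>
>>>>>>>>>>>>     </property>
>>>>>>>>>>>>     <!-- client RPC timeout in ms; keep it >= the lease period -->
>>>>>>>>>>>>     <property>
>>>>>>>>>>>>       <name>hbase.rpc.timeout</name>
>>>>>>>>>>>>       <value>180000</value>
>>>>>>>>>>>>     </property>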
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>> --
>>>>>>>>> *Guillermo Ortiz*
>>>>>>>>> /Big Data Developer/
>>>>>>>>> 
>>>>>>>>> Telf.: +34 917 680 490
>>>>>>>>> Fax: +34 913 833 301
>>>>>>>>> C/ Manuel Tovar, 49-53 - 28034 Madrid - Spain
>>>>>>>>> 
>>>>>>>>> _http://www.bidoop.es_
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>> --
>>>>>> *Guillermo Ortiz*
>>>>>> /Big Data Developer/
>>>>>> 
>>>>>> Telf.: +34 917 680 490
>>>>>> Fax: +34 913 833 301
>>>>>> C/ Manuel Tovar, 49-53 - 28034 Madrid - Spain
>>>>>> 
>>>>>> _http://www.bidoop.es_
>>>>>> 
>>>>>> 
>>>> 
>>>> 
>>>> --
>>>> *Guillermo Ortiz*
>>>> /Big Data Developer/
>>>> 
>>>> Telf.: +34 917 680 490
>>>> Fax: +34 913 833 301
>>>> C/ Manuel Tovar, 49-53 - 28034 Madrid - Spain
>>>> 
>>>> _http://www.bidoop.es_
>>>> 
>>> 
>>> 
> 
> 
