Re: Rowkey design question

2015-04-17 Thread Michael Segel
Sorry, but … 

We are in violent agreement. 
If done wrong it can and will kill you. 
Murphy’s law. If there’s more than one way to do something … the wrong way will 
be chosen, so where does that leave you? 

And then there’s the security concern, which hasn’t been mentioned, which is odd: with 
XASecure, now Ranger (until they try yet another name), you need coprocessors 
acting as triggers to check whether you have permission to write to, or 
read, data in a table. 

If you’re new to HBase… don’t use a coprocessor. That’s just asking for 
trouble.  (And I know everyone here knows that to be the truth.) 

 On Apr 12, 2015, at 1:45 AM, lars hofhansl la...@apache.org wrote:
 
 After the fun interlude (sorry about that) let me get back to the issue.
 
 There are multiple considerations:
 
 1. Row vs. column. If in doubt, err on the side of more rows. Only use many 
 columns in a row when you need transactions over the data in the columns.
 2. Value sizes. HBase is good at dealing with many small things. 1-5mb values 
 here and there are OK, but most rows should be < a few dozen KBs. Otherwise 
 you'll see too much write amplification.
 3. Column families. Place columns you typically access together in the same 
 column family, and try to keep columns you don't access together mostly in 
 different families.
 HBase can then efficiently rule out a large body of data to scan, by avoiding 
 scanning families that are not needed.
 4. Coprocessors and filters let you transform/filter things where the data 
 is. The benefit can be huge.  With coprocessors you can trap scan requests 
 (next() calls) and inject your own logic.
 That's what Phoenix does, for example, and it's pretty efficient if done right 
 (if you do it wrong you can kill your region server).
 
 On #2. You might want to invent a scheme where you store smaller values by 
 value (i.e. in HBase) and larger ones by reference.
 
 I would put the column with the large value in its own family so that you 
 could scan the rest of the metadata without requiring HBase to read the large 
 value.
 You can follow a simple protocol:
 A. If the value is small (pick some notion of small between 1 and 10mb), 
 store it in HBase, in a separate family.
 B. Otherwise:
 1. Write a row with the intended location of the file holding the value in 
 HDFS.
 2. Write the value into the HDFS file. Make sure the file location has a 
 random element to avoid races.
 3. Update the row created in #1 with a commit column (just a column you set 
 to true), this is like a commit.
 (only when a writer reaches this point should the value be considered written)
 
 Note that everything is idempotent. The worst that can happen is that the 
 process fails between #2 and #3. Now you have orphaned data in HDFS. Since 
 the HDFS location has a random element in it, you can just retry.
 You can either leave orphaned data (since the commit bit is not set, it's not 
 visible to a client), or you periodically look for those and clean them up.
 
 Hope this helps. Please let us know how it goes.
 
 -- Lars
 
 
 
 From: Kristoffer Sjögren sto...@gmail.com
 To: user@hbase.apache.org 
 Sent: Wednesday, April 8, 2015 6:41 AM
 Subject: Re: Rowkey design question
 
 
 Yes, I think you're right. Adding one or more dimensions to the rowkey
 would indeed make the table narrower.
 
 And I guess it also makes sense to store actual values (bigger qualifiers)
 outside HBase. Keeping them in Hadoop, why not? Pulling hot ones out on SSD
 caches would be an interesting solution. And quite a bit simpler.
 
 Good call and thanks for the tip! :-)
 
 
 
 
 On Wed, Apr 8, 2015 at 1:45 PM, Michael Segel michael_se...@hotmail.com
 wrote:
 
 Ok…
 
 First, I’d suggest you rethink your schema by adding an additional
 dimension.
 You’ll end up with more rows, but a narrower table.
 
 In terms of compaction… if the data is relatively static, you won’t have
 compactions because nothing changed.
 But if your data is that static… why not put the data in sequence files
 and use HBase as the index. Could be faster.
 
 HTH
 
 -Mike
 
 On Apr 8, 2015, at 3:26 AM, Kristoffer Sjögren sto...@gmail.com wrote:
 
 I just read through HBase MOB design document and one thing that caught
 my
 attention was the following statement.
 
 When HBase deals with large numbers of values > 100kb and up to ~10MB of
 data, it encounters performance degradations due to write amplification
 caused by splits and compactions.
 
 Is there any chance to run into this problem in the read path for data
 that
 is written infrequently and never changed?
 
 On Wed, Apr 8, 2015 at 9:30 AM, Kristoffer Sjögren sto...@gmail.com
 wrote:
 
 A small set of qualifiers will be accessed frequently so keeping them in
 block cache would be very beneficial. Some very seldom. So this sounds
 very
 promising!
 
 The reason why i'm considering a coprocessor is that I need to provide
 very specific information in the query request. Same thing with the
 response

Re: Rowkey design question

2015-04-12 Thread lars hofhansl
After the fun interlude (sorry about that) let me get back to the issue.

There are multiple considerations:

1. Row vs. column. If in doubt, err on the side of more rows. Only use many 
columns in a row when you need transactions over the data in the columns.
2. Value sizes. HBase is good at dealing with many small things. 1-5mb values 
here and there are OK, but most rows should be < a few dozen KBs. Otherwise 
you'll see too much write amplification.
3. Column families. Place columns you typically access together in the same 
column family, and try to keep columns you don't access together mostly in 
different families.
HBase can then efficiently rule out a large body of data to scan, by avoiding 
scanning families that are not needed.
4. Coprocessors and filters let you transform/filter things where the data is. 
The benefit can be huge.  With coprocessors you can trap scan requests 
(next() calls) and inject your own logic.
That's what Phoenix does, for example, and it's pretty efficient if done right 
(if you do it wrong you can kill your region server).
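
To make #4 concrete, here is a minimal, illustrative sketch of trapping scan next() 
calls with a region observer (pre-2.0 observer API; the class, table and column 
names are made up, so verify the exact hook signature against the HBase version you 
actually run):

import java.io.IOException;
import java.util.Iterator;
import java.util.List;

import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.coprocessor.BaseRegionObserver;
import org.apache.hadoop.hbase.coprocessor.ObserverContext;
import org.apache.hadoop.hbase.coprocessor.RegionCoprocessorEnvironment;
import org.apache.hadoop.hbase.regionserver.InternalScanner;
import org.apache.hadoop.hbase.util.Bytes;

// Hypothetical observer: drop rows carrying a "deleted" marker right where
// the data lives, instead of shipping them to the client and filtering there.
public class FilteringObserver extends BaseRegionObserver {

  private static final byte[] META = Bytes.toBytes("meta");
  private static final byte[] DELETED = Bytes.toBytes("deleted");

  @Override
  public boolean postScannerNext(ObserverContext<RegionCoprocessorEnvironment> ctx,
      InternalScanner scanner, List<Result> results, int limit, boolean hasMore)
      throws IOException {
    // Called for every batch a scanner hands back; inject custom logic here.
    Iterator<Result> it = results.iterator();
    while (it.hasNext()) {
      if (it.next().containsColumn(META, DELETED)) {
        it.remove();
      }
    }
    return hasMore;
  }
}

This is exactly the kind of code that runs inside the region server JVM, which is 
why a bug here hurts the whole server.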

On #2. You might want to invent a scheme where you store smaller values by 
value (i.e. in HBase) and larger ones by reference.

I would put the column with the large value in its own family so that you could 
scan the rest of the metadata without requiring HBase to read the large value.
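
For example (family name is a placeholder), a metadata-only scan simply restricts 
itself to the small family:

import java.io.IOException;

import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class MetaOnlyScan {
  // Read only the small "meta" family; blocks of the fat value family are
  // never touched, so they never displace anything in the block cache.
  static void scanMetadata(Table table) throws IOException {
    Scan scan = new Scan();
    scan.addFamily(Bytes.toBytes("meta"));
    try (ResultScanner rs = table.getScanner(scan)) {
      for (Result r : rs) {
        // inspect metadata columns only
      }
    }
  }
}
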
You can follow a simple protocol:
A. If the value is small (pick some notion of small between 1 and 10mb), store 
it in HBase, in a separate family.
B. Otherwise:
1. Write a row with the intended location of the file holding the value in HDFS.
2. Write the value into the HDFS file. Make sure the file location has a random 
element to avoid races.
3. Update the row created in #1 with a commit column (just a column you set to 
true), this is like a commit.
(only when a writer reaches this point should the value be considered written)

Note that everything is idempotent. The worst that can happen is that the 
process fails between #2 and #3. Now you have orphaned data in HDFS. Since the 
HDFS location has a random element in it, you can just retry.
You can either leave orphaned data (since the commit bit is not set, it's not 
visible to a client), or you periodically look for those and clean them up.
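
A minimal sketch of the whole protocol, assuming the 1.x client API; the table, 
family, column and path names here are placeholders, the 5 MB threshold is 
arbitrary, and error handling/retries are left out:

import java.io.IOException;
import java.io.OutputStream;
import java.util.UUID;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class LargeValueWriter {

  private static final byte[] META = Bytes.toBytes("meta");    // small metadata columns
  private static final byte[] BLOB = Bytes.toBytes("blob");    // inline values live here
  private static final byte[] VALUE = Bytes.toBytes("v");
  private static final byte[] REF = Bytes.toBytes("ref");
  private static final byte[] COMMIT = Bytes.toBytes("commit");

  private static final int SMALL_LIMIT = 5 * 1024 * 1024;      // your notion of "small"

  static void write(Table table, FileSystem fs, byte[] row, byte[] value) throws IOException {
    if (value.length <= SMALL_LIMIT) {
      // A. small value: store it by value, in its own family
      Put p = new Put(row);
      p.addColumn(BLOB, VALUE, value);
      table.put(p);
      return;
    }
    // B.1 write the intended HDFS location; the random element avoids races
    Path location = new Path("/data/blobs/" + UUID.randomUUID());
    Put intent = new Put(row);
    intent.addColumn(META, REF, Bytes.toBytes(location.toString()));
    table.put(intent);

    // B.2 write the value into the HDFS file
    try (OutputStream out = fs.create(location)) {
      out.write(value);
    }

    // B.3 flip the commit column; only now is the value considered written
    Put commit = new Put(row);
    commit.addColumn(META, COMMIT, Bytes.toBytes(true));
    table.put(commit);
  }
}

A reader simply ignores any reference column whose row has no commit column set.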

Hope this helps. Please let us know how it goes.

-- Lars



From: Kristoffer Sjögren sto...@gmail.com
To: user@hbase.apache.org 
Sent: Wednesday, April 8, 2015 6:41 AM
Subject: Re: Rowkey design question


Yes, I think you're right. Adding one or more dimensions to the rowkey
would indeed make the table narrower.

And I guess it also makes sense to store actual values (bigger qualifiers)
outside HBase. Keeping them in Hadoop, why not? Pulling hot ones out on SSD
caches would be an interesting solution. And quite a bit simpler.

Good call and thanks for the tip! :-)




On Wed, Apr 8, 2015 at 1:45 PM, Michael Segel michael_se...@hotmail.com
wrote:

 Ok…

 First, I’d suggest you rethink your schema by adding an additional
 dimension.
 You’ll end up with more rows, but a narrower table.

 In terms of compaction… if the data is relatively static, you won’t have
 compactions because nothing changed.
 But if your data is that static… why not put the data in sequence files
 and use HBase as the index. Could be faster.

 HTH

 -Mike

  On Apr 8, 2015, at 3:26 AM, Kristoffer Sjögren sto...@gmail.com wrote:
 
  I just read through HBase MOB design document and one thing that caught
 my
  attention was the following statement.
 
When HBase deals with large numbers of values > 100kb and up to ~10MB of
  data, it encounters performance degradations due to write amplification
  caused by splits and compactions.
 
  Is there any chance to run into this problem in the read path for data
 that
  is written infrequently and never changed?
 
  On Wed, Apr 8, 2015 at 9:30 AM, Kristoffer Sjögren sto...@gmail.com
 wrote:
 
  A small set of qualifiers will be accessed frequently so keeping them in
  block cache would be very beneficial. Some very seldom. So this sounds
 very
  promising!
 
  The reason why i'm considering a coprocessor is that I need to provide
  very specific information in the query request. Same thing with the
  response. Queries are also highly parallelizable across rows and each
  individual query produce a valid result that may or may not be
 aggregated
  with other results in the client, maybe even inside the region if it
  contained multiple rows targeted by the query.
 
  So it's a bit like Phoenix but with a different storage format and query
  engine.
 
  On Wed, Apr 8, 2015 at 12:46 AM, Nick Dimiduk ndimi...@gmail.com
 wrote:
 
  Those rows are written out into HBase blocks on cell boundaries. Your
  column family has a BLOCK_SIZE attribute, which you may or may not have
  overridden from the default of 64k. Cells are written into a block until it
  is >= the target block size. So your single 500mb row will be broken down
  into
  thousands of HFile

Re: Rowkey design question

2015-04-11 Thread Sean Busbey
Lars, Andrew, Michael,

This particular discussion isn't bearing fruit for the user@hbase audience.
If you wish to continue it, especially with the current tone, please do so
on dev@.

Michael, IANAL but the ASF offers indemnification as a means of encouraging
development and adoption of the projects it hosts. If you'd like to know
about the specific protections afforded you as a contributor please take it
up with legal@apache.

-- 
Sean
On Apr 11, 2015 12:59 PM, Michael Segel michael_se...@hotmail.com wrote:

 Well Lars, looks like that hypoxia has set in…

 If you’ve paid attention, it’s not that I’m against server side
 extensibility.

 It’s how it’s been implemented that’s a bit brain dead.

 I suggest you think more about why having end user code running in the
 same JVM as the RS is not a good thing.
 (Which is why in Feb. Andrew made a patch that allowed one to turn off the
 coprocessor function completely or after the system coprocessors loaded. )

 The sad truth is that you could have run the coprocessor code in a
 separate JVM.
 You have to remember coprocessors are triggers, stored procedures and
 extensibility all rolled into one.

 As to providing a patch… will you indemnify me if I get sued?  ;-)
 Didn’t think so.

  On Apr 9, 2015, at 10:13 PM, lars hofhansl la...@apache.org wrote:
 
  "if you lecture people and call them stupid (as you did in an earlier
 email)"
  He said (quote) "committers are suffering from rectal induced hypoxia";
 we can let that pass as stupid, I think. :) Maybe Michael can explain some
 day what rectal induced hypoxia is. I'm dying to know what I suffer from.
 
  In any case and in all seriousness. Michael, feel free to educate
 yourself about what the intended use of coprocessors is - preferably before
 you come here and start an argument ... again. We're more than happy to
 accept a patch from you with a correct implementation.
 
  Can we just let this thread die? It didn't start with a useful
 proposition.
 
  -- Lars
 
  From: Andrew Purtell apurt...@apache.org
  To: user@hbase.apache.org user@hbase.apache.org
  Sent: Thursday, April 9, 2015 4:53 PM
  Subject: Re: Rowkey design question
 
  On Thu, Apr 9, 2015 at 2:26 PM, Michael Segel michael_se...@hotmail.com
 
  wrote:
 
  Hint: You could have sandboxed the end user code which makes it a lot
  easier to manage.
 
 
  I filed the fucking JIRA for that. Look at HBASE-4047. As a matter of
  social grace, if you lecture people and call them stupid (as you did in
 an
  earlier email) while making the same fucking argument the other person
  made, this doesn't work.
 
  The reason I never did finish HBASE-4047 is I didn't need to. Nobody here
  or where I worked, ultimately, was banging down the door for an external
  coprocessor host. What we have works well enough for people today.
 
  If you do think the external coprocessor host is essential, try taking on
  the actual engineering challenges involved. Hint: They are not easy. Put
 up
  a patch. Writing words in an email is easy. ​
 
 
 
 
 
 
  --
  Best regards,
 
- Andy
 
  Problems worthy of attack prove their worth by hitting back. - Piet Hein
  (via Tom White)
 

 The opinions expressed here are mine, while they may reflect a cognitive
 thought, that is purely accidental.
 Use at your own risk.
 Michael Segel
 michael_segel (AT) hotmail.com








Re: Rowkey design question

2015-04-11 Thread Kevin O'dell
Trying to figure out the best place to jump in here...

Kristoffer,

  I would like to echo what Michael and Andrew have said.  While a
pre-aggregation co-proc may work, in my experience with co-procs they are
typically more trouble than they are worth.  I would first try this outside
of a co-proc, from the client, taking advantage of filters.

How is this data coming in?  Could we help with some pre-aggregation with
Flume interceptors or Storm and enrich the events in flight?  This could
help take some work off of the client and give you the speed you need without
dropping custom code into the RS JVM...which should ALWAYS be the last
resort.

On Thu, Apr 9, 2015 at 4:02 PM, Andrew Purtell apurt...@apache.org wrote:

 This is one person's opinion, to which he is absolutely entitled, but
 blanket black and white statements like "coprocessors are poorly
 implemented" are obviously not an opinion shared by all those who have used
 them successfully, nor the HBase committers, or we would remove the
 feature. On the other hand, you should really ask yourself if in-server
 extension is necessary. That should be a last resort, really, for the
 security and performance considerations Michael mentions.


 On Thu, Apr 9, 2015 at 5:05 AM, Michael Segel michael_se...@hotmail.com
 wrote:

  Ok…
  Coprocessors are poorly implemented in HBase.
  If you work in a secure environment, outside of the system coprocessors…
  (ones that you load from hbase-site.xml) , you don’t want to use them.
 (The
  coprocessor code runs on the same JVM as the RS.)  This means that if you
  have a poorly written coprocessor, you will kill performance for all of
  HBase. If you’re not using them in a secure environment, you have to
  consider how they are going to be used.
 
 
  Without really knowing more about your use case, it's impossible to say
  if the coprocessor would be a good idea.
 
 
  It sounds like you may have an unrealistic expectation as to how well
  HBase performs.
 
  HTH
 
  -Mike
 
   On Apr 9, 2015, at 1:05 AM, Kristoffer Sjögren sto...@gmail.com
 wrote:
  
   An HBase coprocessor. My idea is to move as much pre-aggregation as
   possible to where the data lives in the region servers, instead of
 doing
  it
   in the client. If there is good data locality inside and across rows
  within
   regions then I would expect aggregation to be faster in the coprocessor
   (utilize many region servers in parallel) rather than transfer data
 over
   the network from multiple region servers to a single client that would
 do
   the same calculation on its own.
  
  
   On Thu, Apr 9, 2015 at 4:43 AM, Michael Segel 
 michael_se...@hotmail.com
  
   wrote:
  
   When you say coprocessor, do you mean HBase coprocessors or do you
 mean
  a
   physical hardware coprocessor?
  
   In terms of queries…
  
   HBase can perform a single get() and return the result back quickly.
  (The
   size of the data being returned will impact the overall timing.)
  
   HBase also caches the results so that your first hit will take the
   longest, but as long as the row is cached, the results are returned
  quickly.
  
   If you’re trying to do a scan with a start/stop row set … your timing
  then
   could vary between sub-second and minutes depending on the query.
  
  
   On Apr 8, 2015, at 3:10 PM, Kristoffer Sjögren sto...@gmail.com
  wrote:
  
   But if the coprocessor is omitted then CPU cycles from region servers
  are
   lost, so where would the query execution go?
  
    Queries need to be quick (sub-second rather than seconds) and HDFS
 is
   quite latency hungry, unless there are optimizations that i'm unaware
  of?
  
  
  
   On Wed, Apr 8, 2015 at 7:43 PM, Michael Segel 
  michael_se...@hotmail.com
  
   wrote:
  
   I think you misunderstood.
  
   The suggestion was to put the data in to HDFS sequence files and to
  use
   HBase to store an index in to the file. (URL to the file, then
 offset
   in to
   the file for the start of the record…)
  
   The reason you want to do this is that you’re reading in large
 amounts
   of
   data and its more efficient to do this from HDFS than through HBase.
  
   On Apr 8, 2015, at 8:41 AM, Kristoffer Sjögren sto...@gmail.com
   wrote:
  
   Yes, I think you're right. Adding one or more dimensions to the
  rowkey
   would indeed make the table narrower.
  
    And I guess it also makes sense to store actual values (bigger
   qualifiers)
   outside HBase. Keeping them in Hadoop why not? Pulling hot ones out
  on
   SSD
   caches would be an interesting solution. And quite a bit simpler.
  
   Good call and thanks for the tip! :-)
  
   On Wed, Apr 8, 2015 at 1:45 PM, Michael Segel 
   michael_se...@hotmail.com
  
   wrote:
  
   Ok…
  
   First, I’d suggest you rethink your schema by adding an additional
   dimension.
   You’ll end up with more rows, but a narrower table.
  
   In terms of compaction… if the data is relatively static, you
 won’t
   have
   compactions because nothing changed.
   But if your data is that 

Re: Rowkey design question

2015-04-11 Thread Andrew Purtell
Yes, the tone is a problem here, but the good news is rectally induced
hypoxia isn't a real medical condition, a patch seems possible, social
grace isn't required for producing a patch, and patches are always welcome.
What else is there to say, really? I think we're done.

On Saturday, April 11, 2015, Sean Busbey bus...@cloudera.com wrote:

 Lars, Andrew, Michael,

 This particular discussion isn't bearing fruit for the user@hbase
 audience.
 If you wish to continue it, especially with the current tone, please do so
 on dev@.

 Michael, IANAL but the ASF offers indemnification as a means of encouraging
 development and adoption of the projects it hosts. If you'd like to know
 about the specific protections afforded you as a contributor please take it
 up with legal@apache.

 --
 Sean
 On Apr 11, 2015 12:59 PM, Michael Segel michael_se...@hotmail.com wrote:

  Well Lars, looks like that hypoxia has set in…
 
  If you’ve paid attention, it’s not that I’m against server side
  extensibility.
 
  It’s how it’s been implemented that’s a bit brain dead.
 
  I suggest you think more about why having end user code running in the
  same JVM as the RS is not a good thing.
  (Which is why in Feb. Andrew made a patch that allowed one to turn off
 the
  coprocessor function completely or after the system coprocessors loaded.
 )
 
  The sad truth is that you could have run the coprocessor code in a
  separate JVM.
  You have to remember coprocessors are triggers, stored procedures and
  extensibility all rolled into one.
 
  As to providing a patch… will you indemnify me if I get sued?  ;-)
  Didn’t think so.
 
   On Apr 9, 2015, at 10:13 PM, lars hofhansl la...@apache.org wrote:
  
   if you lecture people and call them stupid (as you did in an earlier
  email)
   He said (quote) committers are suffering from rectal induced hypoxia,
  we can let that pass as stupid, I think. :)Maybe Michael can explain
 some
  day what rectal induced hypoxia is. I'm dying to know what I suffer
 from.
  
   In any case and in all seriousness. Michael, feel free to educate
  yourself about what the intended use of coprocessors is - preferably
 before
  you come here and start an argument ... again. We're more than happy to
  accept a patch from you with a correct implementation.
  
   Can we just let this thread die? It didn't start with a useful
  proposition.
  
   -- Lars
  
   From: Andrew Purtell apurt...@apache.org
   To: user@hbase.apache.org user@hbase.apache.org
   Sent: Thursday, April 9, 2015 4:53 PM
   Subject: Re: Rowkey design question
  
   On Thu, Apr 9, 2015 at 2:26 PM, Michael Segel 
 michael_se...@hotmail.com
  
   wrote:
  
   Hint: You could have sandboxed the end user code which makes it a lot
   easier to manage.
  
  
   I filed the fucking JIRA for that. Look at HBASE-4047. As a matter of
   social grace, if you lecture people and call them stupid (as you did in
  an
   earlier email) while making the same fucking argument the other person
   made, this doesn't work.
  
   The reason I never did finish HBASE-4047 is I didn't need to. Nobody
 here
   or where I worked, ultimately, was banging down the door for an
 external
   coprocessor host. What we have works well enough for people today.
  
   If you do think the external coprocessor host is essential, try taking
 on
   the actual engineering challenges involved. Hint: They are not easy.
 Put
  up
   a patch. Writing words in an email is easy. ​
  
  
  
  
  
  
   --
   Best regards,
  
 - Andy
  
   Problems worthy of attack prove their worth by hitting back. - Piet
 Hein
   (via Tom White)
  
 
  The opinions expressed here are mine, while they may reflect a cognitive
  thought, that is purely accidental.
  Use at your own risk.
  Michael Segel
  michael_segel (AT) hotmail.com
 
 
 
 
 
 



-- 
Best regards,

   - Andy

Problems worthy of attack prove their worth by hitting back. - Piet Hein
(via Tom White)


Re: Rowkey design question

2015-04-11 Thread Michael Segel
Well Lars, looks like that hypoxia has set in… 

If you’ve paid attention, it’s not that I’m against server side extensibility. 

It’s how it’s been implemented that’s a bit brain dead. 

I suggest you think more about why having end user code running in the same JVM 
as the RS is not a good thing.
(Which is why in Feb. Andrew made a patch that allowed one to turn off the 
coprocessor function completely or after the system coprocessors loaded. ) 
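
If memory serves, the switches that patch introduced are plain site configuration; 
something like the following in the region servers' hbase-site.xml (property names 
quoted from memory, so double-check them against your version's hbase-default.xml):

<property>
  <name>hbase.coprocessor.enabled</name>
  <value>false</value> <!-- turn coprocessor loading off entirely -->
</property>
<property>
  <name>hbase.coprocessor.user.enabled</name>
  <value>false</value> <!-- load system coprocessors only, not table (user) ones -->
</property>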

The sad truth is that you could have run the coprocessor code in a separate 
JVM. 
You have to remember coprocessors are triggers, stored procedures and 
extensibility all rolled into one.

As to providing a patch… will you indemnify me if I get sued?  ;-) 
Didn’t think so.

 On Apr 9, 2015, at 10:13 PM, lars hofhansl la...@apache.org wrote:
 
 "if you lecture people and call them stupid (as you did in an earlier email)" 
 He said (quote) "committers are suffering from rectal induced hypoxia"; we 
 can let that pass as stupid, I think. :) Maybe Michael can explain some day 
 what rectal induced hypoxia is. I'm dying to know what I suffer from.
 
 In any case and in all seriousness. Michael, feel free to educate yourself 
 about what the intended use of coprocessors is - preferably before you come 
 here and start an argument ... again. We're more than happy to accept a patch 
 from you with a correct implementation.
 
 Can we just let this thread die? It didn't start with a useful proposition.
 
 -- Lars
 
 From: Andrew Purtell apurt...@apache.org
 To: user@hbase.apache.org user@hbase.apache.org 
 Sent: Thursday, April 9, 2015 4:53 PM
 Subject: Re: Rowkey design question
 
 On Thu, Apr 9, 2015 at 2:26 PM, Michael Segel michael_se...@hotmail.com
 wrote:
 
 Hint: You could have sandboxed the end user code which makes it a lot
 easier to manage.
 
 
 I filed the fucking JIRA for that. Look at HBASE-4047. As a matter of
 social grace, if you lecture people and call them stupid (as you did in an
 earlier email) while making the same fucking argument the other person
 made, this doesn't work.
 
 The reason I never did finish HBASE-4047 is I didn't need to. Nobody here
 or where I worked, ultimately, was banging down the door for an external
 coprocessor host. What we have works well enough for people today.
 
 If you do think the external coprocessor host is essential, try taking on
 the actual engineering challenges involved. Hint: They are not easy. Put up
 a patch. Writing words in an email is easy. ​
 
 
 
 
 
 
 -- 
 Best regards,
 
   - Andy
 
 Problems worthy of attack prove their worth by hitting back. - Piet Hein
 (via Tom White)
 

The opinions expressed here are mine, while they may reflect a cognitive 
thought, that is purely accidental. 
Use at your own risk. 
Michael Segel
michael_segel (AT) hotmail.com







Re: Rowkey design question

2015-04-09 Thread Michael Segel
Ok… 
Coprocessors are poorly implemented in HBase. 
If you work in a secure environment, outside of the system coprocessors… (ones 
that you load from hbase-site.xml), you don’t want to use them. (The 
coprocessor code runs on the same JVM as the RS.)  This means that if you have 
a poorly written coprocessor, you will kill performance for all of HBase. If 
you’re not using them in a secure environment, you have to consider how they 
are going to be used.  


Without really knowing more about your use case, it's impossible to say if 
the coprocessor would be a good idea. 


It sounds like you may have an unrealistic expectation as to how well HBase 
performs. 

HTH

-Mike

 On Apr 9, 2015, at 1:05 AM, Kristoffer Sjögren sto...@gmail.com wrote:
 
 An HBase coprocessor. My idea is to move as much pre-aggregation as
 possible to where the data lives in the region servers, instead of doing it
 in the client. If there is good data locality inside and across rows within
 regions then I would expect aggregation to be faster in the coprocessor
 (utilize many region servers in parallel) rather than transfer data over
 the network from multiple region servers to a single client that would do
 the same calculation on its own.
 
 
 On Thu, Apr 9, 2015 at 4:43 AM, Michael Segel michael_se...@hotmail.com
 wrote:
 
 When you say coprocessor, do you mean HBase coprocessors or do you mean a
 physical hardware coprocessor?
 
 In terms of queries…
 
 HBase can perform a single get() and return the result back quickly. (The
 size of the data being returned will impact the overall timing.)
 
 HBase also caches the results so that your first hit will take the
 longest, but as long as the row is cached, the results are returned quickly.
 
 If you’re trying to do a scan with a start/stop row set … your timing then
 could vary between sub-second and minutes depending on the query.
 
 
 On Apr 8, 2015, at 3:10 PM, Kristoffer Sjögren sto...@gmail.com wrote:
 
 But if the coprocessor is omitted then CPU cycles from region servers are
 lost, so where would the query execution go?
 
  Queries need to be quick (sub-second rather than seconds) and HDFS is
 quite latency hungry, unless there are optimizations that i'm unaware of?
 
 
 
 On Wed, Apr 8, 2015 at 7:43 PM, Michael Segel michael_se...@hotmail.com
 
 wrote:
 
 I think you misunderstood.
 
 The suggestion was to put the data in to HDFS sequence files and to use
 HBase to store an index in to the file. (URL to the file, then offset
 in to
 the file for the start of the record…)
 
 The reason you want to do this is that you’re reading in large amounts
 of
 data and its more efficient to do this from HDFS than through HBase.
 
 On Apr 8, 2015, at 8:41 AM, Kristoffer Sjögren sto...@gmail.com
 wrote:
 
 Yes, I think you're right. Adding one or more dimensions to the rowkey
 would indeed make the table narrower.
 
  And I guess it also makes sense to store actual values (bigger
 qualifiers)
 outside HBase. Keeping them in Hadoop why not? Pulling hot ones out on
 SSD
 caches would be an interesting solution. And quite a bit simpler.
 
 Good call and thanks for the tip! :-)
 
 On Wed, Apr 8, 2015 at 1:45 PM, Michael Segel 
 michael_se...@hotmail.com
 
 wrote:
 
 Ok…
 
 First, I’d suggest you rethink your schema by adding an additional
 dimension.
 You’ll end up with more rows, but a narrower table.
 
 In terms of compaction… if the data is relatively static, you won’t
 have
 compactions because nothing changed.
 But if your data is that static… why not put the data in sequence
 files
 and use HBase as the index. Could be faster.
 
 HTH
 
 -Mike
 
 On Apr 8, 2015, at 3:26 AM, Kristoffer Sjögren sto...@gmail.com
 wrote:
 
 I just read through HBase MOB design document and one thing that
 caught
 my
 attention was the following statement.
 
  When HBase deals with large numbers of values > 100kb and up to ~10MB of
  data, it encounters performance degradations due to write amplification
  caused by splits and compactions.
 
 Is there any chance to run into this problem in the read path for
 data
 that
 is written infrequently and never changed?
 
 On Wed, Apr 8, 2015 at 9:30 AM, Kristoffer Sjögren sto...@gmail.com
 
 wrote:
 
 A small set of qualifiers will be accessed frequently so keeping
 them
 in
 block cache would be very beneficial. Some very seldom. So this
 sounds
 very
 promising!
 
 The reason why i'm considering a coprocessor is that I need to
 provide
 very specific information in the query request. Same thing with the
 response. Queries are also highly parallelizable across rows and
 each
 individual query produce a valid result that may or may not be
 aggregated
 with other results in the client, maybe even inside the region if it
 contained multiple rows targeted by the query.
 
 So it's a bit like Phoenix but with a different storage format and
 query
 engine.
 
 On Wed, Apr 8, 2015 at 12:46 AM, Nick Dimiduk ndimi...@gmail.com
 wrote:
 
 Those rows are written out into 

Re: Rowkey design question

2015-04-09 Thread Kristoffer Sjögren
An HBase coprocessor. My idea is to move as much pre-aggregation as
possible to where the data lives in the region servers, instead of doing it
in the client. If there is good data locality inside and across rows within
regions then I would expect aggregation to be faster in the coprocessor
(utilize many region servers in parallel) rather than transfer data over
the network from multiple region servers to a single client that would do
the same calculation on its own.
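
For comparison, the client-side variant described above (pull the needed cells over 
the network and aggregate locally) looks roughly like this; just a sketch, with 
made-up table, family and qualifier names:

import java.io.IOException;

import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class ClientSideAggregation {
  // Every matching cell crosses the network and the single client does all the
  // math; a coprocessor would instead return one partial result per region.
  static long sum(Connection conn, byte[] startRow, byte[] stopRow) throws IOException {
    long total = 0;
    Scan scan = new Scan(startRow, stopRow);
    scan.addColumn(Bytes.toBytes("d"), Bytes.toBytes("count"));
    try (Table table = conn.getTable(TableName.valueOf("metrics"));
         ResultScanner rs = table.getScanner(scan)) {
      for (Result r : rs) {
        byte[] v = r.getValue(Bytes.toBytes("d"), Bytes.toBytes("count"));
        if (v != null) {
          total += Bytes.toLong(v);
        }
      }
    }
    return total;
  }
}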


On Thu, Apr 9, 2015 at 4:43 AM, Michael Segel michael_se...@hotmail.com
wrote:

 When you say coprocessor, do you mean HBase coprocessors or do you mean a
 physical hardware coprocessor?

 In terms of queries…

 HBase can perform a single get() and return the result back quickly. (The
 size of the data being returned will impact the overall timing.)

 HBase also caches the results so that your first hit will take the
 longest, but as long as the row is cached, the results are returned quickly.

 If you’re trying to do a scan with a start/stop row set … your timing then
 could vary between sub-second and minutes depending on the query.


  On Apr 8, 2015, at 3:10 PM, Kristoffer Sjögren sto...@gmail.com wrote:
 
  But if the coprocessor is omitted then CPU cycles from region servers are
  lost, so where would the query execution go?
 
   Queries need to be quick (sub-second rather than seconds) and HDFS is
  quite latency hungry, unless there are optimizations that i'm unaware of?
 
 
 
  On Wed, Apr 8, 2015 at 7:43 PM, Michael Segel michael_se...@hotmail.com
 
  wrote:
 
  I think you misunderstood.
 
  The suggestion was to put the data in to HDFS sequence files and to use
  HBase to store an index in to the file. (URL to the file, then offset
 in to
  the file for the start of the record…)
 
  The reason you want to do this is that you’re reading in large amounts
 of
  data and its more efficient to do this from HDFS than through HBase.
 
  On Apr 8, 2015, at 8:41 AM, Kristoffer Sjögren sto...@gmail.com
 wrote:
 
  Yes, I think you're right. Adding one or more dimensions to the rowkey
  would indeed make the table narrower.
 
   And I guess it also makes sense to store actual values (bigger
 qualifiers)
  outside HBase. Keeping them in Hadoop why not? Pulling hot ones out on
  SSD
  caches would be an interesting solution. And quite a bit simpler.
 
  Good call and thanks for the tip! :-)
 
  On Wed, Apr 8, 2015 at 1:45 PM, Michael Segel 
 michael_se...@hotmail.com
 
  wrote:
 
  Ok…
 
  First, I’d suggest you rethink your schema by adding an additional
  dimension.
  You’ll end up with more rows, but a narrower table.
 
  In terms of compaction… if the data is relatively static, you won’t
 have
  compactions because nothing changed.
  But if your data is that static… why not put the data in sequence
 files
  and use HBase as the index. Could be faster.
 
  HTH
 
  -Mike
 
  On Apr 8, 2015, at 3:26 AM, Kristoffer Sjögren sto...@gmail.com
  wrote:
 
  I just read through HBase MOB design document and one thing that
 caught
  my
  attention was the following statement.
 
   When HBase deals with large numbers of values > 100kb and up to ~10MB of
   data, it encounters performance degradations due to write amplification
   caused by splits and compactions.
 
  Is there any chance to run into this problem in the read path for
 data
  that
  is written infrequently and never changed?
 
  On Wed, Apr 8, 2015 at 9:30 AM, Kristoffer Sjögren sto...@gmail.com
 
  wrote:
 
  A small set of qualifiers will be accessed frequently so keeping
 them
  in
  block cache would be very beneficial. Some very seldom. So this
 sounds
  very
  promising!
 
  The reason why i'm considering a coprocessor is that I need to
 provide
  very specific information in the query request. Same thing with the
  response. Queries are also highly parallelizable across rows and
 each
  individual query produce a valid result that may or may not be
  aggregated
  with other results in the client, maybe even inside the region if it
  contained multiple rows targeted by the query.
 
  So it's a bit like Phoenix but with a different storage format and
  query
  engine.
 
  On Wed, Apr 8, 2015 at 12:46 AM, Nick Dimiduk ndimi...@gmail.com
  wrote:
 
  Those rows are written out into HBase blocks on cell boundaries.
 Your
   column family has a BLOCK_SIZE attribute, which you may or may not have
   overridden from the default of 64k. Cells are written into a block until it
   is >= the target block size. So your single 500mb row will be broken down
   into
  thousands of HFile blocks in some number of HFiles. Some of those
  blocks
  may contain just a cell or two and be a couple MB in size, to hold
  the
  largest of your cells. Those blocks will be loaded into the Block
  Cache as
   they're accessed. If you're careful with your access patterns and
 only
  request cells that you need to evaluate, you'll only ever load the
  blocks
  containing those cells into the cache.
 
  Will the entire 

Re: Rowkey design question

2015-04-09 Thread Michael Segel
Andrew, 

In a nutshell, running end user code within the RS JVM is a bad design. 
To be clear, this is not just my opinion… I just happen to be more vocal about 
it. ;-)
We’ve covered this ground before, and just because the code runs doesn’t mean 
it’s good. Or that the design is good.

I would love to see how you can justify HBase as being secure when you have end 
user code running in the same JVM as the RS. 
I can think of several ways to hack HBase security because of this… 

Note: I’m not saying server side extensibility is bad, I’m saying how it was 
implemented was bad. 
Hint: You could have sandboxed the end user code which makes it a lot easier to 
manage.

MapR has avoided this in their MapRDB. They’re adding the extensibility in a 
different manner and this issue is nothing new. 


And yes, you’ve hit the nail on the head. Rethink your design if you want to 
use coprocessors and use them as a last resort. 

 On Apr 9, 2015, at 3:02 PM, Andrew Purtell apurt...@apache.org wrote:
 
 This is one person's opinion, to which he is absolutely entitled, but
 blanket black and white statements like "coprocessors are poorly
 implemented" are obviously not an opinion shared by all those who have used
 them successfully, nor the HBase committers, or we would remove the
 feature. On the other hand, you should really ask yourself if in-server
 extension is necessary. That should be a last resort, really, for the
 security and performance considerations Michael mentions.
 
 
 On Thu, Apr 9, 2015 at 5:05 AM, Michael Segel michael_se...@hotmail.com
 wrote:
 
 Ok…
 Coprocessors are poorly implemented in HBase.
 If you work in a secure environment, outside of the system coprocessors…
 (ones that you load from hbase-site.xml) , you don’t want to use them. (The
 coprocessor code runs on the same JVM as the RS.)  This means that if you
 have a poorly written coprocessor, you will kill performance for all of
 HBase. If you’re not using them in a secure environment, you have to
 consider how they are going to be used.
 
 
 Without really knowing more about your use case, it's impossible to say
 if the coprocessor would be a good idea.
 
 
 It sounds like you may have an unrealistic expectation as to how well
 HBase performs.
 
 HTH
 
 -Mike
 
 On Apr 9, 2015, at 1:05 AM, Kristoffer Sjögren sto...@gmail.com wrote:
 
 An HBase coprocessor. My idea is to move as much pre-aggregation as
 possible to where the data lives in the region servers, instead of doing
 it
 in the client. If there is good data locality inside and across rows
 within
 regions then I would expect aggregation to be faster in the coprocessor
 (utilize many region servers in parallel) rather than transfer data over
 the network from multiple region servers to a single client that would do
 the same calculation on its own.
 
 
 On Thu, Apr 9, 2015 at 4:43 AM, Michael Segel michael_se...@hotmail.com
 
 wrote:
 
 When you say coprocessor, do you mean HBase coprocessors or do you mean
 a
 physical hardware coprocessor?
 
 In terms of queries…
 
 HBase can perform a single get() and return the result back quickly.
 (The
 size of the data being returned will impact the overall timing.)
 
 HBase also caches the results so that your first hit will take the
 longest, but as long as the row is cached, the results are returned
 quickly.
 
 If you’re trying to do a scan with a start/stop row set … your timing
 then
 could vary between sub-second and minutes depending on the query.
 
 
 On Apr 8, 2015, at 3:10 PM, Kristoffer Sjögren sto...@gmail.com
 wrote:
 
 But if the coprocessor is omitted then CPU cycles from region servers
 are
 lost, so where would the query execution go?
 
  Queries need to be quick (sub-second rather than seconds) and HDFS is
 quite latency hungry, unless there are optimizations that i'm unaware
 of?
 
 
 
 On Wed, Apr 8, 2015 at 7:43 PM, Michael Segel 
 michael_se...@hotmail.com
 
 wrote:
 
 I think you misunderstood.
 
 The suggestion was to put the data in to HDFS sequence files and to
 use
 HBase to store an index in to the file. (URL to the file, then offset
 in to
 the file for the start of the record…)
 
 The reason you want to do this is that you’re reading in large amounts
 of
 data and its more efficient to do this from HDFS than through HBase.
 
 On Apr 8, 2015, at 8:41 AM, Kristoffer Sjögren sto...@gmail.com
 wrote:
 
 Yes, I think you're right. Adding one or more dimensions to the
 rowkey
 would indeed make the table narrower.
 
  And I guess it also makes sense to store actual values (bigger
 qualifiers)
 outside HBase. Keeping them in Hadoop why not? Pulling hot ones out
 on
 SSD
 caches would be an interesting solution. And quite a bit simpler.
 
 Good call and thanks for the tip! :-)
 
 On Wed, Apr 8, 2015 at 1:45 PM, Michael Segel 
 michael_se...@hotmail.com
 
 wrote:
 
 Ok…
 
 First, I’d suggest you rethink your schema by adding an additional
 dimension.
 You’ll end up with more rows, but a narrower table.
 
 In terms of 

Re: Rowkey design question

2015-04-09 Thread Andrew Purtell
This is one person's opinion, to which he is absolutely entitled, but
blanket black and white statements like "coprocessors are poorly
implemented" are obviously not an opinion shared by all those who have used
them successfully, nor the HBase committers, or we would remove the
feature. On the other hand, you should really ask yourself if in-server
extension is necessary. That should be a last resort, really, for the
security and performance considerations Michael mentions.


On Thu, Apr 9, 2015 at 5:05 AM, Michael Segel michael_se...@hotmail.com
wrote:

 Ok…
 Coprocessors are poorly implemented in HBase.
 If you work in a secure environment, outside of the system coprocessors…
 (ones that you load from hbase-site.xml) , you don’t want to use them. (The
 coprocessor code runs on the same JVM as the RS.)  This means that if you
 have a poorly written coprocessor, you will kill performance for all of
 HBase. If you’re not using them in a secure environment, you have to
 consider how they are going to be used.


  Without really knowing more about your use case, it's impossible to say
  if the coprocessor would be a good idea.


 It sounds like you may have an unrealistic expectation as to how well
 HBase performs.

 HTH

 -Mike

  On Apr 9, 2015, at 1:05 AM, Kristoffer Sjögren sto...@gmail.com wrote:
 
  An HBase coprocessor. My idea is to move as much pre-aggregation as
  possible to where the data lives in the region servers, instead of doing
 it
  in the client. If there is good data locality inside and across rows
 within
  regions then I would expect aggregation to be faster in the coprocessor
  (utilize many region servers in parallel) rather than transfer data over
  the network from multiple region servers to a single client that would do
  the same calculation on its own.
 
 
  On Thu, Apr 9, 2015 at 4:43 AM, Michael Segel michael_se...@hotmail.com
 
  wrote:
 
  When you say coprocessor, do you mean HBase coprocessors or do you mean
 a
  physical hardware coprocessor?
 
  In terms of queries…
 
  HBase can perform a single get() and return the result back quickly.
 (The
  size of the data being returned will impact the overall timing.)
 
  HBase also caches the results so that your first hit will take the
  longest, but as long as the row is cached, the results are returned
 quickly.
 
  If you’re trying to do a scan with a start/stop row set … your timing
 then
  could vary between sub-second and minutes depending on the query.
 
 
  On Apr 8, 2015, at 3:10 PM, Kristoffer Sjögren sto...@gmail.com
 wrote:
 
  But if the coprocessor is omitted then CPU cycles from region servers
 are
  lost, so where would the query execution go?
 
   Queries need to be quick (sub-second rather than seconds) and HDFS is
  quite latency hungry, unless there are optimizations that i'm unaware
 of?
 
 
 
  On Wed, Apr 8, 2015 at 7:43 PM, Michael Segel 
 michael_se...@hotmail.com
 
  wrote:
 
  I think you misunderstood.
 
  The suggestion was to put the data in to HDFS sequence files and to
 use
  HBase to store an index in to the file. (URL to the file, then offset
  in to
  the file for the start of the record…)
 
  The reason you want to do this is that you’re reading in large amounts
  of
  data and its more efficient to do this from HDFS than through HBase.
 
  On Apr 8, 2015, at 8:41 AM, Kristoffer Sjögren sto...@gmail.com
  wrote:
 
  Yes, I think you're right. Adding one or more dimensions to the
 rowkey
  would indeed make the table narrower.
 
   And I guess it also makes sense to store actual values (bigger
  qualifiers)
  outside HBase. Keeping them in Hadoop why not? Pulling hot ones out
 on
  SSD
  caches would be an interesting solution. And quite a bit simpler.
 
  Good call and thanks for the tip! :-)
 
  On Wed, Apr 8, 2015 at 1:45 PM, Michael Segel 
  michael_se...@hotmail.com
 
  wrote:
 
  Ok…
 
  First, I’d suggest you rethink your schema by adding an additional
  dimension.
  You’ll end up with more rows, but a narrower table.
 
  In terms of compaction… if the data is relatively static, you won’t
  have
  compactions because nothing changed.
  But if your data is that static… why not put the data in sequence
  files
  and use HBase as the index. Could be faster.
 
  HTH
 
  -Mike
 
  On Apr 8, 2015, at 3:26 AM, Kristoffer Sjögren sto...@gmail.com
  wrote:
 
  I just read through HBase MOB design document and one thing that
  caught
  my
  attention was the following statement.
 
   When HBase deals with large numbers of values > 100kb and up to ~10MB of
   data, it encounters performance degradations due to write amplification
   caused by splits and compactions.
 
  Is there any chance to run into this problem in the read path for
  data
  that
  is written infrequently and never changed?
 
  On Wed, Apr 8, 2015 at 9:30 AM, Kristoffer Sjögren 
 sto...@gmail.com
 
  wrote:
 
  A small set of qualifiers will be accessed frequently so keeping
  them
  in
  block cache would be very beneficial. 

Re: Rowkey design question

2015-04-08 Thread Michael Segel
When you say coprocessor, do you mean HBase coprocessors or do you mean a 
physical hardware coprocessor? 

In terms of queries… 

HBase can perform a single get() and return the result back quickly. (The size 
of the data being returned will impact the overall timing.) 

HBase also caches the results so that your first hit will take the longest, but 
as long as the row is cached, the results are returned quickly. 

If you’re trying to do a scan with a start/stop row set … your timing then 
could vary between sub-second and minutes depending on the query. 
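
To make the two access patterns concrete, roughly (1.x client API; table, family 
and qualifier names are placeholders):

import java.io.IOException;

import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class AccessPatterns {

  // Point lookup: a single get() on a known row, asking only for the
  // qualifiers you need; returns quickly, and faster still once the
  // underlying blocks sit in the block cache.
  static Result pointLookup(Table table, byte[] row) throws IOException {
    Get get = new Get(row);
    get.addColumn(Bytes.toBytes("meta"), Bytes.toBytes("q1"));
    return table.get(get);
  }

  // Range scan: bounded by start/stop row; timing depends on how much data
  // actually falls inside the range.
  static void rangeScan(Table table, byte[] start, byte[] stop) throws IOException {
    Scan scan = new Scan(start, stop);
    try (ResultScanner rs = table.getScanner(scan)) {
      for (Result r : rs) {
        // process each row
      }
    }
  }
}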


 On Apr 8, 2015, at 3:10 PM, Kristoffer Sjögren sto...@gmail.com wrote:
 
 But if the coprocessor is omitted then CPU cycles from region servers are
 lost, so where would the query execution go?
 
  Queries need to be quick (sub-second rather than seconds) and HDFS is
 quite latency hungry, unless there are optimizations that i'm unaware of?
 
 
 
 On Wed, Apr 8, 2015 at 7:43 PM, Michael Segel michael_se...@hotmail.com
 wrote:
 
 I think you misunderstood.
 
 The suggestion was to put the data in to HDFS sequence files and to use
 HBase to store an index in to the file. (URL to the file, then offset in to
 the file for the start of the record…)
 
 The reason you want to do this is that you’re reading in large amounts of
 data and its more efficient to do this from HDFS than through HBase.
 
 On Apr 8, 2015, at 8:41 AM, Kristoffer Sjögren sto...@gmail.com wrote:
 
 Yes, I think you're right. Adding one or more dimensions to the rowkey
 would indeed make the table narrower.
 
  And I guess it also makes sense to store actual values (bigger qualifiers)
 outside HBase. Keeping them in Hadoop why not? Pulling hot ones out on
 SSD
 caches would be an interesting solution. And quite a bit simpler.
 
 Good call and thanks for the tip! :-)
 
 On Wed, Apr 8, 2015 at 1:45 PM, Michael Segel michael_se...@hotmail.com
 
 wrote:
 
 Ok…
 
 First, I’d suggest you rethink your schema by adding an additional
 dimension.
 You’ll end up with more rows, but a narrower table.
 
 In terms of compaction… if the data is relatively static, you won’t have
 compactions because nothing changed.
 But if your data is that static… why not put the data in sequence files
 and use HBase as the index. Could be faster.
 
 HTH
 
 -Mike
 
 On Apr 8, 2015, at 3:26 AM, Kristoffer Sjögren sto...@gmail.com
 wrote:
 
 I just read through HBase MOB design document and one thing that caught
 my
 attention was the following statement.
 
  When HBase deals with large numbers of values > 100kb and up to ~10MB of
 data, it encounters performance degradations due to write amplification
 caused by splits and compactions.
 
 Is there any chance to run into this problem in the read path for data
 that
 is written infrequently and never changed?
 
 On Wed, Apr 8, 2015 at 9:30 AM, Kristoffer Sjögren sto...@gmail.com
 wrote:
 
 A small set of qualifiers will be accessed frequently so keeping them
 in
 block cache would be very beneficial. Some very seldom. So this sounds
 very
 promising!
 
 The reason why i'm considering a coprocessor is that I need to provide
 very specific information in the query request. Same thing with the
 response. Queries are also highly parallelizable across rows and each
 individual query produce a valid result that may or may not be
 aggregated
 with other results in the client, maybe even inside the region if it
 contained multiple rows targeted by the query.
 
 So it's a bit like Phoenix but with a different storage format and
 query
 engine.
 
 On Wed, Apr 8, 2015 at 12:46 AM, Nick Dimiduk ndimi...@gmail.com
 wrote:
 
 Those rows are written out into HBase blocks on cell boundaries. Your
  column family has a BLOCK_SIZE attribute, which you may or may not have
  overridden from the default of 64k. Cells are written into a block until it
  is >= the target block size. So your single 500mb row will be broken down
  into
 thousands of HFile blocks in some number of HFiles. Some of those
 blocks
 may contain just a cell or two and be a couple MB in size, to hold
 the
 largest of your cells. Those blocks will be loaded into the Block
 Cache as
  they're accessed. If you're careful with your access patterns and only
 request cells that you need to evaluate, you'll only ever load the
 blocks
 containing those cells into the cache.
 
 Will the entire row be loaded or only the qualifiers I ask for?
 
 So then, the answer to your question is: it depends on how you're
 interacting with the row from your coprocessor. The read path will
 only
  load blocks that your scanner requests. If your coprocessor is producing a
  scanner that seeks to specific qualifiers, you'll only load those
  blocks.
 
 Related question: Is there a reason you're using a coprocessor
 instead
 of
 a
 regular filter, or a simple qualified get/scan to access data from
 these
 rows? The default stuff is already tuned to load data sparsely, as
 would
 be desirable for your schema.
 
 -n
 
 On Tue, Apr 7, 2015 at 2:22 PM, Kristoffer 

Re: Rowkey design question

2015-04-08 Thread Michael Segel
Ok… 

First, I’d suggest you rethink your schema by adding an additional dimension. 
You’ll end up with more rows, but a narrower table. 

In terms of compaction… if the data is relatively static, you won’t have 
compactions because nothing changed. 
But if your data is that static… why not put the data in sequence files and use 
HBase as the index. Could be faster. 
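
The index idea, sketched (the sequence-file key/value classes, table and column 
names below are assumptions, and the stored offset must point at a record boundary):

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SequenceFileIndex {

  // The HBase row holds only a pointer: HDFS file path + byte offset.
  static byte[] fetch(Configuration conf, Table index, byte[] rowKey) throws IOException {
    Result r = index.get(new Get(rowKey));
    String file = Bytes.toString(r.getValue(Bytes.toBytes("idx"), Bytes.toBytes("file")));
    long offset = Bytes.toLong(r.getValue(Bytes.toBytes("idx"), Bytes.toBytes("offset")));

    // The bulk data itself is read straight from the sequence file in HDFS.
    try (SequenceFile.Reader reader =
             new SequenceFile.Reader(conf, SequenceFile.Reader.file(new Path(file)))) {
      reader.seek(offset);
      Text key = new Text();
      BytesWritable value = new BytesWritable();
      reader.next(key, value);
      return value.copyBytes();
    }
  }
}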

HTH 

-Mike

 On Apr 8, 2015, at 3:26 AM, Kristoffer Sjögren sto...@gmail.com wrote:
 
 I just read through HBase MOB design document and one thing that caught my
 attention was the following statement.
 
 When HBase deals with large numbers of values > 100kb and up to ~10MB of
 data, it encounters performance degradations due to write amplification
 caused by splits and compactions.
 
 Is there any chance to run into this problem in the read path for data that
 is written infrequently and never changed?
 
 On Wed, Apr 8, 2015 at 9:30 AM, Kristoffer Sjögren sto...@gmail.com wrote:
 
 A small set of qualifiers will be accessed frequently so keeping them in
 block cache would be very beneficial. Some very seldom. So this sounds very
 promising!
 
 The reason why i'm considering a coprocessor is that I need to provide
 very specific information in the query request. Same thing with the
 response. Queries are also highly parallelizable across rows and each
 individual query produce a valid result that may or may not be aggregated
 with other results in the client, maybe even inside the region if it
 contained multiple rows targeted by the query.
 
 So it's a bit like Phoenix but with a different storage format and query
 engine.
 
 On Wed, Apr 8, 2015 at 12:46 AM, Nick Dimiduk ndimi...@gmail.com wrote:
 
 Those rows are written out into HBase blocks on cell boundaries. Your
  column family has a BLOCK_SIZE attribute, which you may or may not have
  overridden from the default of 64k. Cells are written into a block until it is
  >= the target block size. So your single 500mb row will be broken down
 into
 thousands of HFile blocks in some number of HFiles. Some of those blocks
 may contain just a cell or two and be a couple MB in size, to hold the
 largest of your cells. Those blocks will be loaded into the Block Cache as
  they're accessed. If you're careful with your access patterns and only
 request cells that you need to evaluate, you'll only ever load the blocks
 containing those cells into the cache.
 
 Will the entire row be loaded or only the qualifiers I ask for?
 
 So then, the answer to your question is: it depends on how you're
 interacting with the row from your coprocessor. The read path will only
  load blocks that your scanner requests. If your coprocessor is producing a
  scanner that seeks to specific qualifiers, you'll only load those
 blocks.
 
 Related question: Is there a reason you're using a coprocessor instead of
 a
 regular filter, or a simple qualified get/scan to access data from these
 rows? The default stuff is already tuned to load data sparsely, as would
 be desirable for your schema.
 
 -n
 
 On Tue, Apr 7, 2015 at 2:22 PM, Kristoffer Sjögren sto...@gmail.com
 wrote:
 
 Sorry I should have explained my use case a bit more.
 
 Yes, it's a pretty big row and it's close to worst case. Normally
 there
 would be fewer qualifiers and the largest qualifiers would be smaller.
 
  The reason why these rows get big is because they store aggregated data
  in indexed compressed form. This format allows for extremely fast queries
  (on local disk format) over billions of rows (not rows in HBase speak),
  when touching smaller areas of the data. If I would store the data as regular
  HBase rows, things would get very slow unless I had many, many region
  servers.
 
 The coprocessor is used for doing custom queries on the indexed data
 inside
 the region servers. These queries are not like a regular row scan, but
 very
  specific as to how the data is formatted within each column qualifier.
 
 Yes, this is not possible if HBase loads the whole 500MB each time i
 want
 to perform this custom query on a row. Hence my question :-)
 
 
 
 
 On Tue, Apr 7, 2015 at 11:03 PM, Michael Segel 
 michael_se...@hotmail.com
 wrote:
 
 Sorry, but your initial problem statement doesn’t seem to parse …
 
  Are you saying that you have a single row with approximately 100,000
 elements
 where each element is roughly 1-5KB in size and in addition there are
 ~5
 elements which will be between one and five MB in size?
 
 And you then mention a coprocessor?
 
 Just looking at the numbers… 100K * 5KB means that each row would end
 up
 being 500MB in size.
 
 That’s a pretty fat row.
 
 I would suggest rethinking your strategy.
 
 On Apr 7, 2015, at 11:13 AM, Kristoffer Sjögren sto...@gmail.com
 wrote:
 
 Hi
 
 I have a row with around 100.000 qualifiers with mostly small values
 around
  1-5KB and maybe 5 larger ones around 1-5 MB. A coprocessor does random
  access of 1-10 qualifiers per row.
 
 I would like to understand how HBase loads the data into memory.
 Will
 the
 entire row 

Re: Rowkey design question

2015-04-08 Thread Kristoffer Sjögren
I just read through the HBase MOB design document and one thing that caught my
attention was the following statement.

When HBase deals with large numbers of values > 100kb and up to ~10MB of
data, it encounters performance degradations due to write amplification
caused by splits and compactions.

Is there any chance to run into this problem in the read path for data that
is written infrequently and never changed?

On Wed, Apr 8, 2015 at 9:30 AM, Kristoffer Sjögren sto...@gmail.com wrote:

 A small set of qualifiers will be accessed frequently, so keeping them in
 block cache would be very beneficial. Some only very seldom. So this sounds very
 promising!

 The reason why I'm considering a coprocessor is that I need to provide
 very specific information in the query request. Same thing with the
 response. Queries are also highly parallelizable across rows and each
 individual query produce a valid result that may or may not be aggregated
 with other results in the client, maybe even inside the region if it
 contained multiple rows targeted by the query.

 So it's a bit like Phoenix but with a different storage format and query
 engine.

 On Wed, Apr 8, 2015 at 12:46 AM, Nick Dimiduk ndimi...@gmail.com wrote:

 Those rows are written out into HBase blocks on cell boundaries. Your
 column family has a BLOCK_SIZE attribute, which you may or may not have
 overridden from the default of 64k. Cells are written into a block until it is
 >= the target block size. So your single 500MB row will be broken down into
 thousands of HFile blocks in some number of HFiles. Some of those blocks
 may contain just a cell or two and be a couple MB in size, to hold the
 largest of your cells. Those blocks will be loaded into the Block Cache as
 they're accessed. If you're careful with your access patterns and only
 request cells that you need to evaluate, you'll only ever load the blocks
 containing those cells into the cache.

  Will the entire row be loaded or only the qualifiers I ask for?

 So then, the answer to your question is: it depends on how you're
 interacting with the row from your coprocessor. The read path will only
 load blocks that your scanner requests. If your coprocessor is producing a
 scanner that seeks to specific qualifiers, you'll only load those
 blocks.

 Related question: Is there a reason you're using a coprocessor instead of
 a
 regular filter, or a simple qualified get/scan to access data from these
 rows? The default stuff is already tuned to load data sparsely, as would
 be desirable for your schema.

 -n

 On Tue, Apr 7, 2015 at 2:22 PM, Kristoffer Sjögren sto...@gmail.com
 wrote:

  Sorry I should have explained my use case a bit more.
 
  Yes, it's a pretty big row and it's close to worst case. Normally
 there
  would be fewer qualifiers and the largest qualifiers would be smaller.
 
  The reason why these rows gets big is because they stores aggregated
 data
  in indexed compressed form. This format allow for extremely fast queries
  (on local disk format) over billions of rows (not rows in HBase speak),
  when touching smaller areas of the data. If would store the data as
 regular
  HBase rows things would get very slow unless I had many many region
  servers.
 
  The coprocessor is used for doing custom queries on the indexed data
 inside
  the region servers. These queries are not like a regular row scan, but
 very
  specific as to how the data is formatted withing each column qualifier.
 
  Yes, this is not possible if HBase loads the whole 500MB each time i
 want
  to perform this custom query on a row. Hence my question :-)
 
 
 
 
  On Tue, Apr 7, 2015 at 11:03 PM, Michael Segel 
 michael_se...@hotmail.com
  wrote:
 
   Sorry, but your initial problem statement doesn’t seem to parse …
  
   Are you saying that you a single row with approximately 100,000
 elements
   where each element is roughly 1-5KB in size and in addition there are
 ~5
   elements which will be between one and five MB in size?
  
   And you then mention a coprocessor?
  
   Just looking at the numbers… 100K * 5KB means that each row would end
 up
   being 500MB in size.
  
   That’s a pretty fat row.
  
   I would suggest rethinking your strategy.
  
On Apr 7, 2015, at 11:13 AM, Kristoffer Sjögren sto...@gmail.com
   wrote:
   
Hi
   
I have a row with around 100.000 qualifiers with mostly small values
   around
1-5KB and maybe 5 largers ones around 1-5 MB. A coprocessor do
 random
access of 1-10 qualifiers per row.
   
I would like to understand how HBase loads the data into memory.
 Will
  the
entire row be loaded or only the qualifiers I ask for (like pointer
   access
into a direct ByteBuffer) ?
   
Cheers,
-Kristoffer
  
   The opinions expressed here are mine, while they may reflect a
 cognitive
   thought, that is purely accidental.
   Use at your own risk.
   Michael Segel
   michael_segel (AT) hotmail.com
  
  
  
  
  
  
 





Re: Rowkey design question

2015-04-08 Thread Kristoffer Sjögren
A small set of qualifiers will be accessed frequently so keeping them in
block cache would be very beneficial. Some very seldom. So this sounds very
promising!

The reason why i'm considering a coprocessor is that I need to provide very
specific information in the query request. Same thing with the response.
Queries are also highly parallelizable across rows and each individual
query produce a valid result that may or may not be aggregated with other
results in the client, maybe even inside the region if it contained
multiple rows targeted by the query.

So it's a bit like Phoenix but with a different storage format and query
engine.

On Wed, Apr 8, 2015 at 12:46 AM, Nick Dimiduk ndimi...@gmail.com wrote:

 Those rows are written out into HBase blocks on cell boundaries. Your
 column family has a BLOCK_SIZE attribute, which you may or may have no
 overridden the default of 64k. Cells are written into a block until is it
 = the target block size. So your single 500mb row will be broken down into
 thousands of HFile blocks in some number of HFiles. Some of those blocks
 may contain just a cell or two and be a couple MB in size, to hold the
 largest of your cells. Those blocks will be loaded into the Block Cache as
 they're accessed. If your careful with your access patterns and only
 request cells that you need to evaluate, you'll only ever load the blocks
 containing those cells into the cache.

  Will the entire row be loaded or only the qualifiers I ask for?

 So then, the answer to your question is: it depends on how you're
 interacting with the row from your coprocessor. The read path will only
 load blocks that your scanner requests. If your coprocessor is producing
 scanner with to seek to specific qualifiers, you'll only load those blocks.

 Related question: Is there a reason you're using a coprocessor instead of a
 regular filter, or a simple qualified get/scan to access data from these
 rows? The default stuff is already tuned to load data sparsely, as would
 be desirable for your schema.

 -n

 On Tue, Apr 7, 2015 at 2:22 PM, Kristoffer Sjögren sto...@gmail.com
 wrote:

  Sorry I should have explained my use case a bit more.
 
  Yes, it's a pretty big row and it's close to worst case. Normally there
  would be fewer qualifiers and the largest qualifiers would be smaller.
 
  The reason why these rows gets big is because they stores aggregated data
  in indexed compressed form. This format allow for extremely fast queries
  (on local disk format) over billions of rows (not rows in HBase speak),
  when touching smaller areas of the data. If would store the data as
 regular
  HBase rows things would get very slow unless I had many many region
  servers.
 
  The coprocessor is used for doing custom queries on the indexed data
 inside
  the region servers. These queries are not like a regular row scan, but
 very
  specific as to how the data is formatted withing each column qualifier.
 
  Yes, this is not possible if HBase loads the whole 500MB each time i want
  to perform this custom query on a row. Hence my question :-)
 
 
 
 
  On Tue, Apr 7, 2015 at 11:03 PM, Michael Segel 
 michael_se...@hotmail.com
  wrote:
 
   Sorry, but your initial problem statement doesn’t seem to parse …
  
   Are you saying that you a single row with approximately 100,000
 elements
   where each element is roughly 1-5KB in size and in addition there are
 ~5
   elements which will be between one and five MB in size?
  
   And you then mention a coprocessor?
  
   Just looking at the numbers… 100K * 5KB means that each row would end
 up
   being 500MB in size.
  
   That’s a pretty fat row.
  
   I would suggest rethinking your strategy.
  
On Apr 7, 2015, at 11:13 AM, Kristoffer Sjögren sto...@gmail.com
   wrote:
   
Hi
   
I have a row with around 100.000 qualifiers with mostly small values
   around
1-5KB and maybe 5 largers ones around 1-5 MB. A coprocessor do random
access of 1-10 qualifiers per row.
   
I would like to understand how HBase loads the data into memory. Will
  the
entire row be loaded or only the qualifiers I ask for (like pointer
   access
into a direct ByteBuffer) ?
   
Cheers,
-Kristoffer
  
   The opinions expressed here are mine, while they may reflect a
 cognitive
   thought, that is purely accidental.
   Use at your own risk.
   Michael Segel
   michael_segel (AT) hotmail.com
  
  
  
  
  
  
 



Re: Rowkey design question

2015-04-08 Thread Kristoffer Sjögren
Yes, I think you're right. Adding one or more dimensions to the rowkey
would indeed make the table narrower.

And I guess it also make sense to store actual values (bigger qualifiers)
outside HBase. Keeping them in Hadoop why not? Pulling hot ones out on SSD
caches would be an interesting solution. And quite a bit simpler.

Good call and thanks for the tip! :-)

On Wed, Apr 8, 2015 at 1:45 PM, Michael Segel michael_se...@hotmail.com
wrote:

 Ok…

 First, I’d suggest you rethink your schema by adding an additional
 dimension.
 You’ll end up with more rows, but a narrower table.

 In terms of compaction… if the data is relatively static, you won’t have
 compactions because nothing changed.
 But if your data is that static… why not put the data in sequence files
 and use HBase as the index. Could be faster.

 HTH

 -Mike

  On Apr 8, 2015, at 3:26 AM, Kristoffer Sjögren sto...@gmail.com wrote:
 
  I just read through HBase MOB design document and one thing that caught
 my
  attention was the following statement.
 
  When HBase deals with large numbers of values  100kb and up to ~10MB of
  data, it encounters performance degradations due to write amplification
  caused by splits and compactions.
 
  Is there any chance to run into this problem in the read path for data
 that
  is written infrequently and never changed?
 
  On Wed, Apr 8, 2015 at 9:30 AM, Kristoffer Sjögren sto...@gmail.com
 wrote:
 
  A small set of qualifiers will be accessed frequently so keeping them in
  block cache would be very beneficial. Some very seldom. So this sounds
 very
  promising!
 
  The reason why i'm considering a coprocessor is that I need to provide
  very specific information in the query request. Same thing with the
  response. Queries are also highly parallelizable across rows and each
  individual query produce a valid result that may or may not be
 aggregated
  with other results in the client, maybe even inside the region if it
  contained multiple rows targeted by the query.
 
  So it's a bit like Phoenix but with a different storage format and query
  engine.
 
  On Wed, Apr 8, 2015 at 12:46 AM, Nick Dimiduk ndimi...@gmail.com
 wrote:
 
  Those rows are written out into HBase blocks on cell boundaries. Your
  column family has a BLOCK_SIZE attribute, which you may or may have no
  overridden the default of 64k. Cells are written into a block until is
 it
  = the target block size. So your single 500mb row will be broken down
  into
  thousands of HFile blocks in some number of HFiles. Some of those
 blocks
  may contain just a cell or two and be a couple MB in size, to hold the
  largest of your cells. Those blocks will be loaded into the Block
 Cache as
  they're accessed. If your careful with your access patterns and only
  request cells that you need to evaluate, you'll only ever load the
 blocks
  containing those cells into the cache.
 
  Will the entire row be loaded or only the qualifiers I ask for?
 
  So then, the answer to your question is: it depends on how you're
  interacting with the row from your coprocessor. The read path will only
  load blocks that your scanner requests. If your coprocessor is
 producing
  scanner with to seek to specific qualifiers, you'll only load those
  blocks.
 
  Related question: Is there a reason you're using a coprocessor instead
 of
  a
  regular filter, or a simple qualified get/scan to access data from
 these
  rows? The default stuff is already tuned to load data sparsely, as
 would
  be desirable for your schema.
 
  -n
 
  On Tue, Apr 7, 2015 at 2:22 PM, Kristoffer Sjögren sto...@gmail.com
  wrote:
 
  Sorry I should have explained my use case a bit more.
 
  Yes, it's a pretty big row and it's close to worst case. Normally
  there
  would be fewer qualifiers and the largest qualifiers would be smaller.
 
  The reason why these rows gets big is because they stores aggregated
  data
  in indexed compressed form. This format allow for extremely fast
 queries
  (on local disk format) over billions of rows (not rows in HBase
 speak),
  when touching smaller areas of the data. If would store the data as
  regular
  HBase rows things would get very slow unless I had many many region
  servers.
 
  The coprocessor is used for doing custom queries on the indexed data
  inside
  the region servers. These queries are not like a regular row scan, but
  very
  specific as to how the data is formatted withing each column
 qualifier.
 
  Yes, this is not possible if HBase loads the whole 500MB each time i
  want
  to perform this custom query on a row. Hence my question :-)
 
 
 
 
  On Tue, Apr 7, 2015 at 11:03 PM, Michael Segel 
  michael_se...@hotmail.com
  wrote:
 
  Sorry, but your initial problem statement doesn’t seem to parse …
 
  Are you saying that you a single row with approximately 100,000
  elements
  where each element is roughly 1-5KB in size and in addition there are
  ~5
  elements which will be between one and five MB in size?
 
  And you then mention a 

Re: Rowkey design question

2015-04-08 Thread Michael Segel
I think you misunderstood. 

The suggestion was to put the data into HDFS sequence files and to use HBase 
to store an index into the file (URL to the file, then offset into the file 
for the start of the record…). 

The reason you want to do this is that you're reading in large amounts of data 
and it's more efficient to do this from HDFS than through HBase. 
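
A rough sketch of that index-plus-sequence-file pattern, with everything
hedged: the index table name ("ValueIndex"), family ("idx"), qualifiers
("path", "offset") and the Text/BytesWritable record types are all made-up
choices, and the stored offset is assumed to be a record boundary captured at
write time (e.g. SequenceFile.Writer.getLength() just before appending).

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SequenceFileByReference {
  // Look up the HDFS location of a large value in HBase, then read it from HDFS.
  public static byte[] readLargeValue(Connection conn, Configuration conf,
                                      byte[] indexRowKey) throws Exception {
    byte[] cf = Bytes.toBytes("idx");
    Path path;
    long offset;
    try (Table index = conn.getTable(TableName.valueOf("ValueIndex"))) {
      Get get = new Get(indexRowKey);
      get.addColumn(cf, Bytes.toBytes("path"));
      get.addColumn(cf, Bytes.toBytes("offset"));
      Result r = index.get(get);
      path = new Path(Bytes.toString(r.getValue(cf, Bytes.toBytes("path"))));
      offset = Bytes.toLong(r.getValue(cf, Bytes.toBytes("offset")));
    }
    FileSystem fs = FileSystem.get(conf);
    try (SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf)) {
      reader.seek(offset);  // offset must be a position recorded at write time
      Text key = new Text();
      BytesWritable value = new BytesWritable();
      reader.next(key, value);
      return value.copyBytes();
    }
  }
}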

 On Apr 8, 2015, at 8:41 AM, Kristoffer Sjögren sto...@gmail.com wrote:
 
 Yes, I think you're right. Adding one or more dimensions to the rowkey
 would indeed make the table narrower.
 
 And I guess it also make sense to store actual values (bigger qualifiers)
 outside HBase. Keeping them in Hadoop why not? Pulling hot ones out on SSD
 caches would be an interesting solution. And quite a bit simpler.
 
 Good call and thanks for the tip! :-)
 
 On Wed, Apr 8, 2015 at 1:45 PM, Michael Segel michael_se...@hotmail.com
 wrote:
 
 Ok…
 
 First, I’d suggest you rethink your schema by adding an additional
 dimension.
 You’ll end up with more rows, but a narrower table.
 
 In terms of compaction… if the data is relatively static, you won’t have
 compactions because nothing changed.
 But if your data is that static… why not put the data in sequence files
 and use HBase as the index. Could be faster.
 
 HTH
 
 -Mike
 
 On Apr 8, 2015, at 3:26 AM, Kristoffer Sjögren sto...@gmail.com wrote:
 
 I just read through HBase MOB design document and one thing that caught
 my
 attention was the following statement.
 
 When HBase deals with large numbers of values  100kb and up to ~10MB of
 data, it encounters performance degradations due to write amplification
 caused by splits and compactions.
 
 Is there any chance to run into this problem in the read path for data
 that
 is written infrequently and never changed?
 
 On Wed, Apr 8, 2015 at 9:30 AM, Kristoffer Sjögren sto...@gmail.com
 wrote:
 
 A small set of qualifiers will be accessed frequently so keeping them in
 block cache would be very beneficial. Some very seldom. So this sounds
 very
 promising!
 
 The reason why i'm considering a coprocessor is that I need to provide
 very specific information in the query request. Same thing with the
 response. Queries are also highly parallelizable across rows and each
 individual query produce a valid result that may or may not be
 aggregated
 with other results in the client, maybe even inside the region if it
 contained multiple rows targeted by the query.
 
 So it's a bit like Phoenix but with a different storage format and query
 engine.
 
 On Wed, Apr 8, 2015 at 12:46 AM, Nick Dimiduk ndimi...@gmail.com
 wrote:
 
 Those rows are written out into HBase blocks on cell boundaries. Your
 column family has a BLOCK_SIZE attribute, which you may or may have no
 overridden the default of 64k. Cells are written into a block until is
 it
 = the target block size. So your single 500mb row will be broken down
 into
 thousands of HFile blocks in some number of HFiles. Some of those
 blocks
 may contain just a cell or two and be a couple MB in size, to hold the
 largest of your cells. Those blocks will be loaded into the Block
 Cache as
 they're accessed. If your careful with your access patterns and only
 request cells that you need to evaluate, you'll only ever load the
 blocks
 containing those cells into the cache.
 
 Will the entire row be loaded or only the qualifiers I ask for?
 
 So then, the answer to your question is: it depends on how you're
 interacting with the row from your coprocessor. The read path will only
 load blocks that your scanner requests. If your coprocessor is
 producing
 scanner with to seek to specific qualifiers, you'll only load those
 blocks.
 
 Related question: Is there a reason you're using a coprocessor instead
 of
 a
 regular filter, or a simple qualified get/scan to access data from
 these
 rows? The default stuff is already tuned to load data sparsely, as
 would
 be desirable for your schema.
 
 -n
 
 On Tue, Apr 7, 2015 at 2:22 PM, Kristoffer Sjögren sto...@gmail.com
 wrote:
 
 Sorry I should have explained my use case a bit more.
 
 Yes, it's a pretty big row and it's close to worst case. Normally
 there
 would be fewer qualifiers and the largest qualifiers would be smaller.
 
 The reason why these rows gets big is because they stores aggregated
 data
 in indexed compressed form. This format allow for extremely fast
 queries
 (on local disk format) over billions of rows (not rows in HBase
 speak),
 when touching smaller areas of the data. If would store the data as
 regular
 HBase rows things would get very slow unless I had many many region
 servers.
 
 The coprocessor is used for doing custom queries on the indexed data
 inside
 the region servers. These queries are not like a regular row scan, but
 very
 specific as to how the data is formatted withing each column
 qualifier.
 
 Yes, this is not possible if HBase loads the whole 500MB each time i
 want
 to perform this custom query on a row. Hence my question :-)
 
 
 
 
 On Tue, Apr 

Re: Rowkey design question

2015-04-08 Thread Kristoffer Sjögren
But if the coprocessor is omitted, then the CPU cycles of the region servers
go unused, so where would the query execution go?

Queries need to be quick (sub-second rather than seconds) and HDFS is
quite latency-hungry, unless there are optimizations that I'm unaware of?



On Wed, Apr 8, 2015 at 7:43 PM, Michael Segel michael_se...@hotmail.com
wrote:

 I think you misunderstood.

 The suggestion was to put the data in to HDFS sequence files and to use
 HBase to store an index in to the file. (URL to the file, then offset in to
 the file for the start of the record…)

 The reason you want to do this is that you’re reading in large amounts of
 data and its more efficient to do this from HDFS than through HBase.

  On Apr 8, 2015, at 8:41 AM, Kristoffer Sjögren sto...@gmail.com wrote:
 
  Yes, I think you're right. Adding one or more dimensions to the rowkey
  would indeed make the table narrower.
 
  And I guess it also make sense to store actual values (bigger qualifiers)
  outside HBase. Keeping them in Hadoop why not? Pulling hot ones out on
 SSD
  caches would be an interesting solution. And quite a bit simpler.
 
  Good call and thanks for the tip! :-)
 
  On Wed, Apr 8, 2015 at 1:45 PM, Michael Segel michael_se...@hotmail.com
 
  wrote:
 
  Ok…
 
  First, I’d suggest you rethink your schema by adding an additional
  dimension.
  You’ll end up with more rows, but a narrower table.
 
  In terms of compaction… if the data is relatively static, you won’t have
  compactions because nothing changed.
  But if your data is that static… why not put the data in sequence files
  and use HBase as the index. Could be faster.
 
  HTH
 
  -Mike
 
  On Apr 8, 2015, at 3:26 AM, Kristoffer Sjögren sto...@gmail.com
 wrote:
 
  I just read through HBase MOB design document and one thing that caught
  my
  attention was the following statement.
 
  When HBase deals with large numbers of values  100kb and up to ~10MB
 of
  data, it encounters performance degradations due to write amplification
  caused by splits and compactions.
 
  Is there any chance to run into this problem in the read path for data
  that
  is written infrequently and never changed?
 
  On Wed, Apr 8, 2015 at 9:30 AM, Kristoffer Sjögren sto...@gmail.com
  wrote:
 
  A small set of qualifiers will be accessed frequently so keeping them
 in
  block cache would be very beneficial. Some very seldom. So this sounds
  very
  promising!
 
  The reason why i'm considering a coprocessor is that I need to provide
  very specific information in the query request. Same thing with the
  response. Queries are also highly parallelizable across rows and each
  individual query produce a valid result that may or may not be
  aggregated
  with other results in the client, maybe even inside the region if it
  contained multiple rows targeted by the query.
 
  So it's a bit like Phoenix but with a different storage format and
 query
  engine.
 
  On Wed, Apr 8, 2015 at 12:46 AM, Nick Dimiduk ndimi...@gmail.com
  wrote:
 
  Those rows are written out into HBase blocks on cell boundaries. Your
  column family has a BLOCK_SIZE attribute, which you may or may have
 no
  overridden the default of 64k. Cells are written into a block until
 is
  it
  = the target block size. So your single 500mb row will be broken
 down
  into
  thousands of HFile blocks in some number of HFiles. Some of those
  blocks
  may contain just a cell or two and be a couple MB in size, to hold
 the
  largest of your cells. Those blocks will be loaded into the Block
  Cache as
  they're accessed. If your careful with your access patterns and only
  request cells that you need to evaluate, you'll only ever load the
  blocks
  containing those cells into the cache.
 
  Will the entire row be loaded or only the qualifiers I ask for?
 
  So then, the answer to your question is: it depends on how you're
  interacting with the row from your coprocessor. The read path will
 only
  load blocks that your scanner requests. If your coprocessor is
  producing
  scanner with to seek to specific qualifiers, you'll only load those
  blocks.
 
  Related question: Is there a reason you're using a coprocessor
 instead
  of
  a
  regular filter, or a simple qualified get/scan to access data from
  these
  rows? The default stuff is already tuned to load data sparsely, as
  would
  be desirable for your schema.
 
  -n
 
  On Tue, Apr 7, 2015 at 2:22 PM, Kristoffer Sjögren sto...@gmail.com
 
  wrote:
 
  Sorry I should have explained my use case a bit more.
 
  Yes, it's a pretty big row and it's close to worst case. Normally
  there
  would be fewer qualifiers and the largest qualifiers would be
 smaller.
 
  The reason why these rows gets big is because they stores aggregated
  data
  in indexed compressed form. This format allow for extremely fast
  queries
  (on local disk format) over billions of rows (not rows in HBase
  speak),
  when touching smaller areas of the data. If would store the data as
  regular
  HBase rows 

Re: Rowkey design question

2015-04-07 Thread Imants Cekusins
 how HBase loads the data into memory.

If you init Get and specify columns with addColumn, it is likely that only
data for these columns is read and loaded in memory.

Rowkey is best kept short. So are column qualifiers.
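
A minimal sketch of that, assuming the HBase 1.0 client API; the table,
family and qualifier names are placeholders. Only the two named columns are
requested, so only the blocks containing them need to be read.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class QualifiedGetSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Table table = conn.getTable(TableName.valueOf("BigTable"))) {
      Get get = new Get(Bytes.toBytes("row-42"));  // hypothetical row key
      get.addColumn(Bytes.toBytes("CF"), Bytes.toBytes("q0001"));
      get.addColumn(Bytes.toBytes("CF"), Bytes.toBytes("q0002"));
      Result result = table.get(get);
      byte[] v = result.getValue(Bytes.toBytes("CF"), Bytes.toBytes("q0001"));
      System.out.println("q0001 is " + (v == null ? 0 : v.length) + " bytes");
    }
  }
}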


Re: Rowkey design question

2015-04-07 Thread Kristoffer Sjögren
Sorry I should have explained my use case a bit more.

Yes, it's a pretty big row and it's close to worst case. Normally there
would be fewer qualifiers and the largest qualifiers would be smaller.

The reason these rows get big is that they store aggregated data
in an indexed, compressed form. This format allows extremely fast queries
(on the local on-disk format) over billions of rows (not rows in the HBase
sense) when touching smaller areas of the data. If I stored the data as regular
HBase rows, things would get very slow unless I had many, many region servers.

The coprocessor is used for doing custom queries on the indexed data inside
the region servers. These queries are not like a regular row scan, but very
specific as to how the data is formatted within each column qualifier.

Yes, this is not possible if HBase loads the whole 500MB each time I want
to perform this custom query on a row. Hence my question :-)




On Tue, Apr 7, 2015 at 11:03 PM, Michael Segel michael_se...@hotmail.com
wrote:

 Sorry, but your initial problem statement doesn’t seem to parse …

 Are you saying that you a single row with approximately 100,000 elements
 where each element is roughly 1-5KB in size and in addition there are ~5
 elements which will be between one and five MB in size?

 And you then mention a coprocessor?

 Just looking at the numbers… 100K * 5KB means that each row would end up
 being 500MB in size.

 That’s a pretty fat row.

 I would suggest rethinking your strategy.

  On Apr 7, 2015, at 11:13 AM, Kristoffer Sjögren sto...@gmail.com
 wrote:
 
  Hi
 
  I have a row with around 100.000 qualifiers with mostly small values
 around
  1-5KB and maybe 5 largers ones around 1-5 MB. A coprocessor do random
  access of 1-10 qualifiers per row.
 
  I would like to understand how HBase loads the data into memory. Will the
  entire row be loaded or only the qualifiers I ask for (like pointer
 access
  into a direct ByteBuffer) ?
 
  Cheers,
  -Kristoffer

 The opinions expressed here are mine, while they may reflect a cognitive
 thought, that is purely accidental.
 Use at your own risk.
 Michael Segel
 michael_segel (AT) hotmail.com








Re: Rowkey design question

2015-04-07 Thread Michael Segel
Sorry, but your initial problem statement doesn’t seem to parse … 

Are you saying that you have a single row with approximately 100,000 elements where 
each element is roughly 1-5KB in size and, in addition, there are ~5 elements 
which will be between one and five MB in size? 

And you then mention a coprocessor? 

Just looking at the numbers… 100K * 5KB means that each row would end up being 
500MB in size. 

That’s a pretty fat row.

I would suggest rethinking your strategy. 

 On Apr 7, 2015, at 11:13 AM, Kristoffer Sjögren sto...@gmail.com wrote:
 
 Hi
 
 I have a row with around 100,000 qualifiers with mostly small values around
 1-5KB and maybe 5 larger ones around 1-5 MB. A coprocessor does random
 access of 1-10 qualifiers per row.
 
 I would like to understand how HBase loads the data into memory. Will the
 entire row be loaded or only the qualifiers I ask for (like pointer access
 into a direct ByteBuffer) ?
 
 Cheers,
 -Kristoffer

The opinions expressed here are mine, while they may reflect a cognitive 
thought, that is purely accidental. 
Use at your own risk. 
Michael Segel
michael_segel (AT) hotmail.com







Re: Rowkey design question

2015-04-07 Thread Nick Dimiduk
Those rows are written out into HBase blocks on cell boundaries. Your
column family has a BLOCK_SIZE attribute, which you may or may not have
overridden from the default of 64k. Cells are written into a block until it is
>= the target block size. So your single 500MB row will be broken down into
thousands of HFile blocks in some number of HFiles. Some of those blocks
may contain just a cell or two and be a couple MB in size, to hold the
largest of your cells. Those blocks will be loaded into the Block Cache as
they're accessed. If you're careful with your access patterns and only
request cells that you need to evaluate, you'll only ever load the blocks
containing those cells into the cache.

 Will the entire row be loaded or only the qualifiers I ask for?

So then, the answer to your question is: it depends on how you're
interacting with the row from your coprocessor. The read path will only
load blocks that your scanner requests. If your coprocessor is producing a
scanner that seeks to specific qualifiers, you'll only load those blocks.

Related question: Is there a reason you're using a coprocessor instead of a
regular filter, or a simple qualified get/scan to access data from these
rows? The default stuff is already tuned to load data sparsely, as would
be desirable for your schema.

-n
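
For reference, BLOCK_SIZE is set per column family. A hedged sketch of
creating a table with a non-default block size (the table and family names
are made up, and 32 KB is an arbitrary choice; smaller blocks mean less data
loaded into the cache per random read, at the cost of a larger block index):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

public class CreateTableWithBlockSize {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Admin admin = conn.getAdmin()) {
      HTableDescriptor table = new HTableDescriptor(TableName.valueOf("BigTable"));
      HColumnDescriptor cf = new HColumnDescriptor("CF");
      cf.setBlocksize(32 * 1024);  // 32 KB instead of the 64 KB default
      table.addFamily(cf);
      admin.createTable(table);
    }
  }
}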

On Tue, Apr 7, 2015 at 2:22 PM, Kristoffer Sjögren sto...@gmail.com wrote:

 Sorry I should have explained my use case a bit more.

 Yes, it's a pretty big row and it's close to worst case. Normally there
 would be fewer qualifiers and the largest qualifiers would be smaller.

 The reason why these rows gets big is because they stores aggregated data
 in indexed compressed form. This format allow for extremely fast queries
 (on local disk format) over billions of rows (not rows in HBase speak),
 when touching smaller areas of the data. If would store the data as regular
 HBase rows things would get very slow unless I had many many region
 servers.

 The coprocessor is used for doing custom queries on the indexed data inside
 the region servers. These queries are not like a regular row scan, but very
 specific as to how the data is formatted withing each column qualifier.

 Yes, this is not possible if HBase loads the whole 500MB each time i want
 to perform this custom query on a row. Hence my question :-)




 On Tue, Apr 7, 2015 at 11:03 PM, Michael Segel michael_se...@hotmail.com
 wrote:

  Sorry, but your initial problem statement doesn’t seem to parse …
 
  Are you saying that you a single row with approximately 100,000 elements
  where each element is roughly 1-5KB in size and in addition there are ~5
  elements which will be between one and five MB in size?
 
  And you then mention a coprocessor?
 
  Just looking at the numbers… 100K * 5KB means that each row would end up
  being 500MB in size.
 
  That’s a pretty fat row.
 
  I would suggest rethinking your strategy.
 
   On Apr 7, 2015, at 11:13 AM, Kristoffer Sjögren sto...@gmail.com
  wrote:
  
   Hi
  
   I have a row with around 100.000 qualifiers with mostly small values
  around
   1-5KB and maybe 5 largers ones around 1-5 MB. A coprocessor do random
   access of 1-10 qualifiers per row.
  
   I would like to understand how HBase loads the data into memory. Will
 the
   entire row be loaded or only the qualifiers I ask for (like pointer
  access
   into a direct ByteBuffer) ?
  
   Cheers,
   -Kristoffer
 
  The opinions expressed here are mine, while they may reflect a cognitive
  thought, that is purely accidental.
  Use at your own risk.
  Michael Segel
  michael_segel (AT) hotmail.com
 
 
 
 
 
 



Re: Rowkey design question

2013-02-21 Thread Asaf Mesika
An easier way is to place one byte before the timestamp, which is called a
bucket. You can calculate it by taking the timestamp modulo the
number of buckets. We are now in the process of field testing it.
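
A sketch of that bucketing scheme, with the bucket count and key layout as
assumptions (one bucket byte, then an 8-byte timestamp, then the rest of the
key); a time-range query then fans out into one scan per bucket and the
results are merged on the client.

import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class BucketedRowKeys {
  static final int NUM_BUCKETS = 8;  // hypothetical bucket count

  // Row key layout: [1 bucket byte][8-byte timestamp][suffix, e.g. the ip].
  static byte[] rowKey(long timestamp, byte[] suffix) {
    byte bucket = (byte) (timestamp % NUM_BUCKETS);
    return Bytes.add(new byte[] { bucket }, Bytes.toBytes(timestamp), suffix);
  }

  // One scan per bucket covering [startTs, stopTs); merge results client-side.
  static List<Scan> timeRangeScans(long startTs, long stopTs) {
    List<Scan> scans = new ArrayList<Scan>();
    for (int b = 0; b < NUM_BUCKETS; b++) {
      byte[] start = Bytes.add(new byte[] { (byte) b }, Bytes.toBytes(startTs));
      byte[] stop = Bytes.add(new byte[] { (byte) b }, Bytes.toBytes(stopTs));
      scans.add(new Scan(start, stop));  // stop row is exclusive
    }
    return scans;
  }
}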


On Tuesday, February 19, 2013, Paul van Hoven wrote:

 Yeah it worked fine.

 But as I understand: If I prefix my row key with something like

 md5-hash + timestamp

 then the rowkeys are probably evenly distributed but how would I
 perform then a scan restricted to a special time range?


 2013/2/19 Mohammad Tariq donta...@gmail.com javascript:;:
  No. before the timestamp. All the row keys which are identical go to the
  same region. This is the default Hbase behavior and is meant to make the
  performance better. But sometimes the machine gets overloaded with reads
  and writes because we get concentrated on that particular machine. For
  example timeseries data. So it's better to hash the keys in order to make
  them go to all the machines equally. HTH
 
  BTW, did that range query work??
 
  Warm Regards,
  Tariq
  https://mtariq.jux.com/
  cloudfront.blogspot.com
 
 
  On Tue, Feb 19, 2013 at 9:54 PM, Paul van Hoven 
  paul.van.ho...@googlemail.com wrote:
 
  Hey Tariq,
 
  thanks for your quick answer. I'm not sure if I got the idea in the
  seond part of your answer. You mean if I use a timestamp as a rowkey I
  should append a hash like this:
 
  135727920+MD5HASH
 
  and then the data would be distributed more equally?
 
 
  2013/2/19 Mohammad Tariq donta...@gmail.com:
   Hello Paul,
  
   Try this and see if it works :
  scan.setStartRow(Bytes.toBytes(startDate.getTime() + ));
  scan.setStopRow(Bytes.toBytes(endDate.getTime() + 1 + ));
  
   Also try not to use TS as the rowkey, as it may lead to RS
 hotspotting.
   Just add a hash to your rowkeys so that data is distributed evenly on
 all
   the RSs.
  
   Warm Regards,
   Tariq
   https://mtariq.jux.com/
   cloudfront.blogspot.com
  
  
   On Tue, Feb 19, 2013 at 9:41 PM, Paul van Hoven 
   paul.van.ho...@googlemail.com wrote:
  
   Hi,
  
   I'm currently playing with hbase. The design of the rowkey seems to
 be
   critical.
  
   The rowkey for a certain database table of mine is:
  
   timestamp+ipaddress
  
   It looks something like this when performing a scan on the table in
 the
   shell:
   hbase(main):012:0 scan 'ToyDataTable'
   ROW COLUMN+CELL
135702000+192.168.178.9column=CF:SampleCol,
   timestamp=1361288601717, value=Entry_1 = 2013-01-01 07:00:00
  
   Since I got several rows for different timestamps I'd like to tell a
   scan to just a region of the table for example from 2013-01-07 to
   2013-01-09. Previously I only had a timestamp as the rowkey and I
   could restrict the rowkey like that:
  
   SimpleDateFormat formatter = new SimpleDateFormat(-MM-dd
  HH:mm:ss);
   Date startDate = formatter.parse(2013-01-07
   07:00:00);
   Date endDate = formatter.parse(2013-01-10
   07:00:00);
  
   HTableInterface toyDataTable =
   pool.getTable(ToyDataTable);
   Scan scan = new Scan( Bytes.toBytes(
   startDate.getTime() ),
   Bytes.toBytes( endDate.getTime() ) );
  
  


Re: Rowkey design question

2013-02-21 Thread Mohammad Tariq
Another good point.

Warm Regards,
Tariq
https://mtariq.jux.com/
cloudfront.blogspot.com


On Fri, Feb 22, 2013 at 3:45 AM, Asaf Mesika asaf.mes...@gmail.com wrote:

 An easier way is to place one byte before the time stamp which is called a
 bucket. You can calculate it by using modulu on the time stamp by the
 number of buckets. We are now in the process of field testing it.


 On Tuesday, February 19, 2013, Paul van Hoven wrote:

  Yeah it worked fine.
 
  But as I understand: If I prefix my row key with something like
 
  md5-hash + timestamp
 
  then the rowkeys are probably evenly distributed but how would I
  perform then a scan restricted to a special time range?
 
 
  2013/2/19 Mohammad Tariq donta...@gmail.com javascript:;:
   No. before the timestamp. All the row keys which are identical go to
 the
   same region. This is the default Hbase behavior and is meant to make
 the
   performance better. But sometimes the machine gets overloaded with
 reads
   and writes because we get concentrated on that particular machine. For
   example timeseries data. So it's better to hash the keys in order to
 make
   them go to all the machines equally. HTH
  
   BTW, did that range query work??
  
   Warm Regards,
   Tariq
   https://mtariq.jux.com/
   cloudfront.blogspot.com
  
  
   On Tue, Feb 19, 2013 at 9:54 PM, Paul van Hoven 
   paul.van.ho...@googlemail.com wrote:
  
   Hey Tariq,
  
   thanks for your quick answer. I'm not sure if I got the idea in the
   seond part of your answer. You mean if I use a timestamp as a rowkey I
   should append a hash like this:
  
   135727920+MD5HASH
  
   and then the data would be distributed more equally?
  
  
   2013/2/19 Mohammad Tariq donta...@gmail.com:
Hello Paul,
   
Try this and see if it works :
   scan.setStartRow(Bytes.toBytes(startDate.getTime() + ));
   scan.setStopRow(Bytes.toBytes(endDate.getTime() + 1 + ));
   
Also try not to use TS as the rowkey, as it may lead to RS
  hotspotting.
Just add a hash to your rowkeys so that data is distributed evenly
 on
  all
the RSs.
   
Warm Regards,
Tariq
https://mtariq.jux.com/
cloudfront.blogspot.com
   
   
On Tue, Feb 19, 2013 at 9:41 PM, Paul van Hoven 
paul.van.ho...@googlemail.com wrote:
   
Hi,
   
I'm currently playing with hbase. The design of the rowkey seems to
  be
critical.
   
The rowkey for a certain database table of mine is:
   
timestamp+ipaddress
   
It looks something like this when performing a scan on the table in
  the
shell:
hbase(main):012:0 scan 'ToyDataTable'
ROW COLUMN+CELL
 135702000+192.168.178.9column=CF:SampleCol,
timestamp=1361288601717, value=Entry_1 = 2013-01-01 07:00:00
   
Since I got several rows for different timestamps I'd like to tell
 a
scan to just a region of the table for example from 2013-01-07 to
2013-01-09. Previously I only had a timestamp as the rowkey and I
could restrict the rowkey like that:
   
SimpleDateFormat formatter = new SimpleDateFormat(-MM-dd
   HH:mm:ss);
Date startDate =
 formatter.parse(2013-01-07
07:00:00);
Date endDate = formatter.parse(2013-01-10
07:00:00);
   
HTableInterface toyDataTable =
pool.getTable(ToyDataTable);
Scan scan = new Scan( Bytes.toBytes(
startDate.getTime() ),
Bytes.toBytes( endDate.getTime() ) );
   
   



Re: Rowkey design question

2013-02-19 Thread Mohammad Tariq
Hello Paul,

Try this and see if it works:
   scan.setStartRow(Bytes.toBytes(startDate.getTime() + ""));
   scan.setStopRow(Bytes.toBytes(endDate.getTime() + 1 + ""));

Also try not to use TS as the rowkey, as it may lead to RS hotspotting.
Just add a hash to your rowkeys so that data is distributed evenly on all
the RSs.

Warm Regards,
Tariq
https://mtariq.jux.com/
cloudfront.blogspot.com
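
Putting that snippet together into a runnable form, under the assumption
(from the original post) that the row key starts with the decimal string of
the epoch-millis timestamp followed by "+ipaddress", so lexicographic order
matches numeric order while the timestamps keep the same number of digits:

import java.text.SimpleDateFormat;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class TimeRangeScanSketch {
  public static void main(String[] args) throws Exception {
    SimpleDateFormat fmt = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
    long start = fmt.parse("2013-01-07 07:00:00").getTime();
    long stop = fmt.parse("2013-01-10 07:00:00").getTime();

    Configuration conf = HBaseConfiguration.create();
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Table table = conn.getTable(TableName.valueOf("ToyDataTable"))) {
      // String-encoded start/stop rows; the stop row is exclusive, hence the +1.
      Scan scan = new Scan(Bytes.toBytes(start + ""),
                           Bytes.toBytes((stop + 1) + ""));
      try (ResultScanner scanner = table.getScanner(scan)) {
        for (Result r : scanner) {
          System.out.println(Bytes.toString(r.getRow()));
        }
      }
    }
  }
}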


On Tue, Feb 19, 2013 at 9:41 PM, Paul van Hoven 
paul.van.ho...@googlemail.com wrote:

 Hi,

 I'm currently playing with hbase. The design of the rowkey seems to be
 critical.

 The rowkey for a certain database table of mine is:

 timestamp+ipaddress

 It looks something like this when performing a scan on the table in the
 shell:
 hbase(main):012:0> scan 'ToyDataTable'
 ROW COLUMN+CELL
  135702000+192.168.178.9column=CF:SampleCol,
 timestamp=1361288601717, value=Entry_1 = 2013-01-01 07:00:00

 Since I got several rows for different timestamps I'd like to tell a
 scan to just a region of the table for example from 2013-01-07 to
 2013-01-09. Previously I only had a timestamp as the rowkey and I
 could restrict the rowkey like that:

 SimpleDateFormat formatter = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
 Date startDate = formatter.parse("2013-01-07 07:00:00");
 Date endDate = formatter.parse("2013-01-10 07:00:00");

 HTableInterface toyDataTable = pool.getTable("ToyDataTable");
 Scan scan = new Scan( Bytes.toBytes( startDate.getTime() ),
 Bytes.toBytes( endDate.getTime() ) );

 But this no longer works with my new design.

 Is there a way to tell the scan object to filter the rows with respect
 to the timestamp, or do I have to use a filter object?



Re: Rowkey design question

2013-02-19 Thread Paul van Hoven
Hey Tariq,

thanks for your quick answer. I'm not sure if I got the idea in the
second part of your answer. You mean if I use a timestamp as a rowkey I
should append a hash like this:

135727920+MD5HASH

and then the data would be distributed more equally?


2013/2/19 Mohammad Tariq donta...@gmail.com:
 Hello Paul,

 Try this and see if it works :
scan.setStartRow(Bytes.toBytes(startDate.getTime() + ));
scan.setStopRow(Bytes.toBytes(endDate.getTime() + 1 + ));

 Also try not to use TS as the rowkey, as it may lead to RS hotspotting.
 Just add a hash to your rowkeys so that data is distributed evenly on all
 the RSs.

 Warm Regards,
 Tariq
 https://mtariq.jux.com/
 cloudfront.blogspot.com


 On Tue, Feb 19, 2013 at 9:41 PM, Paul van Hoven 
 paul.van.ho...@googlemail.com wrote:

 Hi,

 I'm currently playing with hbase. The design of the rowkey seems to be
 critical.

 The rowkey for a certain database table of mine is:

 timestamp+ipaddress

 It looks something like this when performing a scan on the table in the
 shell:
 hbase(main):012:0 scan 'ToyDataTable'
 ROW COLUMN+CELL
  135702000+192.168.178.9column=CF:SampleCol,
 timestamp=1361288601717, value=Entry_1 = 2013-01-01 07:00:00

 Since I got several rows for different timestamps I'd like to tell a
 scan to just a region of the table for example from 2013-01-07 to
 2013-01-09. Previously I only had a timestamp as the rowkey and I
 could restrict the rowkey like that:

 SimpleDateFormat formatter = new SimpleDateFormat(-MM-dd HH:mm:ss);
 Date startDate = formatter.parse(2013-01-07
 07:00:00);
 Date endDate = formatter.parse(2013-01-10
 07:00:00);

 HTableInterface toyDataTable =
 pool.getTable(ToyDataTable);
 Scan scan = new Scan( Bytes.toBytes(
 startDate.getTime() ),
 Bytes.toBytes( endDate.getTime() ) );

 But this no longer works with my new design.

 Is there a way to tell the scan object to filter the rows with respect
 to the timestamp, or do I have to use a filter object?



Re: Rowkey design question

2013-02-19 Thread Mohammad Tariq
No, before the timestamp. All the row keys which are identical go to the
same region. This is the default HBase behavior and is meant to make
performance better. But sometimes the machine gets overloaded with reads
and writes because the load gets concentrated on that particular machine, for
example with timeseries data. So it's better to hash the keys in order to make
them go to all the machines equally. HTH

BTW, did that range query work??

Warm Regards,
Tariq
https://mtariq.jux.com/
cloudfront.blogspot.com
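
One way to apply that advice (an interpretation, not something spelled out
here) is to hash a stable part of the key, such as the ip address, rather
than the timestamp itself; writes then spread across region servers, and a
per-ip time-range scan is still possible because the prefix is deterministic.

import java.security.MessageDigest;
import org.apache.hadoop.hbase.util.Bytes;

public class HashPrefixedRowKey {
  // Assumed layout: [4-byte md5 prefix of the ip][8-byte timestamp][ip bytes].
  static byte[] rowKey(String ip, long timestamp) throws Exception {
    MessageDigest md5 = MessageDigest.getInstance("MD5");
    byte[] prefix = new byte[4];
    System.arraycopy(md5.digest(Bytes.toBytes(ip)), 0, prefix, 0, 4);
    return Bytes.add(prefix, Bytes.toBytes(timestamp), Bytes.toBytes(ip));
  }
}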


On Tue, Feb 19, 2013 at 9:54 PM, Paul van Hoven 
paul.van.ho...@googlemail.com wrote:

 Hey Tariq,

 thanks for your quick answer. I'm not sure if I got the idea in the
 seond part of your answer. You mean if I use a timestamp as a rowkey I
 should append a hash like this:

 135727920+MD5HASH

 and then the data would be distributed more equally?


 2013/2/19 Mohammad Tariq donta...@gmail.com:
  Hello Paul,
 
  Try this and see if it works :
 scan.setStartRow(Bytes.toBytes(startDate.getTime() + ));
 scan.setStopRow(Bytes.toBytes(endDate.getTime() + 1 + ));
 
  Also try not to use TS as the rowkey, as it may lead to RS hotspotting.
  Just add a hash to your rowkeys so that data is distributed evenly on all
  the RSs.
 
  Warm Regards,
  Tariq
  https://mtariq.jux.com/
  cloudfront.blogspot.com
 
 
  On Tue, Feb 19, 2013 at 9:41 PM, Paul van Hoven 
  paul.van.ho...@googlemail.com wrote:
 
  Hi,
 
  I'm currently playing with hbase. The design of the rowkey seems to be
  critical.
 
  The rowkey for a certain database table of mine is:
 
  timestamp+ipaddress
 
  It looks something like this when performing a scan on the table in the
  shell:
  hbase(main):012:0 scan 'ToyDataTable'
  ROW COLUMN+CELL
   135702000+192.168.178.9column=CF:SampleCol,
  timestamp=1361288601717, value=Entry_1 = 2013-01-01 07:00:00
 
  Since I got several rows for different timestamps I'd like to tell a
  scan to just a region of the table for example from 2013-01-07 to
  2013-01-09. Previously I only had a timestamp as the rowkey and I
  could restrict the rowkey like that:
 
  SimpleDateFormat formatter = new SimpleDateFormat(-MM-dd
 HH:mm:ss);
  Date startDate = formatter.parse(2013-01-07
  07:00:00);
  Date endDate = formatter.parse(2013-01-10
  07:00:00);
 
  HTableInterface toyDataTable =
  pool.getTable(ToyDataTable);
  Scan scan = new Scan( Bytes.toBytes(
  startDate.getTime() ),
  Bytes.toBytes( endDate.getTime() ) );
 
  But this no longer works with my new design.
 
  Is there a way to tell the scan object to filter the rows with respect
  to the timestamp, or do I have to use a filter object?
 



Re: Rowkey design question

2013-02-19 Thread Paul van Hoven
Yeah it worked fine.

But as I understand: If I prefix my row key with something like

md5-hash + timestamp

then the rowkeys are probably evenly distributed, but how would I
then perform a scan restricted to a specific time range?


2013/2/19 Mohammad Tariq donta...@gmail.com:
 No. before the timestamp. All the row keys which are identical go to the
 same region. This is the default Hbase behavior and is meant to make the
 performance better. But sometimes the machine gets overloaded with reads
 and writes because we get concentrated on that particular machine. For
 example timeseries data. So it's better to hash the keys in order to make
 them go to all the machines equally. HTH

 BTW, did that range query work??

 Warm Regards,
 Tariq
 https://mtariq.jux.com/
 cloudfront.blogspot.com


 On Tue, Feb 19, 2013 at 9:54 PM, Paul van Hoven 
 paul.van.ho...@googlemail.com wrote:

 Hey Tariq,

 thanks for your quick answer. I'm not sure if I got the idea in the
 seond part of your answer. You mean if I use a timestamp as a rowkey I
 should append a hash like this:

 135727920+MD5HASH

 and then the data would be distributed more equally?


 2013/2/19 Mohammad Tariq donta...@gmail.com:
  Hello Paul,
 
  Try this and see if it works :
 scan.setStartRow(Bytes.toBytes(startDate.getTime() + ));
 scan.setStopRow(Bytes.toBytes(endDate.getTime() + 1 + ));
 
  Also try not to use TS as the rowkey, as it may lead to RS hotspotting.
  Just add a hash to your rowkeys so that data is distributed evenly on all
  the RSs.
 
  Warm Regards,
  Tariq
  https://mtariq.jux.com/
  cloudfront.blogspot.com
 
 
  On Tue, Feb 19, 2013 at 9:41 PM, Paul van Hoven 
  paul.van.ho...@googlemail.com wrote:
 
  Hi,
 
  I'm currently playing with hbase. The design of the rowkey seems to be
  critical.
 
  The rowkey for a certain database table of mine is:
 
  timestamp+ipaddress
 
  It looks something like this when performing a scan on the table in the
  shell:
  hbase(main):012:0 scan 'ToyDataTable'
  ROW COLUMN+CELL
   135702000+192.168.178.9column=CF:SampleCol,
  timestamp=1361288601717, value=Entry_1 = 2013-01-01 07:00:00
 
  Since I got several rows for different timestamps I'd like to tell a
  scan to just a region of the table for example from 2013-01-07 to
  2013-01-09. Previously I only had a timestamp as the rowkey and I
  could restrict the rowkey like that:
 
  SimpleDateFormat formatter = new SimpleDateFormat(-MM-dd
 HH:mm:ss);
  Date startDate = formatter.parse(2013-01-07
  07:00:00);
  Date endDate = formatter.parse(2013-01-10
  07:00:00);
 
  HTableInterface toyDataTable =
  pool.getTable(ToyDataTable);
  Scan scan = new Scan( Bytes.toBytes(
  startDate.getTime() ),
  Bytes.toBytes( endDate.getTime() ) );
 
  But this no longer works with my new design.
 
  Is there a way to tell the scan object to filter the rows with respect
  to the timestamp, or do I have to use a filter object?
 



Re: Rowkey design question

2013-02-19 Thread Mohammad Tariq
You can use FuzzyRowFilter
(http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/filter/FuzzyRowFilter.html)
to do that.

Have a look at this link
(http://blog.sematext.com/2012/08/09/consider-using-fuzzyrowfilter-when-in-need-for-secondary-indexes-in-hbase/).
You might find it helpful.

Warm Regards,
Tariq
https://mtariq.jux.com/
cloudfront.blogspot.com
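
A hedged sketch of what that might look like, assuming a fixed-width key
layout of a 4-byte hash prefix followed by an 8-byte timestamp. Note that
FuzzyRowFilter matches fixed byte positions (here: one exact timestamp), not
ranges; for a time range you would still combine it with per-prefix scans.

import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.FuzzyRowFilter;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.hbase.util.Pair;

public class FuzzyRowFilterSketch {
  // Find rows with a given timestamp regardless of the 4-byte hash prefix.
  public static Scan scanForTimestamp(long timestamp) {
    byte[] template = Bytes.add(new byte[4], Bytes.toBytes(timestamp));
    // Mask bytes: 1 = position may differ (the prefix), 0 = must match (the timestamp).
    byte[] mask = new byte[] { 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0 };
    List<Pair<byte[], byte[]>> fuzzyKeys = new ArrayList<Pair<byte[], byte[]>>();
    fuzzyKeys.add(new Pair<byte[], byte[]>(template, mask));
    Scan scan = new Scan();
    scan.setFilter(new FuzzyRowFilter(fuzzyKeys));
    return scan;
  }
}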


On Tue, Feb 19, 2013 at 11:20 PM, Paul van Hoven 
paul.van.ho...@googlemail.com wrote:

 Yeah it worked fine.

 But as I understand: If I prefix my row key with something like

 md5-hash + timestamp

 then the rowkeys are probably evenly distributed but how would I
 perform then a scan restricted to a special time range?


 2013/2/19 Mohammad Tariq donta...@gmail.com:
  No. before the timestamp. All the row keys which are identical go to the
  same region. This is the default Hbase behavior and is meant to make the
  performance better. But sometimes the machine gets overloaded with reads
  and writes because we get concentrated on that particular machine. For
  example timeseries data. So it's better to hash the keys in order to make
  them go to all the machines equally. HTH
 
  BTW, did that range query work??
 
  Warm Regards,
  Tariq
  https://mtariq.jux.com/
  cloudfront.blogspot.com
 
 
  On Tue, Feb 19, 2013 at 9:54 PM, Paul van Hoven 
  paul.van.ho...@googlemail.com wrote:
 
  Hey Tariq,
 
  thanks for your quick answer. I'm not sure if I got the idea in the
  seond part of your answer. You mean if I use a timestamp as a rowkey I
  should append a hash like this:
 
  135727920+MD5HASH
 
  and then the data would be distributed more equally?
 
 
  2013/2/19 Mohammad Tariq donta...@gmail.com:
   Hello Paul,
  
   Try this and see if it works :
  scan.setStartRow(Bytes.toBytes(startDate.getTime() + ));
  scan.setStopRow(Bytes.toBytes(endDate.getTime() + 1 + ));
  
   Also try not to use TS as the rowkey, as it may lead to RS
 hotspotting.
   Just add a hash to your rowkeys so that data is distributed evenly on
 all
   the RSs.
  
   Warm Regards,
   Tariq
   https://mtariq.jux.com/
   cloudfront.blogspot.com
  
  
   On Tue, Feb 19, 2013 at 9:41 PM, Paul van Hoven 
   paul.van.ho...@googlemail.com wrote:
  
   Hi,
  
   I'm currently playing with hbase. The design of the rowkey seems to
 be
   critical.
  
   The rowkey for a certain database table of mine is:
  
   timestamp+ipaddress
  
   It looks something like this when performing a scan on the table in
 the
   shell:
   hbase(main):012:0 scan 'ToyDataTable'
   ROW COLUMN+CELL
135702000+192.168.178.9column=CF:SampleCol,
   timestamp=1361288601717, value=Entry_1 = 2013-01-01 07:00:00
  
   Since I got several rows for different timestamps I'd like to tell a
   scan to just a region of the table for example from 2013-01-07 to
   2013-01-09. Previously I only had a timestamp as the rowkey and I
   could restrict the rowkey like that:
  
   SimpleDateFormat formatter = new SimpleDateFormat(-MM-dd
  HH:mm:ss);
   Date startDate = formatter.parse(2013-01-07
   07:00:00);
   Date endDate = formatter.parse(2013-01-10
   07:00:00);
  
   HTableInterface toyDataTable =
   pool.getTable(ToyDataTable);
   Scan scan = new Scan( Bytes.toBytes(
   startDate.getTime() ),
   Bytes.toBytes( endDate.getTime() ) );
  
   But this no longer works with my new design.
  
   Is there a way to tell the scan object to filter the rows with
 respect
   to the timestamp, or do I have to use a filter object?