Re: One-table w/ multi-CF or multi-table w/ one-CF?

2014-09-09 Thread Michael Segel
Locality? 

Then the data should be in the same column family.  That’s as local as you can 
get. 

I would suggest that you think of the following:

What’s the predominant use case? 
How are you querying the data? 
If you’re always hitting multiple CFs to get the data… then you should have it 
in the same table. 

I think more people would benefit if they took more time thinking about their 
design and how the data is being used and stored… it would help. 
Also knowing that there really isn’t a single ‘right’ answer. Just a lot of 
wrong ones. ;-) 


Most people still try to think of HBase in terms of relational modeling and not 
in terms of records and more of a hierarchical system. 
Things like CFs and Versioning are often misused because people see them as 
shortcuts. 

Also people tend not to think of their data in HBase in terms of 3D but in 
terms of 2D. 
(CF’s would be 2+D) 

The one question which really hasn’t been answered is how fat is fat in terms 
of a row’s width and when is it too fat? 
This may seem like a simple thing, but it can impact a couple of things in your 
design. (I never got a good answer, and it’s one of those questions that if your 
wife were to ask if the pants she’s wearing make her look fat, it’s time to run for 
the hills because you can’t win no matter how you answer!) 
Seriously though, the optimal width of the column is not that easy to answer 
and sometimes you have to just guess as to which would be a better design. 

One of the problems with CFs is that if there’s an imbalance in terms of the 
size of data being stored in each CF, you can run into issues. 
CFs are stored in separate files and split when the base CF splits. (Assuming 
you have a base CF and then multiple CFs that are related but store smaller 
records per row.) 
And then there’s the issue that each CF is stored separately. (If memory 
serves it’s a separate file per CF, but right now my last living brain cell 
decided to call it quits and went on strike for more beer.) 
[Damn you last brain cell!!!] :-) 

Again the idea is to follow KISS. 

HTH

-Mike

On Sep 8, 2014, at 7:17 AM, Jianshi Huang jianshi.hu...@gmail.com wrote:

 Locality is important, that's why I chose CF to put related data into one
 group. I can surely put the CF part at the head of the rowkey to achieve a
 similar result, but since the number of types is fixed, I don't see any benefit
 in doing that.
 
 With the setLoadColumnFamiliesOnDemand I learned from Ted, looks like the
 performance should be similar.
 
 Am I missing something? Please enlighten me.
 
 Jianshi
 

Re: One-table w/ multi-CF or multi-table w/ one-CF?

2014-09-07 Thread Michael Segel
I would suggest rethinking column families and looking at your potential for a 
slightly different row key. 

Going with column families doesn’t really make sense. 

Also how wide are the rows? (worst case?) 

one idea is to make type part of the RK… 

HTH

-Mike

On Sep 7, 2014, at 2:40 AM, Jianshi Huang jianshi.hu...@gmail.com wrote:

 Hi Michael,
 
 Thanks for the questions.
 
 I'm modeling dynamic Graphs in HBase, all elements (vertices, edges) have a
 timestamp and I can query things like events between A and B for the last 7
 days.
 
 CFs are used for grouping different types of data for the same account.
 However, I have lots of skew in the data; to avoid having too much in the
 same row, I had to move what was in CQs into RKs. So a CF now acts more like
 a table.
 
 There's one CF containing sequence of events ordered by timestamp, and this
 CF is quite different as the use case is mostly in mapreduce jobs.
 
 Jianshi
 
 
 
 
 On Sun, Sep 7, 2014 at 4:52 AM, Michael Segel michael_se...@hotmail.com
 wrote:
 
 Again, a silly question.
 
 Why are you using column families?
 
 Just to play devil’s advocate in terms of design, why are you not treating
 your row as a record? Think hierarchical, not relational.
 
 This really gets into some design theory.
 
 Think of a Column Family as a way to group data that has the same row key,
 references the same thing, yet the data in each column family is used
 separately.
 The example I always turn to when teaching is to think of an order entry
 system at a retailer.
 
 You generate data which is segmented by business process. (order entry,
 pick slips, shipping, invoicing) All reflect a single order, yet the data
 in each process tends to be accessed separately.
 (You don’t need the order entry when using the pick slip to pull orders
 from the warehouse.)  So here, the data access pattern is that each column
 family is used separately, except in generating the data (the order entry
 is used to generate the pick slip(s) and set up things like backorders, and
 then the pick process generates the shipping slip(s), etc.). And since they
 are all focused on the same order, they have the same row key.
 
 So it’s reasonable to ask how you are accessing the data and how you are
 designing your HBase model.
 
 Many times,  developers create a model using column families because the
 developer is thinking in terms of relationships. Not access patterns on the
 data.
 
 Does this make sense?
 
 

One-table w/ multi-CF or multi-table w/ one-CF?

2014-09-06 Thread Jianshi Huang
Hi,

I'm currently putting everything into one table (to make cross reference
queries easier) and there's one CF which contains rowkeys very different to
the rest. Currently it works well, but I'm wondering if it will cause
performance issues in the future.

So my questions are:

1) Will there be performance penalties with the way I'm doing it?
2) Should I move that CF to a separate table?


Thanks,
-- 
Jianshi Huang

LinkedIn: jianshi
Twitter: @jshuang
Github  Blog: http://huangjs.github.com/


Re: One-table w/ multi-CF or multi-table w/ one-CF?

2014-09-06 Thread Ted Yu
Is this the same table you mentioned in the thread about
RegionTooBusyException?

If you move the column family to another table, you may have to handle
atomicity yourself - currently atomic operations are within region
boundaries.

Cheers



Re: One-table w/ multi-CF or multi-table w/ one-CF?

2014-09-06 Thread Jianshi Huang
Hi Ted,

Yes, that's the table having RegionTooBusyExceptions :) But the performance
I care most about is scan performance.

It's mostly for analytics, so I don't care much about atomicity currently.

What's your suggestion?

Jianshi



Re: One-table w/ multi-CF or multi-table w/ one-CF?

2014-09-06 Thread Jianshi Huang
Well, write performance is also important... I'll probably ingest 1k~10k
records/second.

Jianshi



Re: One-table w/ multi-CF or multi-table w/ one-CF?

2014-09-06 Thread Ted Yu
If you use monotonically increasing rowkeys, separating out the column
family into a new table would give you the same issue you're facing today.

With a single table, the essential column family feature would reduce the
amount of heap memory used during a scan. With two tables, there is no such
facility.

Cheers



Re: One-table w/ multi-CF or multi-table w/ one-CF?

2014-09-06 Thread Jianshi Huang
Thanks Ted,

I'll pre-split the table during ingestion. The reason to keep the rowkey
monotonic is to make it easier to work with TableInputFormat; otherwise I
would've binned it into 256 splits. (Well, I think a good way is to extend
TableInputFormat to accept multiple row ranges. If there's an existing
efficient implementation, please let me know :)

Would you elaborate a little more on the heap memory usage during scan? Is
there any reference to that?

Jianshi




Re: One-table w/ multi-CF or multi-table w/ one-CF?

2014-09-06 Thread Ted Yu
Please refer to HBASE-5416, "Filter on one CF and if a match, then load and
return full row".

bq. to extend TableInputFormat to accept multiple row ranges

You mean extending hbase.mapreduce.scan.row.start and
hbase.mapreduce.scan.row.stop so that multiple ranges can be specified?
How many such ranges do you normally need?

Cheers



Re: One-table w/ multi-CF or multi-table w/ one-CF?

2014-09-06 Thread Jianshi Huang
Thanks Ted for the reference.

That's right, extend the row.start and row.stop to specify multiple ranges,
and also getSplits.

I would probably bin the event sequence CF into 16 to 256 bins. So 16 to
256 ranges.

Jianshi




Re: One-table w/ multi-CF or multi-table w/ one-CF?

2014-09-06 Thread Jianshi Huang
Each range might span multiple regions, depending on the data size I want
to scan for MR jobs.

The ranges are dynamic, specified by the user, but the number of bins can
be static (when the table/schema is created).
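
To make the "one range per bin" idea concrete, here is a rough sketch in Python (not from the original post). It assumes the binned key bin_number#type_of_events#rev_timestamp#id, a zero-padded 3-digit bin, and rev_timestamp = Long.MAX_VALUE - timestamp in milliseconds, also zero-padded. The function name and the padding widths are made up for illustration; how the ranges are handed to the MR job is left open, as in the thread.

```python
from typing import List, Tuple

MAX_LONG = 2**63 - 1  # assumed: rev_timestamp = Long.MAX_VALUE - timestamp (ms)


def bin_scan_ranges(event_type: str, start_ms: int, end_ms: int,
                    num_bins: int = 256) -> List[Tuple[str, str]]:
    """One (start_row, stop_row) pair per bin covering start_ms..end_ms inclusive.

    start_row is inclusive and stop_row is exclusive, as in an HBase scan.
    """
    # Newer events have smaller reverse timestamps, so the window's end
    # gives the start row and the window's start gives the stop row.
    rev_first = MAX_LONG - end_ms
    rev_stop = MAX_LONG - start_ms + 1  # +1 keeps events at exactly start_ms
    return [
        (f"{b:03d}#{event_type}#{rev_first:019d}",
         f"{b:03d}#{event_type}#{rev_stop:019d}")
        for b in range(num_bins)
    ]
```

The time window is dynamic (user supplied) while num_bins is fixed with the schema, so a query always fans out into exactly num_bins ranges.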

Jianshi


On Sun, Sep 7, 2014 at 2:23 AM, Ted Yu yuzhih...@gmail.com wrote:

 bq. 16 to 256 ranges

 Would each range be within a single region, or may a range span regions?
 Are the ranges dynamic?

 Using the command line for multiple ranges would be out of the question. A file
 with ranges is needed.

 Cheers



Re: One-table w/ multi-CF or multi-table w/ one-CF?

2014-09-06 Thread Jianshi Huang
BTW, a little explanation about the binning I mentioned.

Currently the rowkey looks like type_of_events#rev_timestamp#id.

And with binning, it looks like
bin_number#type_of_events#rev_timestamp#id. The bin_number could be
id % 256 or timestamp % 256. And the table could be pre-split. So future
ingestions could insert in parallel into #bin regions, even without
pre-splitting.
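
A minimal sketch of the two key layouts in Python (not from the original post), assuming an integer id, bin_number = id % 256, and rev_timestamp = Long.MAX_VALUE - timestamp in milliseconds. The zero-padding widths and function names are assumptions, chosen so that string order matches numeric order.

```python
from typing import List

# Sketch only: field widths and the reverse-timestamp convention are assumptions.
MAX_LONG = 2**63 - 1   # assumed: rev_timestamp = Long.MAX_VALUE - timestamp (ms)
NUM_BINS = 256         # fixed when the table is created


def plain_rowkey(event_type: str, ts_millis: int, event_id: int) -> str:
    """type_of_events#rev_timestamp#id, with the timestamp zero-padded."""
    rev_ts = MAX_LONG - ts_millis
    return f"{event_type}#{rev_ts:019d}#{event_id}"


def binned_rowkey(event_type: str, ts_millis: int, event_id: int,
                  num_bins: int = NUM_BINS) -> str:
    """bin_number#type_of_events#rev_timestamp#id, with bin_number = id % num_bins."""
    bin_number = event_id % num_bins
    return f"{bin_number:03d}#{plain_rowkey(event_type, ts_millis, event_id)}"


def bin_split_keys(num_bins: int = NUM_BINS) -> List[str]:
    """Split points giving one region per bin (num_bins - 1 boundaries)."""
    return [f"{b:03d}" for b in range(1, num_bins)]
```

With these assumptions, an event with id 257 lands in bin 001, and bin_split_keys() yields the 255 boundaries 001 through 255 that could be used to pre-split the table into 256 regions.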


Jianshi


On Sun, Sep 7, 2014 at 2:34 AM, Jianshi Huang jianshi.hu...@gmail.com
wrote:

 Each range might span multiple regions, depending on the data size I want
 scan for MR jobs.

 The ranges are dynamic, specified by the user, but the number of bins can
 be static (when the table/schema is created).

 Jianshi


 On Sun, Sep 7, 2014 at 2:23 AM, Ted Yu yuzhih...@gmail.com wrote:

 bq. 16 to 256 ranges

 Would each range be within single region or the range may span regions ?
 Are the ranges dynamic ?

 Using command line for multiple ranges would be out of question. A file
 with ranges is needed.

 Cheers


 On Sat, Sep 6, 2014 at 11:18 AM, Jianshi Huang jianshi.hu...@gmail.com
 wrote:

  Thanks Ted for the reference.
 
  That's right, extend the row.start and row.end to specify multiple
 ranges
  and also getSplits.
 
  I would probably bin the event sequence CF into 16 to 256 bins. So 16 to
  256 ranges.
 
  Jianshi
 
 
 
  On Sun, Sep 7, 2014 at 2:09 AM, Ted Yu yuzhih...@gmail.com wrote:
 
   Please refer to HBASE-5416 Filter on one CF and if a match, then load
 and
   return full row
  
   bq. to extend TableInputFormat to accept multiple row ranges
  
   You mean extending hbase.mapreduce.scan.row.start and
   hbase.mapreduce.scan.row.stop so that multiple ranges can be
 specified ?
   How many such ranges do you normally need ?
  
   Cheers
  
  
   On Sat, Sep 6, 2014 at 11:01 AM, Jianshi Huang 
 jianshi.hu...@gmail.com
   wrote:
  
Thanks Ted,
   
I'll pre-split the table during ingestion. The reason to keep the
  rowkey
monotonic is for easier working with TableInputFormat, otherwise I
   would've
binned it into 256 splits. (well, I think a good way is to extend
TableInputFormat to accept multiple row ranges, if there's an
 existing
efficient implementation, please let me know :)
   
Would you elaborate a little more on the heap memory usage during
 scan?
   Is
there any reference to that?
   
Jianshi
   
   
   
On Sun, Sep 7, 2014 at 1:20 AM, Ted Yu yuzhih...@gmail.com wrote:
   
 If you use monotonically increasing rowkeys, separating out the
  column
 family into a new table would give you same issue you're facing
  today.

 Using a single table, essential column family feature would reduce
  the
 amount of heap memory used during scan. With two tables, there is
 no
   such
 facility.

 Cheers


     On Sat, Sep 6, 2014 at 10:11 AM, Jianshi Huang jianshi.hu...@gmail.com
     wrote:

      Hi Ted,

      Yes, that's the table having RegionTooBusyExceptions :) But the
      performance I care most about is scan performance.

      It's mostly for analytics, so I don't care much about atomicity
      currently.

      What's your suggestion?
 
  Jianshi
 
 
      On Sun, Sep 7, 2014 at 1:08 AM, Ted Yu yuzhih...@gmail.com wrote:

       Is this the same table you mentioned in the thread about
       RegionTooBusyException?

       If you move the column family to another table, you may have to handle
       atomicity yourself - currently atomic operations are within region
       boundaries.
  
   Cheers
  
  
       On Sat, Sep 6, 2014 at 9:49 AM, Jianshi Huang jianshi.hu...@gmail.com
       wrote:

        Hi,

        I'm currently putting everything into one table (to make cross
        reference queries easier) and there's one CF which contains rowkeys
        very different from the rest. Currently it works well, but I'm
        wondering if it will cause performance issues in the future.

        So my questions are

        1) Will there be performance penalties with the way I'm doing it?
        2) Should I move that CF to a separate table?
   
   
Thanks,
--
Jianshi Huang
   
LinkedIn: jianshi
Twitter: @jshuang
Github  Blog: http://huangjs.github.com/
   
  
 
 
 

Re: One-table w/ multi-CF or multi-table w/ one-CF?

2014-09-06 Thread Michael Segel
Again, a silly question. 

Why are you using column families? 

Just to play devil’s advocate in terms of design, why are you not treating your 
row as a record? Think hierarchical, not relational. 

This really gets into some design theory. 

Think of a column family as a way to group data that has the same row key and 
references the same thing, yet where the data in each column family is used 
separately. 
The example I always turn to when teaching is an order entry system at a 
retailer. 

You generate data which is segmented by business process (order entry, pick 
slips, shipping, invoicing). All reflect a single order, yet the data in each 
process tends to be accessed separately. 
(You don’t need the order entry when using the pick slip to pull orders from 
the warehouse.) So here, the data access pattern is that each column family is 
used separately, except when generating the data: the order entry is used to 
generate the pick slip(s) and set up things like backorders, and then the pick 
process generates the shipping slip(s), etc. And since they are all focused on 
the same order, they have the same row key.
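
To make the order example concrete, here is a toy sketch in Python. It is an
in-memory stand-in, not the HBase client API, and the table layout, family
names, qualifiers, and the read_family helper are all made up for
illustration. It only shows the access pattern: one row key per order, one
column family per business process, and a reader that touches a single family.

```python
# Toy in-memory model of one HBase row; this is NOT the HBase client API.
# Layout: row key -> column family -> qualifier -> value.
# All names and values below are hypothetical.
orders = {
    "order-000123": {
        "entry":    {"customer": "C42", "sku": "widget", "qty": "3"},
        "pick":     {"slip_id": "P-9", "bin_location": "A7"},
        "shipping": {"carrier": "ground", "tracking": "T-1"},
        "invoice":  {"amount": "29.97"},
    }
}


def read_family(table, row_key, family):
    """Return only the cells of one column family for a row."""
    return dict(table[row_key][family])


# The warehouse picker needs the pick slip only, not the order entry data.
pick_slip = read_family(orders, "order-000123", "pick")
print(pick_slip)
```

If most reads looked like read_family over several families at once, that
would be a hint that the data belongs in one family instead.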

So its reasonable to ask how you are accessing the data and how you are 
designing your HBase model? 

Many times,  developers create a model using column families because the 
developer is thinking in terms of relationships. Not access patterns on the 
data. 

Does this make sense? 

 
On Sep 6, 2014, at 7:46 PM, Jianshi Huang jianshi.hu...@gmail.com wrote:

 BTW, a little explanation about the binning I mentioned.
 
 Currently the rowkey looks like type_of_events#rev_timestamp#id.
 
 And with binning, it looks like
 bin_number#type_of_events#rev_timestamp#id. The bin_number could be
 id % 256 or timestamp % 256, and the table could be pre-split. Future
 ingestion could then insert in parallel into #bin regions, even without a
 pre-split.
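
A minimal Python sketch of the binned rowkey described above, under some
assumptions that are not in the original message: the id is an integer, the
reversed timestamp is computed as MAX_TS minus a millisecond timestamp, and
numeric fields are zero-padded so that byte order matches numeric order. The
function name make_rowkey is hypothetical.

```python
MAX_TS = 2**63 - 1  # assumed upper bound used to reverse timestamps
NUM_BINS = 256      # number of bins, fixed when the table/schema is created


def make_rowkey(event_type, timestamp_ms, event_id, num_bins=NUM_BINS):
    """Build bin_number#type_of_events#rev_timestamp#id as a string."""
    bin_number = event_id % num_bins       # id % 256, as in the message
    rev_timestamp = MAX_TS - timestamp_ms  # newer events get smaller values
    # Zero-pad the bin and the reversed timestamp so that lexicographic
    # order of the key agrees with numeric order of those fields.
    return "%03d#%s#%019d#%d" % (bin_number, event_type, rev_timestamp, event_id)


key = make_rowkey("click", 1410000000000, 1000)
print(key)
```

Within one bin and event type, newer events sort before older ones, while
writes for different ids spread across up to num_bins key prefixes.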
 
 
 Jianshi
 
 
 On Sun, Sep 7, 2014 at 2:34 AM, Jianshi Huang jianshi.hu...@gmail.com
 wrote:
 
 Each range might span multiple regions, depending on how much data I want to
 scan in the MR jobs.
 
 The ranges are dynamic, specified by the user, but the number of bins can
 be static (when the table/schema is created).
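
To illustrate why binning turns one user-specified time range into "16 to
256 ranges", here is a rough Python sketch that derives one (start row, stop
row) pair per bin. It assumes the rowkey layout
bin_number#type_of_events#rev_timestamp#id with a zero-padded bin and a
zero-padded reversed timestamp (MAX_TS minus milliseconds); those encoding
details and the name bin_scan_ranges are assumptions, not something the
thread specifies. It does not call HBase; the pairs would still have to be
fed to whatever multi-range input format is used.

```python
MAX_TS = 2**63 - 1  # assumed bound for reversing timestamps


def bin_scan_ranges(event_type, start_ms, end_ms, num_bins=16):
    """Return one (start_row, stop_row) pair per bin for start_ms <= t < end_ms.

    Start rows are inclusive and stop rows exclusive, matching HBase scans.
    """
    ranges = []
    for b in range(num_bins):
        prefix = "%03d#%s#" % (b, event_type)
        # Timestamps are reversed, so the newest allowed event (t = end_ms - 1)
        # has the smallest reversed value and defines the start row.
        start_row = prefix + "%019d" % (MAX_TS - end_ms + 1)
        # The oldest allowed event (t = start_ms) has reversed value
        # MAX_TS - start_ms; the exclusive stop row is one past it.
        stop_row = prefix + "%019d" % (MAX_TS - start_ms + 1)
        ranges.append((start_row, stop_row))
    return ranges


ranges = bin_scan_ranges("click", 1000, 2000, num_bins=16)
print(len(ranges), ranges[0])
```

The number of pairs depends only on the bin count, which is why the bins can
be static while the ranges themselves stay dynamic.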
 
 Jianshi
 
 
 On Sun, Sep 7, 2014 at 2:23 AM, Ted Yu yuzhih...@gmail.com wrote:
 
 bq. 16 to 256 ranges
 
 Would each range fall within a single region, or may a range span regions?
 Are the ranges dynamic?
 
 Passing multiple ranges on the command line would be out of the question. A
 file with ranges is needed.
 
 Cheers
 
 

Re: One-table w/ multi-CF or multi-table w/ one-CF?

2014-09-06 Thread Jianshi Huang
Hi Michael,

Thanks for the questions.

I'm modeling dynamic graphs in HBase. All elements (vertices, edges) have a
timestamp, and I can query things like events between A and B for the last 7
days.

CFs are used for grouping different types of data for the same account.
However, the data is heavily skewed; to avoid having too much in a single
row, I had to move what was in CQs into RKs. So a CF now acts more like
a table.

There's one CF containing a sequence of events ordered by timestamp, and this
CF is quite different, as its use case is mostly MapReduce jobs.

Jianshi




 
  On Sat, Sep 6, 2014 at 10:11 AM,