Re: One-table w/ multi-CF or multi-table w/ one-CF?
Locality? Then the data should be in the same column family. That’s as local as you can get.

I would suggest that you think about the following: What’s the predominant use case? How are you querying the data? If you’re always hitting multiple CFs to get the data… then you should have it in the same table.

I think more people would benefit if they took more time thinking about their design and how the data is being used and stored. It also helps to know that there really isn’t a single ‘right’ answer. Just a lot of wrong ones. ;-)

Most people still try to think of HBase in terms of relational modeling and not in terms of records and more of a hierarchical system. Things like CFs and versioning are often misused because people see them as shortcuts. Also, people tend to think of their data in HBase in terms of 2D rather than 3D. (CFs would be 2+D.)

The one question which really hasn’t been answered is how fat is fat in terms of a row’s width, and when is it too fat? This may seem like a simple thing, but it can impact a couple of things in your design. (I never got a good answer, and it’s one of those questions like your wife asking whether the pants she’s wearing make her look fat: it’s time to run for the hills because you can’t win no matter how you answer!)

Seriously though, the optimal width of the column is not that easy to answer, and sometimes you just have to guess which would be the better design.

One of the problems with CFs is that if there’s an imbalance in the size of the data being stored in each CF, you can run into issues. CFs are stored in separate files and split when the base CF splits. (Assuming you have a base CF and then multiple CFs that are related but store smaller records per row.) And then there’s the issue that each CF is stored separately. (If memory serves, it’s a separate file per CF, but right now my last living brain cell decided to call it quits and went on strike for more beer.) [Damn you, last brain cell!!!] :-)

Again, the idea is to follow KISS.

HTH

-Mike

On Sep 8, 2014, at 7:17 AM, Jianshi Huang jianshi.hu...@gmail.com wrote:

Locality is important, that's why I chose CFs to put related data into one group. I can surely put the CF part at the head of the rowkey to achieve a similar result, but since the number of types is fixed, I don't see any benefit in doing that. With the setLoadColumnFamiliesOnDemand I learned from Ted, it looks like the performance should be similar. Am I missing something? Please enlighten me.

Jianshi
Re: One-table w/ multi-CF or multi-table w/ one-CF?
I would suggest rethinking column families and looking at your potential for a slightly different row key. Going with column families doesn’t really make sense. Also, how wide are the rows? (Worst case?) One idea is to make type part of the RK…

HTH

-Mike

On Sep 7, 2014, at 2:40 AM, Jianshi Huang jianshi.hu...@gmail.com wrote:

Hi Michael,

Thanks for the questions. I'm modeling dynamic graphs in HBase. All elements (vertices, edges) have a timestamp, and I can query things like events between A and B for the last 7 days. CFs are used for grouping different types of data for the same account. However, I have a lot of skew in the data; to avoid having too much in the same row, I had to move what was in CQs into RKs. So a CF now acts more like a table. There's one CF containing a sequence of events ordered by timestamp, and this CF is quite different, as its use case is mostly MapReduce jobs.

Jianshi
One-table w/ multi-CF or multi-table w/ one-CF?
Hi,

I'm currently putting everything into one table (to make cross-reference queries easier), and there's one CF which contains rowkeys very different from the rest. Currently it works well, but I'm wondering if it will cause performance issues in the future.

So my questions are:

1) Will there be performance penalties with the way I'm doing it?
2) Should I move that CF to a separate table?

Thanks,

-- Jianshi Huang LinkedIn: jianshi Twitter: @jshuang Github Blog: http://huangjs.github.com/
Re: One-table w/ multi-CF or multi-table w/ one-CF?
Is this the same table you mentioned in the thread about RegionTooBusyException? If you move the column family to another table, you may have to handle atomicity yourself - currently atomic operations are within region boundaries.

Cheers
Re: One-table w/ multi-CF or multi-table w/ one-CF?
Hi Ted,

Yes, that's the table having RegionTooBusyExceptions :) But the performance I care most about is scan performance. It's mostly for analytics, so I don't care much about atomicity currently. What's your suggestion?

Jianshi
Re: One-table w/ multi-CF or multi-table w/ one-CF?
Well, write performance is also important... I'll probably ingest 1k~10k records/second.

Jianshi
Re: One-table w/ multi-CF or multi-table w/ one-CF?
If you use monotonically increasing rowkeys, separating out the column family into a new table would give you the same issue you're facing today. Using a single table, the essential column family feature would reduce the amount of heap memory used during a scan. With two tables, there is no such facility.

Cheers
Re: One-table w/ multi-CF or multi-table w/ one-CF?
Thanks Ted, I'll pre-split the table during ingestion. The reason to keep the rowkey monotonic is to make it easier to work with TableInputFormat; otherwise I would've binned it into 256 splits. (Well, I think a good way is to extend TableInputFormat to accept multiple row ranges. If there's an existing efficient implementation, please let me know :)

Would you elaborate a little more on the heap memory usage during scan? Is there any reference for that?

Jianshi
Re: One-table w/ multi-CF or multi-table w/ one-CF?
Please refer to HBASE-5416 (Filter on one CF and if a match, then load and return full row).

bq. to extend TableInputFormat to accept multiple row ranges

You mean extending hbase.mapreduce.scan.row.start and hbase.mapreduce.scan.row.stop so that multiple ranges can be specified? How many such ranges do you normally need?

Cheers
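The idea behind HBASE-5416 (evaluate the filter on one column family and load the others only when a row matches) can be pictured with a toy model. The sketch below is not HBase code: the in-memory `rows` layout, the `scan` function and its `lazy` flag are all hypothetical stand-ins, and "families loaded" is only a rough proxy for the heap used during a scan.

```python
def scan(rows, essential_cf, predicate, lazy=True):
    """Toy scan over rows shaped as {row_key: {cf: {qualifier: value}}}.

    Returns (matching_rows, families_loaded). With lazy=True, only the
    essential family is read for every row; the remaining families are
    read only for rows whose essential family passes the predicate.
    """
    families_loaded = 0
    matches = {}
    for key, families in rows.items():
        if lazy:
            families_loaded += 1  # essential family only
            if predicate(families[essential_cf]):
                families_loaded += len(families) - 1  # load the rest on a match
                matches[key] = families
        else:
            families_loaded += len(families)  # everything is read up front
            if predicate(families[essential_cf]):
                matches[key] = families
    return matches, families_loaded
```

With a selective filter on a small family and a large payload in another family, the lazy variant touches far fewer families. Splitting the families across two tables gives up this option, which is the point made earlier in the thread.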
Re: One-table w/ multi-CF or multi-table w/ one-CF?
Thanks Ted for the reference.

That's right, extend the row.start and row.stop to specify multiple ranges, and also getSplits.

I would probably bin the event sequence CF into 16 to 256 bins. So 16 to 256 ranges.

Jianshi
Re: One-table w/ multi-CF or multi-table w/ one-CF?
Each range might span multiple regions, depending on the data size I want to scan for MR jobs. The ranges are dynamic, specified by the user, but the number of bins can be static (set when the table/schema is created).

Jianshi

On Sun, Sep 7, 2014 at 2:23 AM, Ted Yu yuzhih...@gmail.com wrote:

bq. 16 to 256 ranges

Would each range be within a single region, or may the range span regions? Are the ranges dynamic? Using the command line for multiple ranges would be out of the question. A file with ranges is needed.

Cheers
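To make the "16 to 256 ranges" concrete: with a binned row key, one user-specified time range has to be expanded into one row range per bin before it can be handed to a scan or an input format. The sketch below only does that expansion. It assumes the key layout discussed in this thread (bin_number#type_of_events#rev_timestamp#id), a 3-digit zero-padded bin and a 19-digit zero-padded reverse timestamp; the function name and the padding widths are assumptions, not anything HBase provides.

```python
def bin_ranges(event_type, rev_start, rev_stop, num_bins=256):
    """Expand one logical range into a (start_row, stop_row) pair per bin.

    rev_start is inclusive and rev_stop is exclusive, both expressed in
    reverse-timestamp space. Zero-padding keeps byte order equal to
    numeric order, so each pair is a valid start/stop row for a scan.
    """
    ranges = []
    for bin_number in range(num_bins):
        prefix = f"{bin_number:03d}#{event_type}#"
        start_row = f"{prefix}{rev_start:019d}"
        stop_row = f"{prefix}{rev_stop:019d}"
        ranges.append((start_row, stop_row))
    return ranges
```

Each pair could then become its own scan (or its own entry in a ranges file, as suggested above). Whether a pair maps to one region or several depends on how the table is split.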
Re: One-table w/ multi-CF or multi-table w/ one-CF?
BTW, a little explanation about the binning I mentioned.

Currently the rowkey looks like type_of_events#rev_timestamp#id.

And with binning, it looks like bin_number#type_of_events#rev_timestamp#id. The bin_number could be id % 256 or timestamp % 256. And the table could be pre-split. So future ingestions could do parallel insertion to #bin regions, even without pre-split.

Jianshi
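A minimal sketch of the binned row key described above, assuming bin_number = id % 256, a reverse timestamp computed against the largest signed 64-bit value, and zero-padded numeric fields so that keys sort correctly as strings. The helper names, the padding widths and the choice of `id` over `timestamp` for the bin are illustrative assumptions, not details given in the thread.

```python
MAX_TS = 2**63 - 1  # assumed base for the reverse timestamp
NUM_BINS = 256


def rev_timestamp(ts_millis):
    """Newer events get smaller values, so they sort first within a bin."""
    return MAX_TS - ts_millis


def row_key(event_type, ts_millis, entity_id, num_bins=NUM_BINS):
    """Build bin_number#type_of_events#rev_timestamp#id as a string key."""
    bin_number = entity_id % num_bins
    # Zero-padding keeps lexicographic order equal to numeric order.
    return f"{bin_number:03d}#{event_type}#{rev_timestamp(ts_millis):019d}#{entity_id}"


def split_points(num_bins=NUM_BINS):
    """One split point per bin boundary, which yields num_bins regions."""
    return [f"{b:03d}" for b in range(1, num_bins)]
```

Because consecutive ids land in different bins, concurrent writers spread across the pre-split regions instead of all hitting the region that holds the newest timestamps. The cost is that a time-range read has to fan out across every bin.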
Re: One-table w/ multi-CF or multi-table w/ one-CF?
Again, a silly question. Why are you using column families? Just to play devil’s advocate in terms of design, why are you not treating your row as a record? Think hierarchal not relational. This really gets in to some design theory. Think Column Family as a way to group data that has the same row key, reference the same thing, yet the data in each column family is used separately. The example I always turn to when teaching, is to think of an order entry system at a retailer. You generate data which is segmented by business process. (order entry, pick slips, shipping, invoicing) All reflect a single order, yet the data in each process tends to be accessed separately. (You don’t need the order entry when using the pick slip to pull orders from the warehouse.) So here, the data access pattern is that each column family is used separately, except in generating the data (the order entry is used to generate the pick slip(s) and set up things like backorders and then the pick process generates the shipping slip(s) etc … And since they are all focused on the same order, they have the same row key. So its reasonable to ask how you are accessing the data and how you are designing your HBase model? Many times, developers create a model using column families because the developer is thinking in terms of relationships. Not access patterns on the data. Does this make sense? On Sep 6, 2014, at 7:46 PM, Jianshi Huang jianshi.hu...@gmail.com wrote: BTW, a little explanation about the binning I mentioned. Currently the rowkey looks like type_of_events#rev_timestamp#id. And with binning, it looks like bin_number#type_of_events#rev_timestamp#id. The bin_number could be id % 256 or timestamp % 256. And the table could be pre-splitted. So future ingestions could do parallel insertion to #bin regions, even without pre-split. 
Jianshi On Sun, Sep 7, 2014 at 2:34 AM, Jianshi Huang jianshi.hu...@gmail.com wrote: Each range might span multiple regions, depending on the size of the data I want to scan for MR jobs. The ranges are dynamic, specified by the user, but the number of bins can be static (fixed when the table/schema is created). Jianshi On Sun, Sep 7, 2014 at 2:23 AM, Ted Yu yuzhih...@gmail.com wrote: bq. 16 to 256 ranges Would each range be within a single region, or may a range span regions? Are the ranges dynamic? Using the command line for multiple ranges would be out of the question. A file with ranges is needed. Cheers On Sat, Sep 6, 2014 at 11:18 AM, Jianshi Huang jianshi.hu...@gmail.com wrote: Thanks Ted for the reference. That's right, extend row.start and row.end to specify multiple ranges, and also getSplits. I would probably bin the event sequence CF into 16 to 256 bins, so 16 to 256 ranges. Jianshi On Sun, Sep 7, 2014 at 2:09 AM, Ted Yu yuzhih...@gmail.com wrote: Please refer to HBASE-5416 Filter on one CF and if a match, then load and return full row bq. to extend TableInputFormat to accept multiple row ranges You mean extending hbase.mapreduce.scan.row.start and hbase.mapreduce.scan.row.stop so that multiple ranges can be specified? How many such ranges do you normally need? Cheers On Sat, Sep 6, 2014 at 11:01 AM, Jianshi Huang jianshi.hu...@gmail.com wrote: Thanks Ted, I'll pre-split the table during ingestion. The reason to keep the rowkey monotonic is to make it easier to work with TableInputFormat; otherwise I would've binned it into 256 splits. (Well, I think a good way is to extend TableInputFormat to accept multiple row ranges. If there's an existing efficient implementation, please let me know :) Would you elaborate a little more on the heap memory usage during scan? Is there any reference for that? 
Jianshi On Sun, Sep 7, 2014 at 1:20 AM, Ted Yu yuzhih...@gmail.com wrote: If you use monotonically increasing rowkeys, separating out the column family into a new table would give you the same issue you're facing today. With a single table, the essential column family feature would reduce the amount of heap memory used during a scan. With two tables, there is no such facility. Cheers On Sat, Sep 6, 2014 at 10:11 AM, Jianshi Huang jianshi.hu...@gmail.com wrote: Hi Ted, Yes, that's the table having RegionTooBusyExceptions :) But the performance I care about most is scan performance. It's mostly for analytics, so I don't care much about atomicity currently. What's your suggestion? Jianshi On Sun, Sep 7, 2014 at 1:08 AM, Ted Yu yuzhih...@gmail.com wrote: Is this the same table you mentioned in the thread about RegionTooBusyException? If you move the column family to another table, you may have to handle atomicity yourself; currently atomic operations are within region boundaries. Cheers On Sat, Sep 6, 2014 at 9:49 AM, Jianshi Huang jianshi.hu...@gmail.com wrote: Hi, I'm currently putting everything
Re: One-table w/ multi-CF or multi-table w/ one-CF?
Hi Michael, Thanks for the questions. I'm modeling dynamic graphs in HBase. All elements (vertices, edges) have a timestamp, and I can query things like events between A and B for the last 7 days. CFs are used for grouping different types of data for the same account. However, I have a lot of skew in the data; to avoid having too much data in the same row, I had to move what was in CQs into RKs. So a CF now acts more like a table. There's one CF containing a sequence of events ordered by timestamp, and this CF is quite different, as its use case is mostly in MapReduce jobs. Jianshi On Sun, Sep 7, 2014 at 4:52 AM, Michael Segel michael_se...@hotmail.com wrote: Again, a silly question. Why are you using column families? Just to play devil’s advocate in terms of design, why are you not treating your row as a record? Think hierarchical, not relational. This really gets into some design theory. Think of a column family as a way to group data that has the same row key and references the same thing, yet where the data in each column family is used separately. The example I always turn to when teaching is an order entry system at a retailer. You generate data which is segmented by business process (order entry, pick slips, shipping, invoicing). All reflect a single order, yet the data in each process tends to be accessed separately. (You don’t need the order entry when using the pick slip to pull orders from the warehouse.) So here, the data access pattern is that each column family is used separately, except when generating the data (the order entry is used to generate the pick slip(s) and set up things like backorders, and then the pick process generates the shipping slip(s), etc.). And since they are all focused on the same order, they have the same row key. So it’s reasonable to ask how you are accessing the data and how you are designing your HBase model. Many times, developers create a model using column families because the developer is thinking in terms of relationships. 
Not access patterns on the data. Does this make sense? On Sep 6, 2014, at 7:46 PM, Jianshi Huang jianshi.hu...@gmail.com wrote: BTW, a little explanation about the binning I mentioned. Currently the rowkey looks like type_of_events#rev_timestamp#id. With binning, it looks like bin_number#type_of_events#rev_timestamp#id. The bin_number could be id % 256 or timestamp % 256, and the table could be pre-split. So future ingestions could do parallel insertion to #bin regions, even without pre-split. Jianshi On Sun, Sep 7, 2014 at 2:34 AM, Jianshi Huang jianshi.hu...@gmail.com wrote: Each range might span multiple regions, depending on the size of the data I want to scan for MR jobs. The ranges are dynamic, specified by the user, but the number of bins can be static (fixed when the table/schema is created). Jianshi On Sun, Sep 7, 2014 at 2:23 AM, Ted Yu yuzhih...@gmail.com wrote: bq. 16 to 256 ranges Would each range be within a single region, or may a range span regions? Are the ranges dynamic? Using the command line for multiple ranges would be out of the question. A file with ranges is needed. Cheers On Sat, Sep 6, 2014 at 11:18 AM, Jianshi Huang jianshi.hu...@gmail.com wrote: Thanks Ted for the reference. That's right, extend row.start and row.end to specify multiple ranges, and also getSplits. I would probably bin the event sequence CF into 16 to 256 bins, so 16 to 256 ranges. Jianshi On Sun, Sep 7, 2014 at 2:09 AM, Ted Yu yuzhih...@gmail.com wrote: Please refer to HBASE-5416 Filter on one CF and if a match, then load and return full row bq. to extend TableInputFormat to accept multiple row ranges You mean extending hbase.mapreduce.scan.row.start and hbase.mapreduce.scan.row.stop so that multiple ranges can be specified? How many such ranges do you normally need? Cheers On Sat, Sep 6, 2014 at 11:01 AM, Jianshi Huang jianshi.hu...@gmail.com wrote: Thanks Ted, I'll pre-split the table during ingestion. 
The reason to keep the rowkey monotonic is to make it easier to work with TableInputFormat; otherwise I would've binned it into 256 splits. (Well, I think a good way is to extend TableInputFormat to accept multiple row ranges. If there's an existing efficient implementation, please let me know :) Would you elaborate a little more on the heap memory usage during scan? Is there any reference for that? Jianshi On Sun, Sep 7, 2014 at 1:20 AM, Ted Yu yuzhih...@gmail.com wrote: If you use monotonically increasing rowkeys, separating out the column family into a new table would give you the same issue you're facing today. With a single table, the essential column family feature would reduce the amount of heap memory used during a scan. With two tables, there is no such facility. Cheers On Sat, Sep 6, 2014 at 10:11 AM,