Re: Rowkey design question

2015-04-17 Thread Michael Segel
. Hope this helps. Please let us know how it goes. -- Lars From: Kristoffer Sjögren sto...@gmail.com To: user@hbase.apache.org Sent: Wednesday, April 8, 2015 6:41 AM Subject: Re: Rowkey design question Yes, I think you're right. Adding one or more

Re: Rowkey design question

2015-04-12 Thread lars hofhansl
...@gmail.com To: user@hbase.apache.org Sent: Wednesday, April 8, 2015 6:41 AM Subject: Re: Rowkey design question Yes, I think you're right. Adding one or more dimensions to the rowkey would indeed make the table narrower. And I guess it also makes sense to store actual values (bigger qualifiers

Re: Rowkey design question

2015-04-11 Thread Sean Busbey
: Thursday, April 9, 2015 4:53 PM Subject: Re: Rowkey design question On Thu, Apr 9, 2015 at 2:26 PM, Michael Segel michael_se...@hotmail.com wrote: Hint: You could have sandboxed the end user code which makes it a lot easier to manage. I filed the fucking JIRA for that. Look

Re: Rowkey design question

2015-04-11 Thread Kevin O'dell
Trying to figure out the best place to jump in here... Kristoffer, I would like to echo what Michael and Andrew have said. While a pre-aggregation co-proc may work in my experience with co-procs they are typically more trouble than they are worth. I would first try this outside the client

Re: Rowkey design question

2015-04-11 Thread Andrew Purtell
. -- Lars From: Andrew Purtell apurt...@apache.org To: user@hbase.apache.org user@hbase.apache.org Sent: Thursday, April 9, 2015 4:53 PM Subject: Re: Rowkey design question On Thu, Apr 9, 2015 at 2:26 PM, Michael Segel

Re: Rowkey design question

2015-04-11 Thread Michael Segel
apurt...@apache.org To: user@hbase.apache.org user@hbase.apache.org Sent: Thursday, April 9, 2015 4:53 PM Subject: Re: Rowkey design question On Thu, Apr 9, 2015 at 2:26 PM, Michael Segel michael_se...@hotmail.com wrote: Hint: You could have sandboxed the end user code which makes it a lot

Re: Rowkey design question

2015-04-09 Thread Michael Segel
Ok… Coprocessors are poorly implemented in HBase. If you work in a secure environment, outside of the system coprocessors… (ones that you load from hbase-site.xml) , you don’t want to use them. (The coprocessor code runs on the same JVM as the RS.) This means that if you have a poorly

Re: Rowkey design question

2015-04-09 Thread Kristoffer Sjögren
An HBase coprocessor. My idea is to move as much pre-aggregation as possible to where the data lives in the region servers, instead of doing it in the client. If there is good data locality inside and across rows within regions then I would expect aggregation to be faster in the coprocessor

Re: Rowkey design question

2015-04-09 Thread Michael Segel
Andrew, In a nutshell running end user code within the RS JVM is a bad design. To be clear, this is not just my opinion… I just happen to be more vocal about it. ;-) We’ve covered this ground before and just because the code runs doesn’t mean it’s good. Or that the design is good. I would

Re: Rowkey design question

2015-04-09 Thread Andrew Purtell
This is one person's opinion, to which he is absolutely entitled, but blanket black-and-white statements like "coprocessors are poorly implemented" are obviously not an opinion shared by all those who have used them successfully, nor by the HBase committers, or we would remove the feature. On the

Re: Rowkey design question

2015-04-08 Thread Michael Segel
When you say coprocessor, do you mean HBase coprocessors or do you mean a physical hardware coprocessor? In terms of queries… HBase can perform a single get() and return the result back quickly. (The size of the data being returned will impact the overall timing.) HBase also caches the

Re: Rowkey design question

2015-04-08 Thread Michael Segel
Ok… First, I’d suggest you rethink your schema by adding an additional dimension. You’ll end up with more rows, but a narrower table. In terms of compaction… if the data is relatively static, you won’t have compactions because nothing changed. But if your data is that static… why not put

Re: Rowkey design question

2015-04-08 Thread Kristoffer Sjögren
I just read through the HBase MOB design document and one thing that caught my attention was the following statement. When HBase deals with large numbers of values from 100KB up to ~10MB of data, it encounters performance degradations due to write amplification caused by splits and compactions. Is

Re: Rowkey design question

2015-04-08 Thread Kristoffer Sjögren
A small set of qualifiers will be accessed frequently so keeping them in block cache would be very beneficial. Some very seldom. So this sounds very promising! The reason why I'm considering a coprocessor is that I need to provide very specific information in the query request. Same thing with

Re: Rowkey design question

2015-04-08 Thread Kristoffer Sjögren
Yes, I think you're right. Adding one or more dimensions to the rowkey would indeed make the table narrower. And I guess it also makes sense to store actual values (bigger qualifiers) outside HBase. Keeping them in Hadoop, why not? Pulling hot ones out on SSD caches would be an interesting

Re: Rowkey design question

2015-04-08 Thread Michael Segel
I think you misunderstood. The suggestion was to put the data in to HDFS sequence files and to use HBase to store an index in to the file. (URL to the file, then offset in to the file for the start of the record…) The reason you want to do this is that you’re reading in large amounts of data
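Michael's suggestion amounts to storing a pointer rather than the payload: the HBase cell holds the sequence-file URL plus a byte offset to the start of the record. A sketch of such a pointer value; the encoding below ([4-byte URL length][URL bytes][8-byte offset]) is an illustrative assumption, not a format from the thread:

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Encode/decode an index entry pointing into an HDFS sequence file.
// Value layout (assumed): [4-byte url length][url bytes][8-byte offset]
class HdfsPointer {
    static byte[] encode(String fileUrl, long offset) {
        byte[] url = fileUrl.getBytes(StandardCharsets.UTF_8);
        return ByteBuffer.allocate(4 + url.length + 8)
                .putInt(url.length).put(url).putLong(offset).array();
    }

    // Returns { String fileUrl, Long offset }.
    static Object[] decode(byte[] value) {
        ByteBuffer buf = ByteBuffer.wrap(value);
        byte[] url = new byte[buf.getInt()];
        buf.get(url);
        return new Object[] { new String(url, StandardCharsets.UTF_8), buf.getLong() };
    }
}
```

The reader then does a cheap HBase get for the pointer and streams the large record directly from HDFS, keeping the big values out of HBase's write path (and out of compactions).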

Re: Rowkey design question

2015-04-08 Thread Kristoffer Sjögren
But if the coprocessor is omitted then CPU cycles from region servers are lost, so where would the query execution go? Queries need to be quick (sub-second rather than seconds) and HDFS is quite latency hungry, unless there are optimizations that I'm unaware of? On Wed, Apr 8, 2015 at 7:43

Rowkey design question

2015-04-07 Thread Kristoffer Sjögren
Hi I have a row with around 100,000 qualifiers with mostly small values around 1-5KB and maybe 5 larger ones around 1-5 MB. A coprocessor does random access of 1-10 qualifiers per row. I would like to understand how HBase loads the data into memory. Will the entire row be loaded or only the

Re: Rowkey design question

2015-04-07 Thread Imants Cekusins
how HBase loads the data into memory. If you init Get and specify columns with addColumn, it is likely that only data for these columns is read and loaded in memory. Rowkey is best kept short. So are column qualifiers.

Re: Rowkey design question

2015-04-07 Thread Kristoffer Sjögren
Sorry I should have explained my use case a bit more. Yes, it's a pretty big row and it's close to worst case. Normally there would be fewer qualifiers and the largest qualifiers would be smaller. The reason why these rows get big is that they store aggregated data in indexed compressed

Re: Rowkey design question

2015-04-07 Thread Michael Segel
Sorry, but your initial problem statement doesn’t seem to parse … Are you saying that you have a single row with approximately 100,000 elements where each element is roughly 1-5KB in size and in addition there are ~5 elements which will be between one and five MB in size? And you then mention a

Re: Rowkey design question

2015-04-07 Thread Nick Dimiduk
Those rows are written out into HBase blocks on cell boundaries. Your column family has a BLOCK_SIZE attribute, which you may or may not have overridden from the default of 64k. Cells are written into a block until it is >= the target block size. So your single 500mb row will be broken down into
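Nick's point about blocks can be sanity-checked with some quick arithmetic. The helper below is a sketch (the 500 MB row and 64 KB block size are the numbers from this thread; in practice blocks can slightly exceed the target since cells are never split across blocks):

```java
// Rough estimate of how many HFile blocks a row of a given size spans.
// Cells are written into a block until it reaches the target BLOCK_SIZE,
// so the count is approximately the ceiling of rowBytes / blockSize.
class BlockMath {
    static long approxBlocks(long rowBytes, long blockSize) {
        return (rowBytes + blockSize - 1) / blockSize; // ceiling division
    }
}
```

For a 500 MB row with the default 64 KB block size, that works out to roughly 8000 blocks, which is why reading a small subset of qualifiers (with block cache and `addColumn`) matters so much here.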

Re: Rowkey design question

2013-02-21 Thread Asaf Mesika
An easier way is to place one byte before the timestamp which is called a bucket. You can calculate it by using modulo on the timestamp by the number of buckets. We are now in the process of field testing it. On Tuesday, February 19, 2013, Paul van Hoven wrote: Yeah it worked fine. But as
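The one-byte bucket scheme Asaf describes can be sketched in plain Java. The bucket count (8) and the exact key layout below are illustrative assumptions, not details from the thread:

```java
import java.nio.ByteBuffer;

// Salted rowkey sketch: one bucket byte (timestamp modulo NUM_BUCKETS)
// followed by the 8-byte big-endian timestamp.
class BucketedKey {
    static final int NUM_BUCKETS = 8;

    // Bucket is the timestamp modulo the number of buckets.
    static byte bucketFor(long timestamp) {
        return (byte) Math.floorMod(timestamp, NUM_BUCKETS);
    }

    // Rowkey layout: [1 salt byte][8-byte big-endian timestamp]
    static byte[] rowkey(long timestamp) {
        return ByteBuffer.allocate(9)
                .put(bucketFor(timestamp))
                .putLong(timestamp)
                .array();
    }
}
```

Because the bucket is derived deterministically from the timestamp, writers spread across regions while readers can still recompute which bucket a given timestamp landed in.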

Re: Rowkey design question

2013-02-21 Thread Mohammad Tariq
Another good point. Warm Regards, Tariq https://mtariq.jux.com/ cloudfront.blogspot.com On Fri, Feb 22, 2013 at 3:45 AM, Asaf Mesika asaf.mes...@gmail.com wrote: An easier way is to place one byte before the timestamp which is called a bucket. You can calculate it by using modulo on the

Rowkey design question

2013-02-19 Thread Paul van Hoven
Hi, I'm currently playing with hbase. The design of the rowkey seems to be critical. The rowkey for a certain database table of mine is: timestamp+ipaddress It looks something like this when performing a scan on the table in the shell: hbase(main):012:0> scan 'ToyDataTable' ROW

Re: Rowkey design question

2013-02-19 Thread Mohammad Tariq
Hello Paul, Try this and see if it works : scan.setStartRow(Bytes.toBytes(startDate.getTime() + "")); scan.setStopRow(Bytes.toBytes(endDate.getTime() + 1 + "")); Also try not to use TS as the rowkey, as it may lead to RS hotspotting. Just add a hash to your rowkeys so that data is
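Tariq's snippet encodes the timestamps as strings. A common alternative (a sketch, not code from this thread) is to encode them as fixed-width big-endian longs, so that HBase's lexicographic byte ordering matches numeric timestamp order, which variable-length decimal strings do not guarantee:

```java
import java.nio.ByteBuffer;

// Build scan start/stop rows for a time range using fixed-width
// big-endian longs (this mirrors what HBase's Bytes.toBytes(long) does).
// Valid for non-negative timestamps, where signed big-endian bytes
// compare the same as the numeric values.
class TimeRangeKeys {
    static byte[] toBytes(long ts) {
        return ByteBuffer.allocate(8).putLong(ts).array();
    }

    // The stop row of a scan is exclusive, so use endTs + 1
    // to include cells at endTs itself.
    static byte[][] scanRange(long startTs, long endTs) {
        return new byte[][] { toBytes(startTs), toBytes(endTs + 1) };
    }
}
```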

Re: Rowkey design question

2013-02-19 Thread Paul van Hoven
Hey Tariq, thanks for your quick answer. I'm not sure if I got the idea in the second part of your answer. You mean if I use a timestamp as a rowkey I should append a hash like this: 135727920+MD5HASH and then the data would be distributed more equally? 2013/2/19 Mohammad Tariq

Re: Rowkey design question

2013-02-19 Thread Mohammad Tariq
No, before the timestamp. All the row keys which are identical go to the same region. This is the default HBase behavior and is meant to make the performance better. But sometimes a machine gets overloaded because reads and writes get concentrated on that particular machine. For example

Re: Rowkey design question

2013-02-19 Thread Paul van Hoven
Yeah it worked fine. But as I understand: if I prefix my row key with something like md5-hash + timestamp then the rowkeys are probably evenly distributed, but how would I then perform a scan restricted to a specific time range? 2013/2/19 Mohammad Tariq donta...@gmail.com: No. before the
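Paul's question has a standard answer when the prefix is a small, bounded salt (as in Asaf's bucket scheme later in the thread, rather than a full MD5 prefix): issue one scan per possible prefix and merge the results client-side. A sketch, with bucket count and key layout as illustrative assumptions:

```java
import java.nio.ByteBuffer;

// Time-range scanning over salted rowkeys: one (startRow, stopRow)
// pair per bucket, each with that bucket's salt byte prepended to the
// same timestamp bounds. The caller runs one scan per pair and merges.
class SaltedScanRanges {
    // Rowkey layout: [1 salt byte][8-byte big-endian timestamp]
    static byte[] key(int bucket, long ts) {
        return ByteBuffer.allocate(9).put((byte) bucket).putLong(ts).array();
    }

    static byte[][][] ranges(int numBuckets, long startTs, long stopTs) {
        byte[][][] out = new byte[numBuckets][][];
        for (int b = 0; b < numBuckets; b++) {
            out[b] = new byte[][] { key(b, startTs), key(b, stopTs) };
        }
        return out;
    }
}
```

With a full hash prefix this trick is impossible (there are too many prefixes to enumerate), which is why a small fixed number of buckets is the usual compromise between write distribution and scannability.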

Re: Rowkey design question

2013-02-19 Thread Mohammad Tariq
You can use FuzzyRowFilter (http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/filter/FuzzyRowFilter.html) to do that. Have a look at this link (http://blog.sematext.com/2012/08/09/consider-using-fuzzyrowfilter-when-in-need-for-secondary-indexes-in-hbase/). You might find it helpful. Warm
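The matching rule behind FuzzyRowFilter can be illustrated standalone: a mask marks each rowkey byte as fixed (0, must equal the pattern byte) or non-fixed (1, any byte matches), which lets a scan skip a leading salt byte while pinning the timestamp portion. This sketch only mimics the semantics; the real filter is org.apache.hadoop.hbase.filter.FuzzyRowFilter:

```java
// Minimal illustration of fuzzy rowkey matching.
// mask[i] == 0 means row[i] must equal pattern[i];
// mask[i] == 1 means row[i] may be anything.
class FuzzyMatch {
    static boolean matches(byte[] row, byte[] pattern, byte[] mask) {
        if (row.length < pattern.length) return false;
        for (int i = 0; i < pattern.length; i++) {
            if (mask[i] == 0 && row[i] != pattern[i]) return false;
        }
        return true;
    }
}
```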