Ok… Coprocessors are poorly implemented in HBase. If you work in a secure environment, you don't want to use them, apart from the system coprocessors (the ones you load from hbase-site.xml). The coprocessor code runs in the same JVM as the region server, which means that a poorly written coprocessor will kill performance for all of HBase. If you're not using them in a secure environment, you still have to consider how they are going to be used.
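For reference, the system coprocessors mentioned above are registered in hbase-site.xml rather than per table; a minimal sketch (the observer class name here is a placeholder, not a real class):

```xml
<!-- hbase-site.xml: system coprocessors are loaded into every
     region server's JVM at startup, for all tables. The class
     name below is a placeholder for illustration. -->
<property>
  <name>hbase.coprocessor.region.classes</name>
  <value>com.example.MyRegionObserver</value>
</property>
```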
Without really knowing more about your use case… it's impossible to say if the coprocessor would be a good idea. It sounds like you may have an unrealistic expectation of how well HBase performs.

HTH

-Mike

> On Apr 9, 2015, at 1:05 AM, Kristoffer Sjögren <sto...@gmail.com> wrote:
>
> An HBase coprocessor. My idea is to move as much pre-aggregation as
> possible to where the data lives in the region servers, instead of doing
> it in the client. If there is good data locality inside and across rows
> within regions, then I would expect aggregation to be faster in the
> coprocessor (utilizing many region servers in parallel) rather than
> transferring data over the network from multiple region servers to a
> single client that would do the same calculation on its own.
>
> On Thu, Apr 9, 2015 at 4:43 AM, Michael Segel <michael_se...@hotmail.com> wrote:
>
>> When you say coprocessor, do you mean HBase coprocessors or do you mean
>> a physical hardware coprocessor?
>>
>> In terms of queries…
>>
>> HBase can perform a single get() and return the result quickly. (The
>> size of the data being returned will impact the overall timing.)
>>
>> HBase also caches the results, so your first hit will take the longest,
>> but as long as the row is cached, the results are returned quickly.
>>
>> If you're trying to do a scan with a start/stop row set, your timing
>> could vary between sub-second and minutes depending on the query.
>>
>>> On Apr 8, 2015, at 3:10 PM, Kristoffer Sjögren <sto...@gmail.com> wrote:
>>>
>>> But if the coprocessor is omitted, then CPU cycles from region servers
>>> are lost, so where would the query execution go?
>>>
>>> Queries need to be quick (sub-second rather than seconds), and HDFS is
>>> quite latency hungry, unless there are optimizations that I'm unaware of?
>>>
>>> On Wed, Apr 8, 2015 at 7:43 PM, Michael Segel <michael_se...@hotmail.com> wrote:
>>>
>>>> I think you misunderstood.
>>>>
>>>> The suggestion was to put the data into HDFS sequence files and to use
>>>> HBase to store an index into the file (URL to the file, then offset
>>>> into the file for the start of the record…).
>>>>
>>>> The reason you want to do this is that you're reading in large amounts
>>>> of data, and it's more efficient to do this from HDFS than through HBase.
>>>>
>>>>> On Apr 8, 2015, at 8:41 AM, Kristoffer Sjögren <sto...@gmail.com> wrote:
>>>>>
>>>>> Yes, I think you're right. Adding one or more dimensions to the rowkey
>>>>> would indeed make the table narrower.
>>>>>
>>>>> And I guess it also makes sense to store the actual values (bigger
>>>>> qualifiers) outside HBase. Why not keep them in Hadoop? Pulling hot
>>>>> ones out onto SSD caches would be an interesting solution. And quite
>>>>> a bit simpler.
>>>>>
>>>>> Good call and thanks for the tip! :-)
>>>>>
>>>>> On Wed, Apr 8, 2015 at 1:45 PM, Michael Segel <michael_se...@hotmail.com> wrote:
>>>>>
>>>>>> Ok…
>>>>>>
>>>>>> First, I'd suggest you rethink your schema by adding an additional
>>>>>> dimension. You'll end up with more rows, but a narrower table.
>>>>>>
>>>>>> In terms of compaction… if the data is relatively static, you won't
>>>>>> have compactions because nothing changed. But if your data is that
>>>>>> static… why not put the data in sequence files and use HBase as the
>>>>>> index? Could be faster.
>>>>>>
>>>>>> HTH
>>>>>>
>>>>>> -Mike
>>>>>>
>>>>>>> On Apr 8, 2015, at 3:26 AM, Kristoffer Sjögren <sto...@gmail.com> wrote:
>>>>>>>
>>>>>>> I just read through the HBase MOB design document, and one thing
>>>>>>> that caught my attention was the following statement:
>>>>>>>
>>>>>>> "When HBase deals with large numbers of values > 100kb and up to
>>>>>>> ~10MB of data, it encounters performance degradations due to write
>>>>>>> amplification caused by splits and compactions."
>>>>>>>
>>>>>>> Is there any chance to run into this problem in the read path for
>>>>>>> data that is written infrequently and never changed?
>>>>>>>
>>>>>>> On Wed, Apr 8, 2015 at 9:30 AM, Kristoffer Sjögren <sto...@gmail.com> wrote:
>>>>>>>
>>>>>>>> A small set of qualifiers will be accessed frequently, so keeping
>>>>>>>> them in the block cache would be very beneficial. Some very seldom.
>>>>>>>> So this sounds very promising!
>>>>>>>>
>>>>>>>> The reason why I'm considering a coprocessor is that I need to
>>>>>>>> provide very specific information in the query request. Same thing
>>>>>>>> with the response. Queries are also highly parallelizable across
>>>>>>>> rows, and each individual query produces a valid result that may
>>>>>>>> or may not be aggregated with other results in the client, maybe
>>>>>>>> even inside the region if it contained multiple rows targeted by
>>>>>>>> the query.
>>>>>>>>
>>>>>>>> So it's a bit like Phoenix, but with a different storage format and
>>>>>>>> query engine.
>>>>>>>>
>>>>>>>> On Wed, Apr 8, 2015 at 12:46 AM, Nick Dimiduk <ndimi...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Those rows are written out into HBase blocks on cell boundaries.
>>>>>>>>> Your column family has a BLOCK_SIZE attribute, which you may or
>>>>>>>>> may not have overridden from the default of 64k. Cells are written
>>>>>>>>> into a block until it is >= the target block size. So your single
>>>>>>>>> 500MB row will be broken down into thousands of HFile blocks in
>>>>>>>>> some number of HFiles. Some of those blocks may contain just a
>>>>>>>>> cell or two and be a couple MB in size, to hold the largest of
>>>>>>>>> your cells. Those blocks will be loaded into the Block Cache as
>>>>>>>>> they're accessed.
>>>>>>>>> If you're careful with your access patterns and only request
>>>>>>>>> cells that you need to evaluate, you'll only ever load the blocks
>>>>>>>>> containing those cells into the cache.
>>>>>>>>>
>>>>>>>>>> Will the entire row be loaded or only the qualifiers I ask for?
>>>>>>>>>
>>>>>>>>> So then, the answer to your question is: it depends on how you're
>>>>>>>>> interacting with the row from your coprocessor. The read path will
>>>>>>>>> only load blocks that your scanner requests. If your coprocessor
>>>>>>>>> is producing a scanner that seeks to specific qualifiers, you'll
>>>>>>>>> only load those blocks.
>>>>>>>>>
>>>>>>>>> Related question: is there a reason you're using a coprocessor
>>>>>>>>> instead of a regular filter, or a simple qualified get/scan, to
>>>>>>>>> access data from these rows? The "default stuff" is already tuned
>>>>>>>>> to load data sparsely, as would be desirable for your schema.
>>>>>>>>>
>>>>>>>>> -n
>>>>>>>>>
>>>>>>>>> On Tue, Apr 7, 2015 at 2:22 PM, Kristoffer Sjögren <sto...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Sorry, I should have explained my use case a bit more.
>>>>>>>>>>
>>>>>>>>>> Yes, it's a pretty big row, and it's "close" to the worst case.
>>>>>>>>>> Normally there would be fewer qualifiers, and the largest
>>>>>>>>>> qualifiers would be smaller.
>>>>>>>>>>
>>>>>>>>>> The reason why these rows get big is that they store aggregated
>>>>>>>>>> data in indexed, compressed form. This format allows for
>>>>>>>>>> extremely fast queries (on the local disk format) over billions
>>>>>>>>>> of rows (not rows in HBase speak) when touching smaller areas of
>>>>>>>>>> the data. If I stored the data as regular HBase rows, things
>>>>>>>>>> would get very slow unless I had many, many region servers.
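Nick's block arithmetic above can be sketched with the thread's approximate numbers (5KB cells, the 64k default BLOCK_SIZE). These are rough estimates for illustration, not measurements of an actual HFile:

```python
BLOCK_SIZE = 64 * 1024   # default column-family BLOCK_SIZE
CELL_SIZE = 5 * 1024     # assumed size of a typical qualifier's value
NUM_CELLS = 100_000      # qualifiers in the fat row

# Cells are appended to a block until it reaches the target size,
# so each block holds roughly ceil(BLOCK_SIZE / CELL_SIZE) cells.
cells_per_block = -(-BLOCK_SIZE // CELL_SIZE)
total_blocks = -(-NUM_CELLS // cells_per_block)

# A qualified read of 10 cells touches at most 10 blocks, while a
# full-row read eventually touches all of them.
ten_cell_bytes = 10 * BLOCK_SIZE
full_row_bytes = total_blocks * BLOCK_SIZE

print(cells_per_block, total_blocks, ten_cell_bytes, full_row_bytes)
```

The point of the sketch is the ratio: a qualified get for a handful of cells pulls a few hundred KB into the block cache, a few orders of magnitude less than scanning the whole 500MB row.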
>>>>>>>>>>
>>>>>>>>>> The coprocessor is used for doing custom queries on the indexed
>>>>>>>>>> data inside the region servers. These queries are not like a
>>>>>>>>>> regular row scan, but are very specific as to how the data is
>>>>>>>>>> formatted within each column qualifier.
>>>>>>>>>>
>>>>>>>>>> Yes, this is not possible if HBase loads the whole 500MB each
>>>>>>>>>> time I want to perform this custom query on a row. Hence my
>>>>>>>>>> question :-)
>>>>>>>>>>
>>>>>>>>>> On Tue, Apr 7, 2015 at 11:03 PM, Michael Segel <michael_se...@hotmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Sorry, but your initial problem statement doesn't seem to parse…
>>>>>>>>>>>
>>>>>>>>>>> Are you saying that you have a single row with approximately
>>>>>>>>>>> 100,000 elements, where each element is roughly 1-5KB in size,
>>>>>>>>>>> and in addition there are ~5 elements which will be between one
>>>>>>>>>>> and five MB in size?
>>>>>>>>>>>
>>>>>>>>>>> And you then mention a coprocessor?
>>>>>>>>>>>
>>>>>>>>>>> Just looking at the numbers… 100K * 5KB means that each row
>>>>>>>>>>> would end up being 500MB in size.
>>>>>>>>>>>
>>>>>>>>>>> That's a pretty fat row.
>>>>>>>>>>>
>>>>>>>>>>> I would suggest rethinking your strategy.
>>>>>>>>>>>
>>>>>>>>>>>> On Apr 7, 2015, at 11:13 AM, Kristoffer Sjögren <sto...@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Hi
>>>>>>>>>>>>
>>>>>>>>>>>> I have a row with around 100,000 qualifiers, mostly small
>>>>>>>>>>>> values around 1-5KB, and maybe 5 larger ones around 1-5MB. A
>>>>>>>>>>>> coprocessor does random access of 1-10 qualifiers per row.
>>>>>>>>>>>>
>>>>>>>>>>>> I would like to understand how HBase loads the data into memory.
>>>>>>>>>>>> Will the entire row be loaded, or only the qualifiers I ask
>>>>>>>>>>>> for (like pointer access into a direct ByteBuffer)?
>>>>>>>>>>>>
>>>>>>>>>>>> Cheers,
>>>>>>>>>>>> -Kristoffer
>>>>>>>>>>>
>>>>>>>>>>> The opinions expressed here are mine, while they may reflect a
>>>>>>>>>>> cognitive thought, that is purely accidental.
>>>>>>>>>>> Use at your own risk.
>>>>>>>>>>> Michael Segel
>>>>>>>>>>> michael_segel (AT) hotmail.com
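The "sequence files plus an HBase index" suggestion earlier in the thread boils down to storing a small (file URL, offset, length) pointer as the cell value and reading the bulk data directly from HDFS. A minimal sketch of such an index entry; the byte layout is an assumption for illustration, not an HBase or Hadoop API:

```python
import struct

def encode_index_entry(path: str, offset: int, length: int) -> bytes:
    """Pack an HDFS file reference into one HBase cell value:
    4-byte path length, UTF-8 path, then two big-endian longs
    for the record's byte offset and length."""
    p = path.encode("utf-8")
    return struct.pack(">I", len(p)) + p + struct.pack(">qq", offset, length)

def decode_index_entry(value: bytes):
    """Inverse of encode_index_entry; a reader would feed the offset
    to something like SequenceFile.Reader.seek() and read from there."""
    (plen,) = struct.unpack_from(">I", value, 0)
    path = value[4:4 + plen].decode("utf-8")
    offset, length = struct.unpack_from(">qq", value, 4 + plen)
    return path, offset, length

# Hypothetical example entry pointing into an aggregate file.
entry = encode_index_entry("hdfs://nn/data/agg-00042.seq", 1_048_576, 4096)
print(decode_index_entry(entry))
```

The design point is that the HBase cell stays tiny (a few dozen bytes), so gets are cheap and cacheable, while the multi-megabyte values live in HDFS where large sequential reads are efficient.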