Andrew,

In a nutshell, running end-user code within the RS JVM is a bad design. To be clear, this is not just my opinion… I just happen to be more vocal about it. ;-)

We've covered this ground before, and just because the code runs doesn't mean it's good. Or that the design is good.
I would love to see how you can justify HBase as being secure when you have end-user code running in the same JVM as the RS. I can think of several ways to hack HBase security because of this…

Note: I'm not saying server-side extensibility is bad; I'm saying how it was implemented was bad. Hint: you could have sandboxed the end-user code, which makes it a lot easier to manage. MapR has avoided this in their MapRDB; they're adding the extensibility in a different manner, and this issue is nothing new.

And yes, you've hit the nail on the head. Rethink your design if you want to use coprocessors, and use them as a last resort.

> On Apr 9, 2015, at 3:02 PM, Andrew Purtell <apurt...@apache.org> wrote:
>
> This is one person's opinion, to which he is absolutely entitled, but
> blanket black-and-white statements like "coprocessors are poorly
> implemented" are obviously not an opinion shared by all those who have
> used them successfully, nor the HBase committers, or we would remove the
> feature. On the other hand, you should really ask yourself if in-server
> extension is necessary. That should be a last resort, really, for the
> security and performance considerations Michael mentions.
>
> On Thu, Apr 9, 2015 at 5:05 AM, Michael Segel <michael_se...@hotmail.com> wrote:
>
>> Ok…
>> Coprocessors are poorly implemented in HBase.
>> If you work in a secure environment, outside of the system coprocessors
>> (the ones you load from hbase-site.xml), you don't want to use them. (The
>> coprocessor code runs in the same JVM as the RS.) This means that if you
>> have a poorly written coprocessor, you will kill performance for all of
>> HBase. If you're not using them in a secure environment, you have to
>> consider how they are going to be used.
>>
>> Without really knowing more about your use case, it's impossible to say
>> if a coprocessor would be a good idea.
>>
>> It sounds like you may have an unrealistic expectation as to how well
>> HBase performs.
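For reference, the "system coprocessors" mentioned above are the ones registered cluster-wide in hbase-site.xml rather than attached per-table; a minimal sketch of that configuration (the observer class name here is a hypothetical placeholder):

```
<!-- hbase-site.xml: region coprocessors loaded into every RegionServer JVM -->
<property>
  <name>hbase.coprocessor.region.classes</name>
  <!-- com.example.MyRegionObserver is a placeholder; list real observer
       classes, comma-separated, that are on the RegionServer classpath -->
  <value>com.example.MyRegionObserver</value>
</property>
```

Because these classes load into the RS JVM at startup, a bug in any of them affects every region the server hosts, which is the crux of the security and stability complaint.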
>>
>> HTH
>>
>> -Mike
>>
>>> On Apr 9, 2015, at 1:05 AM, Kristoffer Sjögren <sto...@gmail.com> wrote:
>>>
>>> An HBase coprocessor. My idea is to move as much pre-aggregation as
>>> possible to where the data lives in the region servers, instead of doing
>>> it in the client. If there is good data locality inside and across rows
>>> within regions, then I would expect aggregation to be faster in the
>>> coprocessor (utilizing many region servers in parallel) rather than
>>> transferring data over the network from multiple region servers to a
>>> single client that would do the same calculation on its own.
>>>
>>> On Thu, Apr 9, 2015 at 4:43 AM, Michael Segel <michael_se...@hotmail.com> wrote:
>>>
>>>> When you say coprocessor, do you mean HBase coprocessors or do you mean
>>>> a physical hardware coprocessor?
>>>>
>>>> In terms of queries…
>>>>
>>>> HBase can perform a single get() and return the result quickly. (The
>>>> size of the data being returned will impact the overall timing.)
>>>>
>>>> HBase also caches the results, so your first hit will take the longest,
>>>> but as long as the row is cached, the results are returned quickly.
>>>>
>>>> If you're trying to do a scan with a start/stop row set, your timing
>>>> could vary between sub-second and minutes, depending on the query.
>>>>
>>>>> On Apr 8, 2015, at 3:10 PM, Kristoffer Sjögren <sto...@gmail.com> wrote:
>>>>>
>>>>> But if the coprocessor is omitted, then CPU cycles from region servers
>>>>> are lost, so where would the query execution go?
>>>>>
>>>>> Queries need to be quick (sub-second rather than seconds) and HDFS is
>>>>> quite latency-hungry, unless there are optimizations that I'm unaware of?
>>>>>
>>>>> On Wed, Apr 8, 2015 at 7:43 PM, Michael Segel <michael_se...@hotmail.com> wrote:
>>>>>
>>>>>> I think you misunderstood.
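The start/stop-row scan described above is a range scan over rowkeys kept in sorted order. A rough stand-in model (plain JDK, not the HBase client API) of the inclusive-start, exclusive-stop semantics, using a sorted map:

```java
import java.util.SortedMap;
import java.util.TreeMap;

public class ScanRangeModel {
    public static void main(String[] args) {
        // HBase keeps rows sorted by rowkey; a TreeMap models that ordering.
        TreeMap<String, String> table = new TreeMap<>();
        table.put("row-001", "a");
        table.put("row-002", "b");
        table.put("row-003", "c");
        table.put("row-010", "d");

        // A scan with start row "row-002" and stop row "row-010" returns
        // rows >= start and < stop -- the same contract as subMap here.
        SortedMap<String, String> scanned = table.subMap("row-002", "row-010");
        System.out.println(scanned.keySet()); // [row-002, row-003]
    }
}
```

The cost of such a scan grows with how many rows (and blocks) fall inside the range, which is why timing can swing from sub-second to minutes.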
>>>>>>
>>>>>> The suggestion was to put the data into HDFS sequence files and to use
>>>>>> HBase to store an index into the file (the URL of the file, then the
>>>>>> offset into the file for the start of the record…).
>>>>>>
>>>>>> The reason you want to do this is that you're reading in large amounts
>>>>>> of data, and it's more efficient to do this from HDFS than through HBase.
>>>>>>
>>>>>>> On Apr 8, 2015, at 8:41 AM, Kristoffer Sjögren <sto...@gmail.com> wrote:
>>>>>>>
>>>>>>> Yes, I think you're right. Adding one or more dimensions to the rowkey
>>>>>>> would indeed make the table narrower.
>>>>>>>
>>>>>>> And I guess it also makes sense to store the actual values (bigger
>>>>>>> qualifiers) outside HBase. Keeping them in Hadoop, why not? Pulling hot
>>>>>>> ones out onto SSD caches would be an interesting solution. And quite a
>>>>>>> bit simpler.
>>>>>>>
>>>>>>> Good call and thanks for the tip! :-)
>>>>>>>
>>>>>>> On Wed, Apr 8, 2015 at 1:45 PM, Michael Segel <michael_se...@hotmail.com> wrote:
>>>>>>>
>>>>>>>> Ok…
>>>>>>>>
>>>>>>>> First, I'd suggest you rethink your schema by adding an additional
>>>>>>>> dimension.
>>>>>>>> You'll end up with more rows, but a narrower table.
>>>>>>>>
>>>>>>>> In terms of compaction… if the data is relatively static, you won't
>>>>>>>> have compactions, because nothing changed.
>>>>>>>> But if your data is that static… why not put the data in sequence
>>>>>>>> files and use HBase as the index? Could be faster.
>>>>>>>>
>>>>>>>> HTH
>>>>>>>>
>>>>>>>> -Mike
>>>>>>>>
>>>>>>>>> On Apr 8, 2015, at 3:26 AM, Kristoffer Sjögren <sto...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>> I just read through the HBase MOB design document, and one thing
>>>>>>>>> that caught my attention was the following statement.
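The "HBase as an index into HDFS files" pattern suggested above can be sketched locally. A plain `RandomAccessFile` and a `HashMap` stand in for the HDFS sequence file and the HBase index table (keys, record contents, and file layout here are all illustrative assumptions):

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;

public class IndexedFileLookup {
    public static void main(String[] args) throws IOException {
        Path data = Files.createTempFile("records", ".dat");
        // In the real pattern, HBase rows would hold (file URL, offset,
        // length); a HashMap stands in for that index table here.
        Map<String, long[]> index = new HashMap<>();

        // Write records sequentially, remembering each record's offset.
        try (RandomAccessFile out = new RandomAccessFile(data.toFile(), "rw")) {
            for (String rec : new String[]{"alpha", "bravo", "charlie"}) {
                byte[] bytes = rec.getBytes(StandardCharsets.UTF_8);
                index.put(rec.substring(0, 1),
                          new long[]{out.getFilePointer(), bytes.length});
                out.write(bytes);
            }
        }

        // Lookup: one small index read, then a direct seek into the file,
        // so the bulk data never flows through the HBase read path.
        long[] loc = index.get("b");
        try (RandomAccessFile in = new RandomAccessFile(data.toFile(), "r")) {
            in.seek(loc[0]);
            byte[] buf = new byte[(int) loc[1]];
            in.readFully(buf);
            System.out.println(new String(buf, StandardCharsets.UTF_8)); // bravo
        }
        Files.delete(data);
    }
}
```

With HDFS sequence files the seek would go through `SequenceFile.Reader` instead, but the shape of the lookup (tiny index read, then one positioned read of the large record) is the same.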
>>>>>>>>>
>>>>>>>>> "When HBase deals with large numbers of values > 100kb and up to
>>>>>>>>> ~10MB of data, it encounters performance degradations due to write
>>>>>>>>> amplification caused by splits and compactions."
>>>>>>>>>
>>>>>>>>> Is there any chance of running into this problem in the read path,
>>>>>>>>> for data that is written infrequently and never changed?
>>>>>>>>>
>>>>>>>>> On Wed, Apr 8, 2015 at 9:30 AM, Kristoffer Sjögren <sto...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> A small set of qualifiers will be accessed frequently, so keeping
>>>>>>>>>> them in the block cache would be very beneficial. Some very seldom.
>>>>>>>>>> So this sounds very promising!
>>>>>>>>>>
>>>>>>>>>> The reason why I'm considering a coprocessor is that I need to
>>>>>>>>>> provide very specific information in the query request. Same thing
>>>>>>>>>> with the response. Queries are also highly parallelizable across
>>>>>>>>>> rows, and each individual query produces a valid result that may or
>>>>>>>>>> may not be aggregated with other results in the client, maybe even
>>>>>>>>>> inside the region if it contained multiple rows targeted by the query.
>>>>>>>>>>
>>>>>>>>>> So it's a bit like Phoenix, but with a different storage format and
>>>>>>>>>> query engine.
>>>>>>>>>>
>>>>>>>>>> On Wed, Apr 8, 2015 at 12:46 AM, Nick Dimiduk <ndimi...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Those rows are written out into HBase blocks on cell boundaries.
>>>>>>>>>>> Your column family has a BLOCK_SIZE attribute, which you may or may
>>>>>>>>>>> not have overridden from the default of 64k. Cells are written into
>>>>>>>>>>> a block until it is >= the target block size. So your single 500MB
>>>>>>>>>>> row will be broken down into thousands of HFile blocks in some
>>>>>>>>>>> number of HFiles.
>>>>>>>>>>> Some of those blocks may contain just a cell or two and be a couple
>>>>>>>>>>> of MB in size, to hold the largest of your cells. Those blocks will
>>>>>>>>>>> be loaded into the Block Cache as they're accessed. If you're
>>>>>>>>>>> careful with your access patterns and only request cells that you
>>>>>>>>>>> need to evaluate, you'll only ever load the blocks containing those
>>>>>>>>>>> cells into the cache.
>>>>>>>>>>>
>>>>>>>>>>>> Will the entire row be loaded or only the qualifiers I ask for?
>>>>>>>>>>>
>>>>>>>>>>> So then, the answer to your question is: it depends on how you're
>>>>>>>>>>> interacting with the row from your coprocessor. The read path will
>>>>>>>>>>> only load blocks that your scanner requests. If your coprocessor is
>>>>>>>>>>> producing a scanner that seeks to specific qualifiers, you'll only
>>>>>>>>>>> load those blocks.
>>>>>>>>>>>
>>>>>>>>>>> Related question: is there a reason you're using a coprocessor
>>>>>>>>>>> instead of a regular filter, or a simple qualified get/scan, to
>>>>>>>>>>> access data from these rows? The "default stuff" is already tuned
>>>>>>>>>>> to load data sparsely, as would be desirable for your schema.
>>>>>>>>>>>
>>>>>>>>>>> -n
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Apr 7, 2015 at 2:22 PM, Kristoffer Sjögren <sto...@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Sorry, I should have explained my use case a bit more.
>>>>>>>>>>>>
>>>>>>>>>>>> Yes, it's a pretty big row, and it's "close" to the worst case.
>>>>>>>>>>>> Normally there would be fewer qualifiers, and the largest
>>>>>>>>>>>> qualifiers would be smaller.
>>>>>>>>>>>>
>>>>>>>>>>>> The reason why these rows get big is that they store aggregated
>>>>>>>>>>>> data in indexed, compressed form.
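Nick's block-packing description can be modeled with a back-of-the-envelope simulation: cells accumulate into a block until the target size is reached, so a single oversized cell still gets a block of its own. The packing rule and cell sizes below are illustrative assumptions, not HBase internals:

```java
import java.util.ArrayList;
import java.util.List;

public class BlockPackingModel {
    // Pack cell sizes into blocks: a block is closed once its running
    // total reaches the target size, so a 5MB cell fills one big block.
    static List<Long> packIntoBlocks(long[] cellSizes, long targetBlockSize) {
        List<Long> blocks = new ArrayList<>();
        long current = 0;
        for (long size : cellSizes) {
            current += size;
            if (current >= targetBlockSize) {
                blocks.add(current);
                current = 0;
            }
        }
        if (current > 0) blocks.add(current);
        return blocks;
    }

    public static void main(String[] args) {
        long target = 64 * 1024; // the default BLOCK_SIZE of 64k
        // Rough model of the row in question: ~100,000 cells of ~5KB
        // plus 5 large cells of ~5MB each.
        long[] cells = new long[100_005];
        for (int i = 0; i < 100_000; i++) cells[i] = 5 * 1024;
        for (int i = 100_000; i < 100_005; i++) cells[i] = 5L * 1024 * 1024;

        List<Long> blocks = packIntoBlocks(cells, target);
        System.out.println("blocks for one row: " + blocks.size());
        // A qualified get touching ~10 cells only pulls ~10 of these
        // blocks into the cache, not the whole ~500MB row.
    }
}
```

This is why sparse, qualified access stays cheap even on a very fat row: the unit of I/O and caching is the block, not the row.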
>>>>>>>>>>>> This format allows for extremely fast queries (on the local disk
>>>>>>>>>>>> format) over billions of rows (not rows in HBase speak) when
>>>>>>>>>>>> touching smaller areas of the data. If I stored the data as
>>>>>>>>>>>> regular HBase rows, things would get very slow unless I had many,
>>>>>>>>>>>> many region servers.
>>>>>>>>>>>>
>>>>>>>>>>>> The coprocessor is used for doing custom queries on the indexed
>>>>>>>>>>>> data inside the region servers. These queries are not like a
>>>>>>>>>>>> regular row scan, but very specific as to how the data is
>>>>>>>>>>>> formatted within each column qualifier.
>>>>>>>>>>>>
>>>>>>>>>>>> Yes, this is not possible if HBase loads the whole 500MB each
>>>>>>>>>>>> time I want to perform this custom query on a row. Hence my
>>>>>>>>>>>> question :-)
>>>>>>>>>>>>
>>>>>>>>>>>> On Tue, Apr 7, 2015 at 11:03 PM, Michael Segel <michael_se...@hotmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Sorry, but your initial problem statement doesn't seem to parse…
>>>>>>>>>>>>>
>>>>>>>>>>>>> Are you saying that you have a single row with approximately
>>>>>>>>>>>>> 100,000 elements, where each element is roughly 1-5KB in size,
>>>>>>>>>>>>> and in addition there are ~5 elements which will be between one
>>>>>>>>>>>>> and five MB in size?
>>>>>>>>>>>>>
>>>>>>>>>>>>> And you then mention a coprocessor?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Just looking at the numbers… 100K * 5KB means that each row
>>>>>>>>>>>>> would end up being 500MB in size.
>>>>>>>>>>>>>
>>>>>>>>>>>>> That's a pretty fat row.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I would suggest rethinking your strategy.
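Michael's sizing arithmetic, spelled out (he appears to use 1KB = 1,000 bytes; with 1KiB = 1,024 bytes the worst case is ~488MiB, and the five 1-5MB values add at most ~25MB more):

```java
public class RowSizeEstimate {
    public static void main(String[] args) {
        long qualifiers = 100_000;      // ~100K elements in the row
        long avgValueBytes = 5 * 1000;  // ~5KB per value, worst case
        long rowBytes = qualifiers * avgValueBytes;
        System.out.println(rowBytes);   // 500000000 bytes, i.e. ~500MB
    }
}
```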
>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Apr 7, 2015, at 11:13 AM, Kristoffer Sjögren <sto...@gmail.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I have a row with around 100,000 qualifiers, mostly small
>>>>>>>>>>>>>> values of around 1-5KB, and maybe 5 larger ones of around
>>>>>>>>>>>>>> 1-5MB. A coprocessor does random access of 1-10 qualifiers
>>>>>>>>>>>>>> per row.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I would like to understand how HBase loads the data into
>>>>>>>>>>>>>> memory. Will the entire row be loaded, or only the qualifiers
>>>>>>>>>>>>>> I ask for (like pointer access into a direct ByteBuffer)?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>> -Kristoffer
> --
> Best regards,
>
> - Andy
>
> Problems worthy of attack prove their worth by hitting back. - Piet Hein
> (via Tom White)

The opinions expressed here are mine, while they may reflect a cognitive thought, that is purely accidental.
Use at your own risk.
Michael Segel
michael_segel (AT) hotmail.com