But if the coprocessor is omitted, the CPU cycles of the region servers go unused, so where would the query execution go?
Queries need to be quick (sub-second rather than seconds) and HDFS latency
is quite high, unless there are optimizations that I'm unaware of?

On Wed, Apr 8, 2015 at 7:43 PM, Michael Segel <michael_se...@hotmail.com> wrote:

> I think you misunderstood.
>
> The suggestion was to put the data into HDFS sequence files and to use
> HBase to store an index into the file (URL to the file, then offset into
> the file for the start of the record…).
>
> The reason you want to do this is that you're reading in large amounts of
> data, and it's more efficient to do this from HDFS than through HBase.
>
> > On Apr 8, 2015, at 8:41 AM, Kristoffer Sjögren <sto...@gmail.com> wrote:
> >
> > Yes, I think you're right. Adding one or more dimensions to the rowkey
> > would indeed make the table narrower.
> >
> > And I guess it also makes sense to store actual values (bigger
> > qualifiers) outside HBase. Keeping them in Hadoop, why not? Pulling hot
> > ones out on SSD caches would be an interesting solution. And quite a
> > bit simpler.
> >
> > Good call and thanks for the tip! :-)
> >
> > On Wed, Apr 8, 2015 at 1:45 PM, Michael Segel <michael_se...@hotmail.com> wrote:
> >
> >> Ok…
> >>
> >> First, I'd suggest you rethink your schema by adding an additional
> >> dimension. You'll end up with more rows, but a narrower table.
> >>
> >> In terms of compaction… if the data is relatively static, you won't
> >> have compactions because nothing changed.
> >> But if your data is that static… why not put the data in sequence
> >> files and use HBase as the index? Could be faster.
> >>
> >> HTH
> >>
> >> -Mike
> >>
> >>> On Apr 8, 2015, at 3:26 AM, Kristoffer Sjögren <sto...@gmail.com> wrote:
> >>>
> >>> I just read through the HBase MOB design document, and one thing that
> >>> caught my attention was the following statement.
> >>>
> >>> "When HBase deals with large numbers of values > 100kb and up to
> >>> ~10MB of data, it encounters performance degradations due to write
> >>> amplification caused by splits and compactions."
> >>>
> >>> Is there any chance to run into this problem in the read path for
> >>> data that is written infrequently and never changed?
> >>>
> >>> On Wed, Apr 8, 2015 at 9:30 AM, Kristoffer Sjögren <sto...@gmail.com> wrote:
> >>>
> >>>> A small set of qualifiers will be accessed frequently, so keeping
> >>>> them in the block cache would be very beneficial. Some very seldom.
> >>>> So this sounds very promising!
> >>>>
> >>>> The reason why I'm considering a coprocessor is that I need to
> >>>> provide very specific information in the query request. Same thing
> >>>> with the response. Queries are also highly parallelizable across
> >>>> rows, and each individual query produces a valid result that may or
> >>>> may not be aggregated with other results in the client, maybe even
> >>>> inside the region if it contained multiple rows targeted by the
> >>>> query.
> >>>>
> >>>> So it's a bit like Phoenix, but with a different storage format and
> >>>> query engine.
> >>>>
> >>>> On Wed, Apr 8, 2015 at 12:46 AM, Nick Dimiduk <ndimi...@gmail.com> wrote:
> >>>>
> >>>>> Those rows are written out into HBase blocks on cell boundaries.
> >>>>> Your column family has a BLOCK_SIZE attribute, which you may or may
> >>>>> not have overridden from the default of 64k. Cells are written into
> >>>>> a block until it is >= the target block size. So your single 500MB
> >>>>> row will be broken down into thousands of HFile blocks in some
> >>>>> number of HFiles. Some of those blocks may contain just a cell or
> >>>>> two and be a couple MB in size, to hold the largest of your cells.
> >>>>> Those blocks will be loaded into the Block Cache as they're
> >>>>> accessed.
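Nick's description of cell-to-block packing can be modeled in a few lines. This is a toy sketch in plain Python, not the real HFile writer, using the cell sizes from the use case discussed in this thread (100,000 cells of ~5KB plus ~5 cells of ~5MB):

```python
# Illustrative model of how cells pack into HFile blocks per Nick's
# description: cells are appended to the current block, and the block is
# closed once its size reaches the target BLOCK_SIZE. A cell larger than
# BLOCK_SIZE therefore closes (and dominates) a block of its own.
# This is a sketch only, not actual HBase code.

BLOCK_SIZE = 64 * 1024  # the default of 64k mentioned above

def pack_into_blocks(cell_sizes, block_size=BLOCK_SIZE):
    blocks, current = [], 0
    for size in cell_sizes:
        current += size
        if current >= block_size:   # close the block once target reached
            blocks.append(current)
            current = 0
    if current:
        blocks.append(current)      # final partially filled block
    return blocks

# The row from this thread: 100,000 cells of ~5KB plus 5 cells of ~5MB.
cells = [5 * 1024] * 100_000 + [5 * 1024 * 1024] * 5
blocks = pack_into_blocks(cells)

print(len(blocks))                    # thousands of blocks for one ~500MB row
print(max(blocks) // (1024 * 1024))   # the largest blocks are several MB
```

This matches the statement that one fat row shatters into thousands of blocks, a few of which are multi-megabyte because of the large cells.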
> >>>>> If you're careful with your access patterns and only request cells
> >>>>> that you need to evaluate, you'll only ever load the blocks
> >>>>> containing those cells into the cache.
> >>>>>
> >>>>>> Will the entire row be loaded or only the qualifiers I ask for?
> >>>>>
> >>>>> So then, the answer to your question is: it depends on how you're
> >>>>> interacting with the row from your coprocessor. The read path will
> >>>>> only load blocks that your scanner requests. If your coprocessor is
> >>>>> producing a scanner that seeks to specific qualifiers, you'll only
> >>>>> load those blocks.
> >>>>>
> >>>>> Related question: is there a reason you're using a coprocessor
> >>>>> instead of a regular filter, or a simple qualified get/scan to
> >>>>> access data from these rows? The "default stuff" is already tuned
> >>>>> to load data sparsely, as would be desirable for your schema.
> >>>>>
> >>>>> -n
> >>>>>
> >>>>> On Tue, Apr 7, 2015 at 2:22 PM, Kristoffer Sjögren <sto...@gmail.com> wrote:
> >>>>>
> >>>>>> Sorry, I should have explained my use case a bit more.
> >>>>>>
> >>>>>> Yes, it's a pretty big row and it's "close" to the worst case.
> >>>>>> Normally there would be fewer qualifiers, and the largest
> >>>>>> qualifiers would be smaller.
> >>>>>>
> >>>>>> The reason why these rows get big is that they store aggregated
> >>>>>> data in indexed, compressed form. This format allows for extremely
> >>>>>> fast queries (on the local disk format) over billions of rows (not
> >>>>>> rows in HBase speak) when touching smaller areas of the data. If I
> >>>>>> stored the data as regular HBase rows, things would get very slow
> >>>>>> unless I had many, many region servers.
> >>>>>>
> >>>>>> The coprocessor is used for doing custom queries on the indexed
> >>>>>> data inside the region servers.
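The effect of a qualified read that Nick describes, loading only the blocks that hold the requested qualifiers, can be sketched the same way. The names here (`build_block_index`, `qualified_get`) are hypothetical and not HBase client APIs; the point is only that a per-block index lets a few-qualifier request touch a few blocks out of thousands:

```python
# Toy model of a qualified read: a row's cells live in many blocks, and
# a block index maps each qualifier to the block holding it. Requesting
# a handful of qualifiers loads only those blocks, never the whole row.
# All names are illustrative, not part of any HBase API.

def build_block_index(qualifiers_per_block):
    """Map qualifier -> block id, given each block's qualifier list."""
    index = {}
    for block_id, quals in enumerate(qualifiers_per_block):
        for q in quals:
            index[q] = block_id
    return index

def qualified_get(wanted, index, cache):
    """Return the block ids that must be loaded to serve `wanted`."""
    needed = {index[q] for q in wanted if q in index}
    to_load = needed - cache     # blocks not already in the block cache
    cache |= to_load             # they stay cached for later reads
    return to_load

# 1,000 blocks of 100 qualifiers each -> 100,000 qualifiers in one row.
blocks = [[f"q{b * 100 + i}" for i in range(100)] for b in range(1000)]
index = build_block_index(blocks)

cache = set()
loaded = qualified_get(["q5", "q42", "q99999"], index, cache)
print(sorted(loaded))   # → [0, 999]: two blocks loaded, 998 untouched
```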
> >>>>>> These queries are not like a regular row scan, but very specific
> >>>>>> as to how the data is formatted within each column qualifier.
> >>>>>>
> >>>>>> Yes, this is not possible if HBase loads the whole 500MB each time
> >>>>>> I want to perform this custom query on a row. Hence my question :-)
> >>>>>>
> >>>>>> On Tue, Apr 7, 2015 at 11:03 PM, Michael Segel <michael_se...@hotmail.com> wrote:
> >>>>>>
> >>>>>>> Sorry, but your initial problem statement doesn't seem to parse…
> >>>>>>>
> >>>>>>> Are you saying that you have a single row with approximately
> >>>>>>> 100,000 elements, where each element is roughly 1-5KB in size,
> >>>>>>> and in addition there are ~5 elements which will be between one
> >>>>>>> and five MB in size?
> >>>>>>>
> >>>>>>> And you then mention a coprocessor?
> >>>>>>>
> >>>>>>> Just looking at the numbers… 100K * 5KB means that each row would
> >>>>>>> end up being 500MB in size.
> >>>>>>>
> >>>>>>> That's a pretty fat row.
> >>>>>>>
> >>>>>>> I would suggest rethinking your strategy.
> >>>>>>>
> >>>>>>>> On Apr 7, 2015, at 11:13 AM, Kristoffer Sjögren <sto...@gmail.com> wrote:
> >>>>>>>>
> >>>>>>>> Hi
> >>>>>>>>
> >>>>>>>> I have a row with around 100,000 qualifiers: mostly small values
> >>>>>>>> around 1-5KB, and maybe 5 larger ones around 1-5 MB. A
> >>>>>>>> coprocessor does random access of 1-10 qualifiers per row.
> >>>>>>>>
> >>>>>>>> I would like to understand how HBase loads the data into memory.
> >>>>>>>> Will the entire row be loaded or only the qualifiers I ask for
> >>>>>>>> (like pointer access into a direct ByteBuffer)?
> >>>>>>>>
> >>>>>>>> Cheers,
> >>>>>>>> -Kristoffer
> >>>>>>>
> >>>>>>> The opinions expressed here are mine, while they may reflect a
> >>>>>>> cognitive thought, that is purely accidental.
> >>>>>>> Use at your own risk.
> >>>>>>> Michael Segel
> >>>>>>> michael_segel (AT) hotmail.com
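The sequence-file-plus-index idea from earlier in the thread, storing large values in HDFS files and keeping only a (file, offset) pointer in HBase, can be sketched with a local file standing in for an HDFS SequenceFile and a dict standing in for the HBase index table. All names here are illustrative, under those assumptions:

```python
# Sketch of the suggestion above: append length-prefixed records to a
# flat file (standing in for an HDFS SequenceFile) and keep a small
# index of (offset, length) per key (standing in for the HBase index
# table, which would store the file URL plus the offset). Reads then
# seek straight to one record instead of pulling a fat HBase row.
import os
import struct
import tempfile

def write_records(path, records):
    """Append length-prefixed records; return {key: (offset, length)}."""
    index = {}
    with open(path, "wb") as f:
        for key, payload in records.items():
            offset = f.tell()
            f.write(struct.pack(">I", len(payload)))  # 4-byte length prefix
            f.write(payload)
            index[key] = (offset, len(payload))
    return index

def read_record(path, offset, length):
    """Seek to the record start and read exactly one payload."""
    with open(path, "rb") as f:
        f.seek(offset)
        (stored_len,) = struct.unpack(">I", f.read(4))
        assert stored_len == length   # index and file must agree
        return f.read(stored_len)

path = os.path.join(tempfile.mkdtemp(), "data.seq")
index = write_records(path, {"q1": b"small", "q2": b"x" * 1024})

off, length = index["q2"]   # this tuple is what HBase would store per key
print(read_record(path, off, length) == b"x" * 1024)   # True
```

The read cost is one seek plus one sequential read of the record, independent of how much other data shares the file, which is why the bulk-read path through HDFS can beat serving the same bytes through HBase.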