But if the coprocessor is omitted, the CPU cycles of the region servers go unused, so where would the query execution go?
Queries need to be quick (sub-second rather than seconds) and HDFS latency
is quite high, unless there are optimizations that I'm unaware of?

On Wed, Apr 8, 2015 at 7:43 PM, Michael Segel <michael_se...@hotmail.com> wrote:

> I think you misunderstood.
>
> The suggestion was to put the data into HDFS sequence files and to use
> HBase to store an index into the file (URL to the file, then offset into
> the file for the start of the record…).
>
> The reason you want to do this is that you're reading in large amounts of
> data, and it's more efficient to do this from HDFS than through HBase.
>
> > On Apr 8, 2015, at 8:41 AM, Kristoffer Sjögren <sto...@gmail.com> wrote:
> >
> > Yes, I think you're right. Adding one or more dimensions to the rowkey
> > would indeed make the table narrower.
> >
> > And I guess it also makes sense to store actual values (bigger
> > qualifiers) outside HBase. Keeping them in Hadoop, why not? Pulling hot
> > ones out on SSD caches would be an interesting solution. And quite a
> > bit simpler.
> >
> > Good call and thanks for the tip! :-)
> >
> > On Wed, Apr 8, 2015 at 1:45 PM, Michael Segel <michael_se...@hotmail.com> wrote:
> >
> >> Ok…
> >>
> >> First, I'd suggest you rethink your schema by adding an additional
> >> dimension. You'll end up with more rows, but a narrower table.
> >>
> >> In terms of compaction… if the data is relatively static, you won't
> >> have compactions because nothing changed.
> >> But if your data is that static… why not put the data in sequence
> >> files and use HBase as the index? Could be faster.
> >>
> >> HTH
> >>
> >> -Mike
> >>
> >>> On Apr 8, 2015, at 3:26 AM, Kristoffer Sjögren <sto...@gmail.com> wrote:
> >>>
> >>> I just read through the HBase MOB design document, and one thing that
> >>> caught my attention was the following statement.
> >>>
> >>> "When HBase deals with large numbers of values > 100kb and up to
> >>> ~10MB of data, it encounters performance degradations due to write
> >>> amplification caused by splits and compactions."
> >>>
> >>> Is there any chance to run into this problem in the read path for
> >>> data that is written infrequently and never changed?
> >>>
> >>> On Wed, Apr 8, 2015 at 9:30 AM, Kristoffer Sjögren <sto...@gmail.com> wrote:
> >>>
> >>>> A small set of qualifiers will be accessed frequently, so keeping
> >>>> them in the block cache would be very beneficial. Some very seldom.
> >>>> So this sounds very promising!
> >>>>
> >>>> The reason why I'm considering a coprocessor is that I need to
> >>>> provide very specific information in the query request. Same thing
> >>>> with the response. Queries are also highly parallelizable across
> >>>> rows, and each individual query produces a valid result that may or
> >>>> may not be aggregated with other results in the client, maybe even
> >>>> inside the region if it contained multiple rows targeted by the
> >>>> query.
> >>>>
> >>>> So it's a bit like Phoenix, but with a different storage format and
> >>>> query engine.
> >>>>
> >>>> On Wed, Apr 8, 2015 at 12:46 AM, Nick Dimiduk <ndimi...@gmail.com> wrote:
> >>>>
> >>>>> Those rows are written out into HBase blocks on cell boundaries.
> >>>>> Your column family has a BLOCK_SIZE attribute, which you may or may
> >>>>> not have overridden from the default of 64k. Cells are written into
> >>>>> a block until it is >= the target block size. So your single 500MB
> >>>>> row will be broken down into thousands of HFile blocks in some
> >>>>> number of HFiles. Some of those blocks may contain just a cell or
> >>>>> two and be a couple MB in size, to hold the largest of your cells.
> >>>>> Those blocks will be loaded into the Block Cache as they're
> >>>>> accessed.
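Nick's description of cell-to-block packing can be modeled in a few lines. This is a toy sketch in plain Python, not the real HFile writer, using the cell sizes from the use case discussed in this thread (100,000 cells of ~5KB plus ~5 cells of ~5MB):

```python
# Illustrative model of how cells pack into HFile blocks per Nick's
# description: cells are appended to the current block, and the block is
# closed once its size reaches the target BLOCK_SIZE. A cell larger than
# BLOCK_SIZE therefore closes (and dominates) a block of its own.
# This is a sketch only, not actual HBase code.

BLOCK_SIZE = 64 * 1024  # the default of 64k mentioned above

def pack_into_blocks(cell_sizes, block_size=BLOCK_SIZE):
    blocks, current = [], 0
    for size in cell_sizes:
        current += size
        if current >= block_size:   # close the block once target reached
            blocks.append(current)
            current = 0
    if current:
        blocks.append(current)      # final partially filled block
    return blocks

# The row from this thread: 100,000 cells of ~5KB plus 5 cells of ~5MB.
cells = [5 * 1024] * 100_000 + [5 * 1024 * 1024] * 5
blocks = pack_into_blocks(cells)

print(len(blocks))                    # thousands of blocks for one ~500MB row
print(max(blocks) // (1024 * 1024))   # the largest blocks are several MB
```

This matches the statement that one fat row shatters into thousands of blocks, a few of which are multi-megabyte because of the large cells.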
> >>>>> If you're careful with your access patterns and only request cells
> >>>>> that you need to evaluate, you'll only ever load the blocks
> >>>>> containing those cells into the cache.
> >>>>>
> >>>>>> Will the entire row be loaded or only the qualifiers I ask for?
> >>>>>
> >>>>> So then, the answer to your question is: it depends on how you're
> >>>>> interacting with the row from your coprocessor. The read path will
> >>>>> only load blocks that your scanner requests. If your coprocessor is
> >>>>> producing a scanner that seeks to specific qualifiers, you'll only
> >>>>> load those blocks.
> >>>>>
> >>>>> Related question: is there a reason you're using a coprocessor
> >>>>> instead of a regular filter, or a simple qualified get/scan to
> >>>>> access data from these rows? The "default stuff" is already tuned
> >>>>> to load data sparsely, as would be desirable for your schema.
> >>>>>
> >>>>> -n
> >>>>>
> >>>>> On Tue, Apr 7, 2015 at 2:22 PM, Kristoffer Sjögren <sto...@gmail.com> wrote:
> >>>>>
> >>>>>> Sorry, I should have explained my use case a bit more.
> >>>>>>
> >>>>>> Yes, it's a pretty big row and it's "close" to the worst case.
> >>>>>> Normally there would be fewer qualifiers, and the largest
> >>>>>> qualifiers would be smaller.
> >>>>>>
> >>>>>> The reason why these rows get big is that they store aggregated
> >>>>>> data in indexed, compressed form. This format allows for extremely
> >>>>>> fast queries (on the local disk format) over billions of rows (not
> >>>>>> rows in HBase speak) when touching smaller areas of the data. If I
> >>>>>> stored the data as regular HBase rows, things would get very slow
> >>>>>> unless I had many, many region servers.
> >>>>>>
> >>>>>> The coprocessor is used for doing custom queries on the indexed
> >>>>>> data inside the region servers.
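The effect of a qualified read that Nick describes, loading only the blocks that hold the requested qualifiers, can be sketched the same way. The names here (`build_block_index`, `qualified_get`) are hypothetical and not HBase client APIs; the point is only that a per-block index lets a few-qualifier request touch a few blocks out of thousands:

```python
# Toy model of a qualified read: a row's cells live in many blocks, and
# a block index maps each qualifier to the block holding it. Requesting
# a handful of qualifiers loads only those blocks, never the whole row.
# All names are illustrative, not part of any HBase API.

def build_block_index(qualifiers_per_block):
    """Map qualifier -> block id, given each block's qualifier list."""
    index = {}
    for block_id, quals in enumerate(qualifiers_per_block):
        for q in quals:
            index[q] = block_id
    return index

def qualified_get(wanted, index, cache):
    """Return the block ids that must be loaded to serve `wanted`."""
    needed = {index[q] for q in wanted if q in index}
    to_load = needed - cache     # blocks not already in the block cache
    cache |= to_load             # they stay cached for later reads
    return to_load

# 1,000 blocks of 100 qualifiers each -> 100,000 qualifiers in one row.
blocks = [[f"q{b * 100 + i}" for i in range(100)] for b in range(1000)]
index = build_block_index(blocks)

cache = set()
loaded = qualified_get(["q5", "q42", "q99999"], index, cache)
print(sorted(loaded))   # → [0, 999]: two blocks loaded, 998 untouched
```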
> >>>>>> These queries are not like a regular row scan, but very specific
> >>>>>> as to how the data is formatted within each column qualifier.
> >>>>>>
> >>>>>> Yes, this is not possible if HBase loads the whole 500MB each time
> >>>>>> I want to perform this custom query on a row. Hence my question :-)
> >>>>>>
> >>>>>> On Tue, Apr 7, 2015 at 11:03 PM, Michael Segel <michael_se...@hotmail.com> wrote:
> >>>>>>
> >>>>>>> Sorry, but your initial problem statement doesn't seem to parse…
> >>>>>>>
> >>>>>>> Are you saying that you have a single row with approximately
> >>>>>>> 100,000 elements, where each element is roughly 1-5KB in size,
> >>>>>>> and in addition there are ~5 elements which will be between one
> >>>>>>> and five MB in size?
> >>>>>>>
> >>>>>>> And you then mention a coprocessor?
> >>>>>>>
> >>>>>>> Just looking at the numbers… 100K * 5KB means that each row would
> >>>>>>> end up being 500MB in size.
> >>>>>>>
> >>>>>>> That's a pretty fat row.
> >>>>>>>
> >>>>>>> I would suggest rethinking your strategy.
> >>>>>>>
> >>>>>>>> On Apr 7, 2015, at 11:13 AM, Kristoffer Sjögren <sto...@gmail.com> wrote:
> >>>>>>>>
> >>>>>>>> Hi
> >>>>>>>>
> >>>>>>>> I have a row with around 100,000 qualifiers: mostly small values
> >>>>>>>> around 1-5KB, and maybe 5 larger ones around 1-5 MB. A
> >>>>>>>> coprocessor does random access of 1-10 qualifiers per row.
> >>>>>>>>
> >>>>>>>> I would like to understand how HBase loads the data into memory.
> >>>>>>>> Will the entire row be loaded or only the qualifiers I ask for
> >>>>>>>> (like pointer access into a direct ByteBuffer)?
> >>>>>>>>
> >>>>>>>> Cheers,
> >>>>>>>> -Kristoffer
> >>>>>>>
> >>>>>>> The opinions expressed here are mine, while they may reflect a
> >>>>>>> cognitive thought, that is purely accidental.
> >>>>>>> Use at your own risk.
> >>>>>>> Michael Segel
> >>>>>>> michael_segel (AT) hotmail.com
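The sequence-file-plus-index idea from earlier in the thread, storing large values in HDFS files and keeping only a (file, offset) pointer in HBase, can be sketched with a local file standing in for an HDFS SequenceFile and a dict standing in for the HBase index table. All names here are illustrative, under those assumptions:

```python
# Sketch of the suggestion above: append length-prefixed records to a
# flat file (standing in for an HDFS SequenceFile) and keep a small
# index of (offset, length) per key (standing in for the HBase index
# table, which would store the file URL plus the offset). Reads then
# seek straight to one record instead of pulling a fat HBase row.
import os
import struct
import tempfile

def write_records(path, records):
    """Append length-prefixed records; return {key: (offset, length)}."""
    index = {}
    with open(path, "wb") as f:
        for key, payload in records.items():
            offset = f.tell()
            f.write(struct.pack(">I", len(payload)))  # 4-byte length prefix
            f.write(payload)
            index[key] = (offset, len(payload))
    return index

def read_record(path, offset, length):
    """Seek to the record start and read exactly one payload."""
    with open(path, "rb") as f:
        f.seek(offset)
        (stored_len,) = struct.unpack(">I", f.read(4))
        assert stored_len == length   # index and file must agree
        return f.read(stored_len)

path = os.path.join(tempfile.mkdtemp(), "data.seq")
index = write_records(path, {"q1": b"small", "q2": b"x" * 1024})

off, length = index["q2"]   # this tuple is what HBase would store per key
print(read_record(path, off, length) == b"x" * 1024)   # True
```

The read cost is one seek plus one sequential read of the record, independent of how much other data shares the file, which is why the bulk-read path through HDFS can beat serving the same bytes through HBase.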