Re: hbase doubts

Shushant Arora Tue, 18 Aug 2015 21:55:40 -0700

And will using KeyPrefixRegionSplitPolicy instead of the default
IncreasingToUpperBoundRegionSplitPolicy help here?
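
For reference, my understanding is that KeyPrefixRegionSplitPolicy changes where a region is split, not when: it truncates the chosen split point to a fixed-length prefix so that rows sharing that prefix stay in one region. A minimal Python sketch of that idea follows. The function name and the sample key are hypothetical, and this is a model of the idea, not HBase code.

```python
def prefix_split_point(split_point: bytes, prefix_length: int) -> bytes:
    """Truncate a candidate split point to a fixed-length key prefix.

    Hypothetical helper: it models the idea of grouping rows by prefix,
    it is not the HBase implementation.
    """
    if prefix_length > 0 and split_point:
        return split_point[:prefix_length]
    return split_point


# With a 10-byte prefix (the yyyy-mm-dd part), a split point that falls in
# the middle of a day is moved back to the start of that day.
candidate = b"2015-08-04#7f3a"
print(prefix_split_point(candidate, 10))  # b'2015-08-04'
```

Under this model the last region still receives every new date, so the policy on its own would not be expected to change the half-filled pattern described in the quoted message.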


On Wed, Aug 19, 2015 at 10:23 AM, Shushant Arora <[email protected]>
wrote:

> When the last region gets new data and is split in two, what is the split
> point? Say the last region has 10 files and the split algorithm decides to
> split it.
>
> Will the two child regions get 5 files each? Or will the key space of the
> parent region, say (2015-08-01#guid to 2015-08-06#guid), be divided into
> two equal parts, with child1 covering (2015-08-01#guid to 2015-08-03#guid)
> and child2 covering (2015-08-04#guid to 2015-08-06#guid), and all data
> rewritten into the child regions to match these key ranges? In that case,
> since the data is a time series, new data will only arrive for dates after
> 2015-08-06, so it will all go to child2 and child1 will always stay half
> filled. Only child2 will lead to new splits once it reaches the split size
> threshold.
>
>
>
>
>
>
> On Wed, Aug 19, 2015 at 4:16 AM, Ted Yu <[email protected]> wrote:
>
>> Since year and month are part of the row key in this scenario (instead of
>> just the day of month), the last region would get new data and be split.
>>
>> Is this effect desirable for your app ?
>>
>> Cheers
>>
>> On Tue, Aug 18, 2015 at 12:55 PM, Shushant Arora <
>> [email protected]>
>> wrote:
>>
>> > For an HBase key with time as a prefix, say (yyyy-mm-dd#other fields of
>> > guid base), I am using bulk load to avoid hotspotting a regionserver
>> > (and to avoid writing to the WAL).
>> >
>> > What should the initial region splits be? Say I have 30 regionservers.
>> >
>> > Will it work to use the first 30 days as the initial splits and let auto
>> > split take care of splitting regions as they grow further?
>> > Or, since the key has the date as a prefix, when a region is split in two
>> > at the midpoint and new data arrives only for increasing dates, will one
>> > region always stay half filled, with the other half never filled?
>> >
>> > On Tue, Aug 18, 2015 at 9:41 PM, anil gupta <[email protected]>
>> wrote:
>> >
>> > > In my experience, Phoenix is far superior to Hive-HBase integration
>> > > for SQL-like querying on HBase. That is because Phoenix is built on
>> > > top of HBase, unlike Hive.
>> > >
>> > > On Tue, Aug 18, 2015 at 9:09 AM, Ted Yu <[email protected]> wrote:
>> > >
>> > > > To my knowledge, Phoenix provides better integration with hbase.
>> > > >
>> > > > A third possibility is Spark on HBase.
>> > > >
>> > > > If you want to explore these alternatives, I suggest asking on the
>> > > > respective mailing lists where you can get expert opinions.
>> > > >
>> > > > Cheers
>> > > >
>> > > > On Tue, Aug 18, 2015 at 9:03 AM, Shushant Arora <
>> > > [email protected]
>> > > > >
>> > > > wrote:
>> > > >
>> > > > > Thanks!
>> > > > >
>> > > > > Which one is better for SQL-like queries over HBase (queries
>> > > > > involving filters, key range scans, and aggregates by column values)?
>> > > > >
>> > > > > 1. Hive storage handlers
>> > > > > 2. Phoenix
>> > > > >
>> > > > > On Tue, Aug 18, 2015 at 9:14 PM, Ted Yu <[email protected]>
>> wrote:
>> > > > >
>> > > > > > For #1, if you want to count distinct values for F1, you can write
>> > > > > > a coprocessor which aggregates the count on the region server and
>> > > > > > returns the result to the client, which does the final aggregation.
>> > > > > >
>> > > > > > Take a look at
>> > > > > > hbase-server/src/main/java/org/apache/hadoop/hbase/coprocessor/AggregateImplementation.java
>> > > > > > and related classes for an example.
>> > > > > >
>> > > > > > On Mon, Aug 17, 2015 at 10:08 PM, Shushant Arora <
>> > > > > > [email protected]>
>> > > > > > wrote:
>> > > > > >
>> > > > > > > Thanks!
>> > > > > > > A few more doubts:
>> > > > > > >
>> > > > > > > 1. Say the requirement is to count distinct values of F1.
>> > > > > > >
>> > > > > > > If the field is part of the key, can't HBase just scan the keys,
>> > > > > > > skip value deserialisation, and return the result to the client,
>> > > > > > > which will calculate the distinct count? In the second approach,
>> > > > > > > HBase will deserialise the value of the column containing F1 and
>> > > > > > > return it to the client, which will calculate the distinct count.
>> > > > > > >
>> > > > > > > 2. For bulk load, when LoadIncrementalHFiles runs and the
>> > > > > > > regionserver moves the HFiles from HDFS to the region directory,
>> > > > > > > does the regionserver localise each HFile by downloading it
>> > > > > > > locally and then uploading it again into the region directory?
>> > > > > > > Or does it just move the file to the region directory and wait
>> > > > > > > for the next compaction to localise it, as in the regionserver
>> > > > > > > failure case?
>> > > > > > >
>> > > > > > >
>> > > > > > >
>> > > > > > >
>> > > > > > > On Mon, Aug 17, 2015 at 11:00 PM, Ted Yu <[email protected]
>> >
>> > > > wrote:
>> > > > > > >
>> > > > > > > > For both scenarios you mentioned, the field is not the leading
>> > > > > > > > part of the row key. You would need to specify a time range or
>> > > > > > > > start row / stop row to narrow the key range being scanned.
>> > > > > > > >
>> > > > > > > > I am leaning toward using the second approach.
>> > > > > > > >
>> > > > > > > > Cheers
>> > > > > > > >
>> > > > > > > > On Mon, Aug 17, 2015 at 9:41 AM, Shushant Arora <
>> > > > > > > [email protected]
>> > > > > > > > >
>> > > > > > > > wrote:
>> > > > > > > >
>> > > > > > > > > ~8-10 fields: 5 of 20 bytes each and 3 of 200 bytes each.
>> > > > > > > > >
>> > > > > > > > > On Mon, Aug 17, 2015 at 9:55 PM, Ted Yu <
>> [email protected]
>> > >
>> > > > > wrote:
>> > > > > > > > >
>> > > > > > > > > > How many fields such as F1 are you considering embedding in
>> > > > > > > > > > the row key?
>> > > > > > > > > >
>> > > > > > > > > > Suggested reading:
>> > > > > > > > > > http://hbase.apache.org/book.html#rowkey.design
>> > > > > > > > > > http://hbase.apache.org/book.html#client.filter.kvm (see ColumnPrefixFilter)
>> > > > > > > > > >
>> > > > > > > > > > Cheers
>> > > > > > > > > >
>> > > > > > > > > > On Mon, Aug 17, 2015 at 8:13 AM, Shushant Arora <
>> > > > > > > > > [email protected]
>> > > > > > > > > > >
>> > > > > > > > > > wrote:
>> > > > > > > > > >
>> > > > > > > > > > > 1. So the size limit is per cell's identifier + value?
>> > > > > > > > > > >
>> > > > > > > > > > > Which is more optimal: to have a field in the key or in a
>> > > > > > > > > > > column of a column family, if the pattern is that every row
>> > > > > > > > > > > has that field?
>> > > > > > > > > > >
>> > > > > > > > > > > Say I have a field F1 in all rows, so:
>> > > > > > > > > > > Situation-1
>> > > > > > > > > > > key1#F1 (as composite key), and the rest of the fields in columns
>> > > > > > > > > > >
>> > > > > > > > > > > Situation-2
>> > > > > > > > > > > key1 as key and F1 as part of a column family.
>> > > > > > > > > > >
>> > > > > > > > > > >
>> > > > > > > > > > > This is the main reason I asked about the key size limit.
>> > > > > > > > > > > If I ask for the number of rows where F1 = 'someval', will it
>> > > > > > > > > > > be faster in situation-1 than in situation-2, since in
>> > > > > > > > > > > situation-1 it can return the result just by traversing keys,
>> > > > > > > > > > > with no need to read columns?
>> > > > > > > > > > >
>> > > > > > > > > > >
>> > > > > > > > > > > On Mon, Aug 17, 2015 at 8:27 PM, Ted Yu <
>> > > [email protected]
>> > > > >
>> > > > > > > wrote:
>> > > > > > > > > > >
>> > > > > > > > > > > > For #1, it is the limit on a single keyvalue, not row, not key.
>> > > > > > > > > > > >
>> > > > > > > > > > > > For #2, please see the following:
>> > > > > > > > > > > >
>> > > > > > > > > > > > http://hbase.apache.org/book.html#store.memstore
>> > > > > > > > > > > > http://hbase.apache.org/book.html#regionserver_splitting_implementation
>> > > > > > > > > > > >
>> > > > > > > > > > > > Cheers
>> > > > > > > > > > > >
>> > > > > > > > > > > > On Mon, Aug 17, 2015 at 7:36 AM, Shushant Arora <
>> > > > > > > > > > > [email protected]
>> > > > > > > > > > > > >
>> > > > > > > > > > > > wrote:
>> > > > > > > > > > > >
>> > > > > > > > > > > > > 1. Is hbase.client.keyvalue.maxsize the max size of a row or
>> > > > > > > > > > > > > of a key only? Is there any limit on key size alone?
>> > > > > > > > > > > > > 2. The access pattern is mostly key based. Are memstores and
>> > > > > > > > > > > > > regions on a regionserver per table? If I have multiple
>> > > > > > > > > > > > > tables, will it have multiple memstores, instead of the few
>> > > > > > > > > > > > > it would have had with one large table?
>> > > > > > > > > > > > >
>> > > > > > > > > > > > >
>> > > > > > > > > > > > > On Mon, Aug 17, 2015 at 7:29 PM, Ted Yu <
>> > > > > [email protected]
>> > > > > > >
>> > > > > > > > > wrote:
>> > > > > > > > > > > > >
>> > > > > > > > > > > > > > For #1, take a look at the following in hbase-default.xml:
>> > > > > > > > > > > > > >
>> > > > > > > > > > > > > >     <name>hbase.client.keyvalue.maxsize</name>
>> > > > > > > > > > > > > >     <value>10485760</value>
>> > > > > > > > > > > > > >
>> > > > > > > > > > > > > > For #2, it would be easier to answer if you can outline the
>> > > > > > > > > > > > > > access patterns in your app.
>> > > > > > > > > > > > > >
>> > > > > > > > > > > > > > For #3, adjustment according to current region boundaries is
>> > > > > > > > > > > > > > done client side. Take a look at the javadoc for LoadQueueItem
>> > > > > > > > > > > > > > in LoadIncrementalHFiles.java
>> > > > > > > > > > > > > >
>> > > > > > > > > > > > > > Cheers
>> > > > > > > > > > > > > >
>> > > > > > > > > > > > > > On Mon, Aug 17, 2015 at 6:45 AM, Shushant Arora
>> <
>> > > > > > > > > > > > > [email protected]
>> > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > wrote:
>> > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > 1. Is there any max limit on the key size of an HBase table?
>> > > > > > > > > > > > > > > 2. Multiple small tables vs one large table: which one is
>> > > > > > > > > > > > > > > preferred?
>> > > > > > > > > > > > > > > 3. For bulk load, when LoadIncrementalHFiles is run, it
>> > > > > > > > > > > > > > > recalculates the region splits based on region boundaries.
>> > > > > > > > > > > > > > > Does this division happen on the client side, or on the
>> > > > > > > > > > > > > > > server side at the region server or HBase master, which then
>> > > > > > > > > > > > > > > assigns the splits that cross a target region boundary to the
>> > > > > > > > > > > > > > > desired regionserver?
>> > > > > > > > > > > > > > >
>> > > > > > > > > > > > > >
>> > > > > > > > > > > > >
>> > > > > > > > > > > >
>> > > > > > > > > > >
>> > > > > > > > > >
>> > > > > > > > >
>> > > > > > > >
>> > > > > > >
>> > > > > >
>> > > > >
>> > > >
>> > >
>> > >
>> > >
>> > > --
>> > > Thanks & Regards,
>> > > Anil Gupta
>> > >
>> >
>>
>
>
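
The scenario asked about in this thread (date-prefixed row keys arriving in increasing order, with a full region split near its middle) can be explored with a small toy model. The Python sketch below is only an illustration under stated assumptions: regions are plain sorted lists, a region splits once it exceeds a row-count threshold, and the split point is the median key. None of these names or rules are taken from HBase itself.

```python
from bisect import bisect_right, insort
from datetime import date, timedelta


def simulate_splits(keys, max_rows):
    """Toy model of region splitting for a stream of row keys.

    Assumptions (not HBase behaviour): a region is a sorted list of keys,
    it splits once it holds more than max_rows keys, and the split point
    is the median key of the region.
    """
    regions = [[]]  # start with a single region covering the whole key space
    for key in keys:
        # Start keys of every region except the first (which is unbounded).
        starts = [r[0] for r in regions[1:]]
        idx = bisect_right(starts, key)
        region = regions[idx]
        insort(region, key)
        if len(region) > max_rows:
            mid = len(region) // 2
            regions[idx:idx + 1] = [region[:mid], region[mid:]]
    return regions


def time_series_keys(start, days, rows_per_day):
    """Yield keys shaped like yyyy-mm-dd#NNNN in increasing date order."""
    for d in range(days):
        day = (start + timedelta(days=d)).isoformat()
        for i in range(rows_per_day):
            yield f"{day}#{i:04d}"


regions = simulate_splits(time_series_keys(date(2015, 8, 1), 30, 100), 400)
for r in regions:
    # first key, last key, number of rows in the region
    print(r[0], r[-1], len(r))
```

In this toy run every region except the last stops growing at 200 rows, half of the 400-row threshold, and only the last region keeps receiving writes. That matches the half-filled pattern the question describes for monotonically increasing keys.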
