Since year and month are part of the row key in this scenario (instead of just the day of month), the last region would get new data and be split.
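To illustrate, here is a rough sketch in plain Python (not HBase client code). It assumes the table was pre-split on 29 hypothetical day boundaries in August 2015, and that keys compare lexicographically the way HBase compares row-key bytes, which holds for ASCII keys like these. The helper `region_index`, the split keys and the sample row keys are made up for the example.

```python
from bisect import bisect_right


def region_index(split_keys, row_key):
    """Index of the region holding row_key.

    Region 0 covers keys below split_keys[0]; region i covers
    [split_keys[i - 1], split_keys[i]); the last region is open-ended.
    """
    return bisect_right(split_keys, row_key)


# Hypothetical pre-split: boundaries at 2015-08-02 .. 2015-08-30 (30 regions).
split_keys = [f"2015-08-{day:02d}" for day in range(2, 31)]
last_region = len(split_keys)

# A key inside the pre-split range lands in the region for its own day.
print(region_index(split_keys, "2015-08-18#guid-a"))

# Every later date sorts after the final boundary, so it lands in the last region.
for key in ("2015-08-31#guid-b", "2015-09-01#guid-c", "2016-01-15#guid-d"):
    print(key, region_index(split_keys, key) == last_region)
```

Under these assumptions, once the calendar moves past the last boundary, every new key maps to the final region, which is the one that keeps growing and splitting.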
Is this effect desirable for your app?

Cheers

On Tue, Aug 18, 2015 at 12:55 PM, Shushant Arora wrote:

> For an hbase key containing time as prefix, say (yyyy-mm-dd#other fields of guid base), I am using bulk load to avoid hot spots on a regionserver (avoiding writes to the WAL).
>
> What should the initial region splits be? Say I have 30 regionservers.
>
> Will using the initial 30 days as initial splits, and then letting auto split take care of splitting regions as they grow further, serve?
> Or, since the key has the date as prefix, when a region is split in 2 midway and new data comes only for increasing dates, will that lead to one region always being half filled and the other half never being filled?
>
> On Tue, Aug 18, 2015 at 9:41 PM, anil gupta wrote:
>
> > As per my experience, Phoenix is way superior to Hive-HBase integration for sql-like querying on HBase. That is because Phoenix is built on top of HBase, unlike Hive.
> >
> > On Tue, Aug 18, 2015 at 9:09 AM, Ted Yu wrote:
> >
> > > To my knowledge, Phoenix provides better integration with hbase.
> > >
> > > A third possibility is Spark on HBase.
> > >
> > > If you want to explore these alternatives, I suggest asking on the respective mailing lists where you can get expert opinions.
> > >
> > > Cheers
> > >
> > > On Tue, Aug 18, 2015 at 9:03 AM, Shushant Arora wrote:
> > >
> > > > Thanks!
> > > >
> > > > Which one is better for sql kind of queries over hbase (queries involve filter, key range scan), aggregates by column values.
> > > >
> > > > 1. Hive storage handlers
> > > > 2. or Phoenix
> > > >
> > > > On Tue, Aug 18, 2015 at 9:14 PM, Ted Yu wrote:
> > > >
> > > > > For #1, if you want to count distinct values for F1, you can write a coprocessor which aggregates the count on region server and returns the result to client which does the final aggregation.
> > > > >
> > > > > Take a look at hbase-server/src/main/java/org/apache/hadoop/hbase/coprocessor/AggregateImplementation.java and related classes for example.
> > > > >
> > > > > On Mon, Aug 17, 2015 at 10:08 PM, Shushant Arora wrote:
> > > > >
> > > > > > Thanks!
> > > > > > A few more doubts:
> > > > > >
> > > > > > 1. Say the requirement is to count distinct values of F1.
> > > > > >
> > > > > > If the field is part of the key, can't hbase just scan the keys, skip value deserialisation and return the result to the client, which will calculate the distinct? And in the second approach, Hbase will deserialise the value of the returned column containing F1 for the client, which will calculate the distinct.
> > > > > >
> > > > > > 2. For bulk load, when LoadIncrementalHFiles runs and the regionserver moves the hfiles from hdfs to the region directory, does the regionserver localise the hfile by downloading it to local and then uploading it again into the region directory? Or does it just move it to the region directory and wait for the next compaction to get it localised, as in the regionserver failure case?
> > > > > >
> > > > > > On Mon, Aug 17, 2015 at 11:00 PM, Ted Yu wrote:
> > > > > >
> > > > > > > For both scenarios you mentioned, field is not leading part of row key. You would need to specify timerange or start row / stop row to narrow the key range being scanned.
> > > > > > >
> > > > > > > I am leaning toward using second approach.
> > > > > > >
> > > > > > > Cheers
> > > > > > >
> > > > > > > On Mon, Aug 17, 2015 at 9:41 AM, Shushant Arora wrote:
> > > > > > >
> > > > > > > > ~8-10 fields of size (5 of 20 bytes each) and 3 fields of size 200 bytes each.
> > > > > > > >
> > > > > > > > On Mon, Aug 17, 2015 at 9:55 PM, Ted Yu wrote:
> > > > > > > >
> > > > > > > > > How many fields such as F1 are you considering for embedding in row key?
> > > > > > > > >
> > > > > > > > > Suggested reading:
> > > > > > > > > http://hbase.apache.org/book.html#rowkey.design
> > > > > > > > > http://hbase.apache.org/book.html#client.filter.kvm (see ColumnPrefixFilter)
> > > > > > > > >
> > > > > > > > > Cheers
> > > > > > > > >
> > > > > > > > > On Mon, Aug 17, 2015 at 8:13 AM, Shushant Arora wrote:
> > > > > > > > >
> > > > > > > > > > 1. So the size limit is per cell's identifier + value?
> > > > > > > > > >
> > > > > > > > > > What is more optimised: to have the field in the key or in a column family's column? The pattern is that every row has that field.
> > > > > > > > > >
> > > > > > > > > > Say I have a field F1 in all rows, so
> > > > > > > > > > Situation-1
> > > > > > > > > > key1#F1 (as composite key) and the rest of the fields in columns
> > > > > > > > > >
> > > > > > > > > > Situation-2
> > > > > > > > > > key1 as key and F1 part of column family.
> > > > > > > > > >
> > > > > > > > > > This is the main reason I asked about the key size limit.
> > > > > > > > > > If I ask for the number of rows where F1 = 'someval', will it be faster in situation-1 than in situation-2, since in 1 it can return the result just by traversing keys with no need to read columns?
> > > > > > > > > >
> > > > > > > > > > On Mon, Aug 17, 2015 at 8:27 PM, Ted Yu wrote:
> > > > > > > > > >
> > > > > > > > > > > For #1, it is the limit on a single keyvalue, not row, not key.
> > > > > > > > > > >
> > > > > > > > > > > For #2, please see the following:
> > > > > > > > > > >
> > > > > > > > > > > http://hbase.apache.org/book.html#store.memstore
> > > > > > > > > > >
> > > > > > > > > > > http://hbase.apache.org/book.html#regionserver_splitting_implementation
> > > > > > > > > > >
> > > > > > > > > > > Cheers
> > > > > > > > > > >
> > > > > > > > > > > On Mon, Aug 17, 2015 at 7:36 AM, Shushant Arora wrote:
> > > > > > > > > > >
> > > > > > > > > > > > 1. Is hbase.client.keyvalue.maxsize the max size of a row or of the key only? Is there any limit on key size only?
> > > > > > > > > > > > 2. Access pattern is mostly key based only. Are memstores and regions on a regionserver on a per table basis? That is, if I have multiple tables, will there be multiple memstores instead of the few there would have been with one large table?
> > > > > > > > > > > >
> > > > > > > > > > > > On Mon, Aug 17, 2015 at 7:29 PM, Ted Yu wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > > For #1, take a look at the following in hbase-default.xml:
> > > > > > > > > > > > >
> > > > > > > > > > > > > <name>hbase.client.keyvalue.maxsize</name>
> > > > > > > > > > > > > <value>10485760</value>
> > > > > > > > > > > > >
> > > > > > > > > > > > > For #2, it would be easier to answer if you can outline access patterns in your app.
> > > > > > > > > > > > >
> > > > > > > > > > > > > For #3, adjustment according to current region boundaries is done client side. Take a look at the javadoc for LoadQueueItem in LoadIncrementalHFiles.java
> > > > > > > > > > > > >
> > > > > > > > > > > > > Cheers
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Mon, Aug 17, 2015 at 6:45 AM, Shushant Arora wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > > 1. Is there any max limit on the key size of an hbase table?
> > > > > > > > > > > > > > 2. Multiple small tables vs one large table: which one is preferred?
> > > > > > > > > > > > > > 3. For bulk load, when LoadIncrementalHFiles is run it again recalculates the region splits based on region boundaries. Does this division happen on the client side, or server side again at the region server or hbase master, which then assigns the splits that cross the target region boundary to the desired regionserver?
> >
> > --
> > Thanks & Regards,
> > Anil Gupta
