And will using KeyPrefixRegionSplitPolicy instead of the default IncreasingToUpperBoundRegionSplitPolicy help here?
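
To make the concern concrete, here is a minimal Python sketch of a toy model. It is not HBase's split code: the "middle key of the sorted region" split and the "truncate the split key to a fixed-length prefix" variant are both simplifying assumptions, and every function name below is hypothetical. It only shows which child region date-prefixed keys would land in when new data always carries later dates.

```python
from datetime import date, timedelta


def make_keys(start, days, per_day):
    """Build sorted toy row keys of the form yyyy-mm-dd#NNNN."""
    return [
        f"{(start + timedelta(days=d)).isoformat()}#{i:04d}"
        for d in range(days)
        for i in range(per_day)
    ]


def midpoint_split(keys):
    """Toy split point: the middle key of the sorted region (assumption)."""
    return keys[len(keys) // 2]


def prefix_split(keys, prefix_len):
    """Toy prefix-aligned split point: midpoint key cut to prefix_len chars."""
    return midpoint_split(keys)[:prefix_len]


def route(key, split_key):
    """Keys below the split point go to child1, the rest to child2."""
    return "child1" if key < split_key else "child2"


# Parent region covering 2015-08-01 .. 2015-08-06, three rows per day.
parent = make_keys(date(2015, 8, 1), days=6, per_day=3)
# New data arrives only for later dates.
incoming = make_keys(date(2015, 8, 7), days=3, per_day=3)

for name, split_key in [
    ("midpoint", midpoint_split(parent)),
    ("prefix(10)", prefix_split(parent, 10)),
]:
    targets = {route(k, split_key) for k in incoming}
    print(name, split_key, sorted(targets))
```

In this toy model the prefix-aligned variant only moves the boundary onto a whole date; keys with later dates still sort above either split point.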

On Wed, Aug 19, 2015 at 10:23 AM, Shushant Arora <[email protected]> wrote:

> When the last region gets new data and is split in two, what is the split
> point? Say the last region had 10 files and the split algorithm decided to
> split this region.
>
> Will the two child regions have 5 files each? Or will the key space of the
> original (parent) region, say with range (2015-08-01#guid to
> 2015-08-06#guid), be divided into 2 equal parts, so that child1 has
> (2015-08-01#guid to 2015-08-03#guid) and child2 has (2015-08-04#guid to
> 2015-08-06#guid), with all data rewritten in the child regions to
> accommodate this key range? And then, since the data is time-series based,
> new data will come in with increasing dates, only for dates > 2015-08-06,
> so it will go to child2 and child1 will always be half filled. Only child2
> will lead to new splits when it reaches the split size threshold.
>
> On Wed, Aug 19, 2015 at 4:16 AM, Ted Yu <[email protected]> wrote:
>
>> Since year and month are part of the row key in this scenario (instead of
>> just the day of month), the last region would get new data and be split.
>>
>> Is this effect desirable for your app?
>>
>> Cheers
>>
>> On Tue, Aug 18, 2015 at 12:55 PM, Shushant Arora <[email protected]> wrote:
>>
>> > For an hbase key containing time as prefix, say (yyyy-mm-dd#other fields
>> > of guid base), I am using bulk load to avoid hot spotting of a
>> > regionserver (avoiding writes to the WAL).
>> >
>> > What should the initial region splits be? Say I have 30 regionservers.
>> >
>> > Will it work to use the initial 30 days as initial splits and then let
>> > auto split take care of splitting regions as they grow further?
>> > Or, since the key has the date as prefix, when a region is split in 2
>> > midway and new data comes only for increasing dates, will that lead to
>> > one region always being half filled, with the remaining half never
>> > filled?
>> >
>> > On Tue, Aug 18, 2015 at 9:41 PM, anil gupta <[email protected]> wrote:
>> >
>> > > In my experience, Phoenix is far superior to the Hive-HBase
>> > > integration for SQL-like querying on HBase, because Phoenix is built
>> > > on top of HBase, unlike Hive.
>> > >
>> > > On Tue, Aug 18, 2015 at 9:09 AM, Ted Yu <[email protected]> wrote:
>> > >
>> > > > To my knowledge, Phoenix provides better integration with hbase.
>> > > >
>> > > > A third possibility is Spark on HBase.
>> > > >
>> > > > If you want to explore these alternatives, I suggest asking on the
>> > > > respective mailing lists, where you can get expert opinions.
>> > > >
>> > > > Cheers
>> > > >
>> > > > On Tue, Aug 18, 2015 at 9:03 AM, Shushant Arora <[email protected]> wrote:
>> > > >
>> > > > > Thanks!
>> > > > >
>> > > > > Which one is better for SQL-like queries over hbase (queries
>> > > > > involving filters, key range scans, and aggregates by column
>> > > > > values)?
>> > > > >
>> > > > > 1. Hive storage handlers
>> > > > > 2. Phoenix
>> > > > >
>> > > > > On Tue, Aug 18, 2015 at 9:14 PM, Ted Yu <[email protected]> wrote:
>> > > > >
>> > > > > > For #1, if you want to count distinct values for F1, you can
>> > > > > > write a coprocessor which aggregates the count on the region
>> > > > > > server and returns the result to the client, which does the
>> > > > > > final aggregation.
>> > > > > >
>> > > > > > Take a look at
>> > > > > > hbase-server/src/main/java/org/apache/hadoop/hbase/coprocessor/AggregateImplementation.java
>> > > > > > and related classes for an example.
>> > > > > >
>> > > > > > On Mon, Aug 17, 2015 at 10:08 PM, Shushant Arora <[email protected]> wrote:
>> > > > > >
>> > > > > > > Thanks!
>> > > > > > >
>> > > > > > > A few more doubts:
>> > > > > > >
>> > > > > > > 1. Say the requirement is to count distinct values of F1.
>> > > > > > >
>> > > > > > > If the field is part of the key, can't hbase just scan the
>> > > > > > > keys, skip value deserialisation, and return the result to
>> > > > > > > the client, which will calculate the distinct count? Whereas
>> > > > > > > in the second approach hbase will deserialise the value of
>> > > > > > > the returned column containing F1 and send it to the client,
>> > > > > > > which will calculate the distinct count.
>> > > > > > >
>> > > > > > > 2. For bulk load, when LoadIncrementalHFiles runs and the
>> > > > > > > regionserver moves the hfiles from hdfs to the region
>> > > > > > > directory, does the regionserver localise the hfile by
>> > > > > > > downloading it locally and then uploading it again to the
>> > > > > > > region directory? Or does it just move it to the region
>> > > > > > > directory and wait for the next compaction to localise it,
>> > > > > > > as in the regionserver failure case?
>> > > > > > >
>> > > > > > > On Mon, Aug 17, 2015 at 11:00 PM, Ted Yu <[email protected]> wrote:
>> > > > > > >
>> > > > > > > > For both scenarios you mentioned, the field is not the
>> > > > > > > > leading part of the row key. You would need to specify a
>> > > > > > > > timerange or start row / stop row to narrow the key range
>> > > > > > > > being scanned.
>> > > > > > > >
>> > > > > > > > I am leaning toward using the second approach.
>> > > > > > > >
>> > > > > > > > Cheers
>> > > > > > > >
>> > > > > > > > On Mon, Aug 17, 2015 at 9:41 AM, Shushant Arora <[email protected]> wrote:
>> > > > > > > >
>> > > > > > > > > ~8-10 fields: 5 of size 20 bytes each and 3 fields of
>> > > > > > > > > size 200 bytes each.
>> > > > > > > > >
>> > > > > > > > > On Mon, Aug 17, 2015 at 9:55 PM, Ted Yu <[email protected]> wrote:
>> > > > > > > > >
>> > > > > > > > > > How many fields such as F1 are you considering for
>> > > > > > > > > > embedding in the row key?
>> > > > > > > > > >
>> > > > > > > > > > Suggested reading:
>> > > > > > > > > > http://hbase.apache.org/book.html#rowkey.design
>> > > > > > > > > > http://hbase.apache.org/book.html#client.filter.kvm
>> > > > > > > > > > (see ColumnPrefixFilter)
>> > > > > > > > > >
>> > > > > > > > > > Cheers
>> > > > > > > > > >
>> > > > > > > > > > On Mon, Aug 17, 2015 at 8:13 AM, Shushant Arora <[email protected]> wrote:
>> > > > > > > > > >
>> > > > > > > > > > > 1. So the size limit is per cell's identifier +
>> > > > > > > > > > > value?
>> > > > > > > > > > >
>> > > > > > > > > > > Which is more optimal: to have the field in the key
>> > > > > > > > > > > or in a column of a column family, if every row has
>> > > > > > > > > > > that field?
>> > > > > > > > > > >
>> > > > > > > > > > > Say I have a field F1 in all rows, so:
>> > > > > > > > > > >
>> > > > > > > > > > > Situation-1
>> > > > > > > > > > > key1#F1 (as composite key), and the rest of the
>> > > > > > > > > > > fields in columns.
>> > > > > > > > > > >
>> > > > > > > > > > > Situation-2
>> > > > > > > > > > > key1 as key and F1 part of a column family.
>> > > > > > > > > > >
>> > > > > > > > > > > This is the main reason I asked about the key size
>> > > > > > > > > > > limit. If I ask for the number of rows where
>> > > > > > > > > > > F1 = 'someval', will it be faster in situation-1
>> > > > > > > > > > > than in situation-2, since in situation-1 it can
>> > > > > > > > > > > return the result just by traversing keys, with no
>> > > > > > > > > > > need to read columns?
>> > > > > > > > > > >
>> > > > > > > > > > > On Mon, Aug 17, 2015 at 8:27 PM, Ted Yu <[email protected]> wrote:
>> > > > > > > > > > >
>> > > > > > > > > > > > For #1, it is the limit on a single keyvalue,
>> > > > > > > > > > > > not row, not key.
>> > > > > > > > > > > >
>> > > > > > > > > > > > For #2, please see the following:
>> > > > > > > > > > > >
>> > > > > > > > > > > > http://hbase.apache.org/book.html#store.memstore
>> > > > > > > > > > > > http://hbase.apache.org/book.html#regionserver_splitting_implementation
>> > > > > > > > > > > >
>> > > > > > > > > > > > Cheers
>> > > > > > > > > > > >
>> > > > > > > > > > > > On Mon, Aug 17, 2015 at 7:36 AM, Shushant Arora <[email protected]> wrote:
>> > > > > > > > > > > >
>> > > > > > > > > > > > > 1. Is hbase.client.keyvalue.maxsize the max
>> > > > > > > > > > > > > size of a row or of the key only? Is there any
>> > > > > > > > > > > > > limit on key size alone?
>> > > > > > > > > > > > > 2. The access pattern is mostly key based. Are
>> > > > > > > > > > > > > memstores and regions on a regionserver per
>> > > > > > > > > > > > > table? If I have multiple tables, will there
>> > > > > > > > > > > > > be multiple memstores, instead of the few
>> > > > > > > > > > > > > there would have been with one large table?
>> > > > > > > > > > > > >
>> > > > > > > > > > > > > On Mon, Aug 17, 2015 at 7:29 PM, Ted Yu <[email protected]> wrote:
>> > > > > > > > > > > > >
>> > > > > > > > > > > > > > For #1, take a look at the following in
>> > > > > > > > > > > > > > hbase-default.xml:
>> > > > > > > > > > > > > >
>> > > > > > > > > > > > > > <name>hbase.client.keyvalue.maxsize</name>
>> > > > > > > > > > > > > > <value>10485760</value>
>> > > > > > > > > > > > > >
>> > > > > > > > > > > > > > For #2, it would be easier to answer if you
>> > > > > > > > > > > > > > can outline the access patterns in your app.
>> > > > > > > > > > > > > >
>> > > > > > > > > > > > > > For #3, adjustment according to current
>> > > > > > > > > > > > > > region boundaries is done client side. Take
>> > > > > > > > > > > > > > a look at the javadoc for LoadQueueItem in
>> > > > > > > > > > > > > > LoadIncrementalHFiles.java
>> > > > > > > > > > > > > >
>> > > > > > > > > > > > > > Cheers
>> > > > > > > > > > > > > >
>> > > > > > > > > > > > > > On Mon, Aug 17, 2015 at 6:45 AM, Shushant Arora <[email protected]> wrote:
>> > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > 1. Is there any max limit on the key size
>> > > > > > > > > > > > > > > of an hbase table?
>> > > > > > > > > > > > > > > 2. Multiple small tables vs one large
>> > > > > > > > > > > > > > > table: which one is preferred?
>> > > > > > > > > > > > > > > 3. For bulk load, when
>> > > > > > > > > > > > > > > LoadIncrementalHFiles is run, it
>> > > > > > > > > > > > > > > recalculates the region splits based on
>> > > > > > > > > > > > > > > region boundaries. Does this division
>> > > > > > > > > > > > > > > happen on the client side, or on the
>> > > > > > > > > > > > > > > server side again at the region server or
>> > > > > > > > > > > > > > > hbase master, which then assigns the
>> > > > > > > > > > > > > > > splits that cross the target region
>> > > > > > > > > > > > > > > boundary to the desired regionserver?
>> > >
>> > > --
>> > > Thanks & Regards,
>> > > Anil Gupta
