Hi, no, the time is not important, only the stops are.
On Sun, Apr 21, 2013 at 3:34 AM, Ted Yu <[email protected]> wrote:

> Thanks for sharing the information below.
>
> How do you plan to store time (when the bus gets to each stop) in the
> row? Or maybe it is not of importance to you?
>
> On Sat, Apr 20, 2013 at 2:24 PM, Pal Konyves <[email protected]> wrote:
>
> > I am writing a paper for school about HBase, so the data I chose is
> > not a real usable example. I am familiar with GTFS, which is a de
> > facto standard for storing information about public transportation
> > schedules: when a vehicle arrives at a stop and where it goes next.
> >
> > I chose to generate the rows on the fly, where each row represents a
> > sequence of 'bus' stops that make up a route from the first stop to
> > the last stop, e.g.
> > [first_stop_id,last_stop_id],string_sequence_of_stops
> > where the part within the [...] is the rowkey.
> >
> > So long story short, I generate the data. I want to use the HBase
> > Java client API to store the rows with Put. I plan to randomize the
> > load by picking random first_stop_id-s and by using several threads.
> >
> > The rowkeys will still contain sequences, because the way I generate
> > the rows will output about 100-1000 rows starting with the same
> > first_stop_id within the rowkey. The total amount of rows will be in
> > the billions and would take up about 1 TB.
> >
> > On Sat, Apr 20, 2013 at 10:54 PM, Ted Yu <[email protected]> wrote:
> >
> > > The answer to your first question is yes - the midkey of the key
> > > range would be chosen as the split key.
> > >
> > > For #2, can you tell us how you plan to randomize the loading?
> > > Bulk load normally means preparing HFiles which would be loaded
> > > directly into your table.
> > >
> > > Cheers
> > >
> > > On Apr 20, 2013, at 1:11 PM, Pal Konyves <[email protected]> wrote:
> > >
> > > > Hi Ted,
> > > > Only one family. My data is a very simple key-value layout, but I
> > > > want to make sequential scans, so hashing the key is not an
> > > > option.
> > > >
> > > > On Sat, Apr 20, 2013 at 10:07 PM, Ted Yu <[email protected]> wrote:
> > > >
> > > > > How many column families do you have?
> > > > >
> > > > > For #3, pre-splitting the table at the row keys corresponding
> > > > > to the peaks makes sense.
> > > > >
> > > > > On Apr 20, 2013, at 10:52 AM, Pal Konyves <[email protected]> wrote:
> > > > >
> > > > > > Hi,
> > > > > >
> > > > > > I am just reading about region splitting. By default - as I
> > > > > > understand - HBase handles splitting the regions itself. I
> > > > > > just don't know how to imagine on which key it splits a
> > > > > > region.
> > > > > >
> > > > > > 1) For example, when I write MD5 hashes of rowkeys, they are
> > > > > > most probably evenly distributed from 000000... to FFFFF...,
> > > > > > right? When HBase starts with one region, all the writes go
> > > > > > into that region, and when the HFile gets too big, does HBase
> > > > > > just take, for example, the median value of the stored keys
> > > > > > and split the region by that?
> > > > > >
> > > > > > 2) I want to bulk load tons of data with the HBase Java
> > > > > > client API put operations. I want it to perform well.
> > > > > > My keys are numeric sequential values (which I know from
> > > > > > this post I cannot load into HBase sequentially, because the
> > > > > > HBase tables are going to be sad:
> > > > > > http://ikaisays.com/2011/01/25/app-engine-datastore-tip-monotonically-increasing-values-are-bad/
> > > > > > ). So I thought I would pre-split the table into regions and
> > > > > > load the data in randomized order. This way I will get a good
> > > > > > distribution among the region servers in terms of network IO
> > > > > > from the beginning. Is that a good idea?
> > > > > >
> > > > > > 3) If my rowkeys are not evenly distributed in the keyspace
> > > > > > but show some peaks or bursts - e.g. the keyspace is 000-999,
> > > > > > but most of the keys gather around the 020 and 060 values -
> > > > > > is it a good idea to put the pre-split points at those peaks?
> > > > > >
> > > > > > Thanks in advance,
> > > > > > Pal
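
To make #2 and #3 concrete, here is a minimal sketch of creating a
pre-split table with the 0.94-era HBase Java client API. The table
name, family name, and exact split points are illustrative assumptions,
with the splits clustered around the hypothetical 020 and 060 peaks:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.util.Bytes;

public class PreSplitExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HBaseAdmin admin = new HBaseAdmin(conf);

    HTableDescriptor desc = new HTableDescriptor("routes"); // hypothetical table name
    desc.addFamily(new HColumnDescriptor("d"));             // the single family mentioned above

    // Finer-grained split points around the expected hot ranges (020
    // and 060), coarser elsewhere; each value starts a new region.
    byte[][] splits = new byte[][] {
        Bytes.toBytes("015"), Bytes.toBytes("020"), Bytes.toBytes("025"),
        Bytes.toBytes("055"), Bytes.toBytes("060"), Bytes.toBytes("065"),
        Bytes.toBytes("500"),
    };
    admin.createTable(desc, splits);
    admin.close();
  }
}

For evenly distributed keys (the MD5 case in #1), the RegionSplitter
utility's split algorithms would be the usual alternative; explicit
split points like the above fit the skewed keyspace described in #3.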

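And a rough sketch of the randomized, multi-threaded Put loading
described in the thread, under the same assumptions (table 'routes',
single family 'd'; the qualifier name, thread count, row counts, and
the stop-sequence generator are placeholders). Turning auto-flush off
lets the client buffer Puts into larger round trips, which matters at
this volume:

import java.util.Random;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class RandomizedLoader {
  static final byte[] FAMILY = Bytes.toBytes("d");
  static final byte[] QUALIFIER = Bytes.toBytes("stops"); // hypothetical qualifier

  public static void main(String[] args) throws Exception {
    final Configuration conf = HBaseConfiguration.create();
    int threads = 8; // arbitrary
    ExecutorService pool = Executors.newFixedThreadPool(threads);
    for (int t = 0; t < threads; t++) {
      pool.submit(new Runnable() {
        public void run() {
          try {
            // One HTable per thread: HTable instances are not thread-safe.
            HTable table = new HTable(conf, "routes");
            table.setAutoFlush(false); // batch Puts in the client write buffer
            Random rnd = new Random();
            for (int i = 0; i < 1000000; i++) {
              // Random first_stop_id scatters writes across the pre-split regions.
              String first = String.format("%03d", rnd.nextInt(1000));
              String last = String.format("%03d", rnd.nextInt(1000));
              Put put = new Put(Bytes.toBytes("[" + first + "," + last + "]"));
              put.add(FAMILY, QUALIFIER,
                  Bytes.toBytes(stopSequence(first, last)));
              table.put(put);
            }
            table.close(); // flushes any Puts left in the buffer
          } catch (Exception e) {
            e.printStackTrace();
          }
        }
      });
    }
    pool.shutdown();
  }

  // Stand-in for the on-the-fly route generation described above.
  static String stopSequence(String first, String last) {
    return first + "->" + last;
  }
}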