2010/6/19 Jonathan Gray <[email protected]>

> So there is no confusion, everything is sorted in HBase. All columns in
> each family are sorted, always.
>

That's good news! Thanks. I have no time (and not enough knowledge of HBase)
to check this myself. Now it's clear (and I always use Scan for now).

>
> There are optimizations for Get queries (in 0.20 but gone in trunk) that
> make it so that what gets returned to the client is not completely sorted
> though it would be mostly sorted.

Is it true that if I use Scan (even when the scan is really a get) in 0.20,
I'll get everything sorted?

> Are you returning millions of columns at once? Otherwise it shouldn't be
> too expensive to do the sorted() call in the client.
>

I got an OOM when I tried to build an index (I have 1 index key which points
to 5 million other keys, so I got the OOM on the server). With intra-row
scanning I can scan these columns (mostly in MR jobs) to do some work.
After I got the OOM, I changed the schema to use compound keys. It is a bit
complicated to build such keys (instead of simple LongWritable and friends).
Maybe Avro can help, but I haven't tried it yet. With intra-row scanning I get
a slightly more complicated Result scan (I need to detect the real key
change), but this way is less complicated than compound keys.

>
> > -----Original Message-----
> > From: Andrey Stepachev [mailto:[email protected]]
> > Sent: Saturday, June 19, 2010 5:45 AM
> > To: [email protected]
> > Subject: Re: Sorting columns
> >
> > 2010/6/19 Stack <[email protected]>
> >
> > > On Thu, Jun 17, 2010 at 12:18 PM, Andrey Stepachev <[email protected]>
> > > wrote:
> > > > As i see in sources there no place, where kv sorted (except client
> > > > Result.sorted() method). So we can get keyvalues from store and from
> > > > memstore (and in this case we can get 1 3 5 from stores and 4 from
> > > > memstore) in incorrect order.
> > > >
> > > > Or I miss something?
> > > >
> > >
> > > Data is sorted in hbase. Scanning, we'll be running a scanner against
> > > each data store element -- memstore and one for each store file -- and
> > > we'll pop off the elements in order. That's the general story.
> > > There may once have been a legitimate reason for the client-side sort --
> > > perhaps when our Get and Scan code paths differed it was needed -- but
> > > as to whether it still required, I'm not sure. I'd have to dig. Any
> > > one else?
> > >
> >
> > It is very interesting to know, is hbase guarantee ordering in columns.
> > Because if someone will use very wide rows, in absence of sorting, it is
> > not very useful (and of course someone should know about partitioning
> > problem for wide rows).
> > Suppose, that we want to work with time data, in that case we can use
> > qualifiers as date and expect data in sorted order and we can't order it
> > somewhere else, because we will lose most of hbase advantage.
> >
> > > >> > The rest of the data needs to be accessed occasionally. We want
> > > >> > to avoid getting it shipped to the client as it makes our map
> > > >> > reduce job go out of memory.
> > > >> >
> > > >>
> > > >> You are not using incremental get on a row? You should be able to
> > > >> get your big rows piecemeal.
> > > >>
> > > > This scanner api changes was not included in 0.20.4 :( (intra-row
> > > > scanner).
> > > >
> > >
> > > Oh.
> > >
> > > Sorry about that Andrey. Somehow we missed your backport of
> > > HBASE-1537. I just applied it. It'll appear in the 0.20.5RC4 I'm
> > > rolling now. Please excuse our bungling.
> > >
> >
> > Not a problem. I'll wait 0.20.5. But I should warn, that with this patch
> > 0.20.5 will be not wire compatible with 0.20.4 (because this patch adds
> > additional field in Scan, and this make Scan binary incompatible).
> >
> > I'm, personally, not using now intra-row scanner, because of unknown
> > ordering, i use compound keys.
> > More over, intra-row scanning should use separate api (giving Result the
> > ability to fetch additional kvs for given row) to be more usable and easy
> > to use.
>
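
To make the scanner description quoted above concrete (one scanner per
memstore and store file, popping elements off in order), here is a minimal
sketch in plain Java. It is not HBase code: the class SortedMergeSketch, the
Source helper, and the use of integers in place of KeyValues are all
assumptions made for illustration. It only shows why merging individually
sorted sources gives a fully sorted result, using the "1 3 5 from stores and
4 from memstore" case from the thread.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

public class SortedMergeSketch {

    /** Hypothetical stand-in for a scanner over one sorted source
     *  (a memstore or a single store file). */
    private static final class Source implements Comparable<Source> {
        private final Iterator<Integer> it;
        private Integer head; // next element of this source, null when exhausted

        Source(List<Integer> sorted) {
            this.it = sorted.iterator();
            advance();
        }

        void advance() {
            head = it.hasNext() ? it.next() : null;
        }

        @Override
        public int compareTo(Source other) {
            // Only non-exhausted sources are ever placed on the heap.
            return head.compareTo(other.head);
        }
    }

    /** Merges already-sorted lists into one sorted list (k-way merge). */
    static List<Integer> merge(List<List<Integer>> sortedSources) {
        PriorityQueue<Source> heap = new PriorityQueue<>();
        for (List<Integer> s : sortedSources) {
            Source src = new Source(s);
            if (src.head != null) {
                heap.add(src);
            }
        }
        List<Integer> out = new ArrayList<>();
        while (!heap.isEmpty()) {
            Source src = heap.poll();   // source with the smallest head
            out.add(src.head);
            src.advance();
            if (src.head != null) {
                heap.add(src);          // re-insert with its new head
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Integer> storeFile = Arrays.asList(1, 3, 5);
        List<Integer> memstore = Arrays.asList(4);
        System.out.println(merge(Arrays.asList(storeFile, memstore)));
        // prints [1, 3, 4, 5]
    }
}
```

The heap always holds at most one entry per source, so each step costs
O(log k) for k sources, and no global sort of the row is needed.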
