RE: Sorting columns

Jonathan Gray Sat, 19 Jun 2010 09:22:11 -0700

Just so there is no confusion: everything is sorted in HBase.  All columns in
each family are sorted, always.


There are optimizations for Get queries (present in 0.20 but gone in trunk)
that mean the results returned to the client are mostly, but not completely,
sorted.  Are you returning millions of columns at once?  Otherwise calling
Result.sorted() in the client shouldn't be too expensive.
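
As a rough illustration of what a client-side sort costs, here is a minimal
sketch that orders column qualifiers as raw bytes. It assumes qualifiers
compare as unsigned lexicographic byte arrays; the class and method names are
hypothetical and are not part of the HBase client API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration; not part of the HBase client API.
final class ClientSideSortSketch {
    /** Returns a new list of qualifiers in unsigned lexicographic byte order. */
    static List<byte[]> sortQualifiers(List<byte[]> qualifiers) {
        List<byte[]> sorted = new ArrayList<>(qualifiers);
        // Arrays.compareUnsigned (Java 9+) treats each byte as a value 0..255,
        // so 0x80 sorts after 0x7f instead of before 0x00.
        sorted.sort((a, b) -> Arrays.compareUnsigned(a, b));
        return sorted;
    }
}
```

The sort is O(n log n) in the number of columns returned, which is why it
only becomes a concern for very wide rows.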

> -----Original Message-----
> From: Andrey Stepachev [mailto:[email protected]]
> Sent: Saturday, June 19, 2010 5:45 AM
> To: [email protected]
> Subject: Re: Sorting columns
> 
> 2010/6/19 Stack <[email protected]>
> 
> > On Thu, Jun 17, 2010 at 12:18 PM, Andrey Stepachev <[email protected]>
> > wrote:
> > > As far as I can see in the sources, there is no place where KVs are
> > > sorted (except the client-side Result.sorted() method). So we can get
> > > KeyValues from the stores and from the memstore (for example 1 3 5
> > > from the stores and 4 from the memstore) in incorrect order.
> > >
> > > Or am I missing something?
> > >
> >
> > Data is sorted in HBase.  When scanning, we run a scanner against each
> > data store element -- the memstore and one for each store file -- and
> > pop off the elements in order.  That's the general story.  There may
> > once have been a legitimate reason for the client-side sort -- perhaps
> > it was needed when our Get and Scan code paths differed -- but as to
> > whether it is still required, I'm not sure.  I'd have to dig.  Anyone
> > else?
> >
> 
> It would be very interesting to know whether HBase guarantees the ordering
> of columns. If someone uses very wide rows, they are not very useful
> without sorting (and of course one should be aware of the partitioning
> problem for wide rows). Suppose we want to work with time data: we can use
> dates as qualifiers and expect the data back in sorted order. We can't
> sort it somewhere else, because we would lose most of HBase's advantage.
> 
> 
> 
> >
> > >
> > >> > The rest of the data needs to be accessed occasionally. We want to
> > >> > avoid getting it shipped to the client as it makes our map reduce
> > >> > job go out of memory.
> > >> >
> > >>
> > >> You are not using incremental get on a row?  You should be able to
> > >> get your big rows piecemeal.
> > >>
> > > These scanner API changes (the intra-row scanner) were not included
> > > in 0.20.4 :(
> > >
> >
> > Oh.
> >
> > Sorry about that Andrey.  Somehow we missed your backport of
> > HBASE-1537.  I just applied it.  It'll appear in the 0.20.5RC4 I'm
> > rolling now.  Please excuse our bungling.
> >
> 
> Not a problem. I'll wait for 0.20.5. But I should warn that with this
> patch 0.20.5 will not be wire compatible with 0.20.4 (because the patch
> adds an additional field to Scan, which makes Scan binary incompatible).
> 
> Personally, I'm not using the intra-row scanner now because of the unknown
> ordering; I use compound keys instead.
> Moreover, intra-row scanning should use a separate API (giving Result the
> ability to fetch additional KVs for a given row) to be more usable and
> easier to use.
