RE: Sorting columns

Jonathan Gray Mon, 21 Jun 2010 10:09:18 -0700

There will be a development release sometime next week but that will not be 
recommended for production usage.


There is no release date for the full version but I think we're hoping to have 
a release candidate before the end of July.

> -----Original Message-----
> From: Vaibhav Puranik [mailto:[email protected]]
> Sent: Monday, June 21, 2010 9:48 AM
> To: [email protected]
> Subject: Re: Sorting columns
> 
> Jon, Stack,
> 
> Is there a tentative date when this version (with column scanner) is
> coming out?
> 
> Vaibhav
> 
> On Mon, Jun 21, 2010 at 9:28 AM, Jonathan Gray <[email protected]>
> wrote:
> 
> > Yes, when using Scan, even on 0.20, everything will be sorted.
> >
> > Re: OOM, you'll need more memory or you'll need to break stuff up across
> > rows.  Not much else to be done about that :)
> >
> > > -----Original Message-----
> > > From: Andrey Stepachev [mailto:[email protected]]
> > > Sent: Monday, June 21, 2010 6:40 AM
> > > To: [email protected]
> > > Subject: Re: Sorting columns
> > >
> > > 2010/6/19 Jonathan Gray <[email protected]>
> > >
> > > > So there is no confusion, everything is sorted in HBase.  All columns in
> > > > each family are sorted, always.
> > > >
> > >
> > > That's good news! Thanks. I had no time (and not enough knowledge of HBase)
> > > to check this myself. Now it's clear (and I always use Scan for now).
> > >
> > >
> > > >
> > > > There are optimizations for Get queries (in 0.20 but gone in trunk) that
> > > > make it so that what gets returned to the client is not completely sorted,
> > > > though it would be mostly sorted.
> > >
> > > Is it true that if I use Scan (even when the scan is really a get) in 0.20,
> > > I'll get everything sorted?
> > >
> > >
> > > > Are you returning millions of columns at once?  Otherwise it shouldn't be
> > > > too expensive to do the sorted() call in the client.
> > > >
> > > I got an OOM when I tried to build an index (I have one index key which
> > > points to 5 million other keys, so I got an OOM in the server). With the
> > > intra-row scanner I can scan these columns (mostly in MR jobs) to do some work.
> > > After I got the OOM, I changed the schema to use compound keys. It is a bit
> > > complicated to make such keys (instead of a simple LongWritable and friends).
> > > Maybe Avro can help, but I haven't tried it yet. With the intra-row scanner I
> > > get a slightly more complicated Result scan (I need to detect a real key
> > > change), but this way is less complicated than compound keys.
> > >
> > >
> > >
> > > >
> > > > > -----Original Message-----
> > > > > From: Andrey Stepachev [mailto:[email protected]]
> > > > > Sent: Saturday, June 19, 2010 5:45 AM
> > > > > To: [email protected]
> > > > > Subject: Re: Sorting columns
> > > > >
> > > > > 2010/6/19 Stack <[email protected]>
> > > > >
> > > > > > > On Thu, Jun 17, 2010 at 12:18 PM, Andrey Stepachev <[email protected]>
> > > > > > > wrote:
> > > > > > > > As far as I can see in the sources, there is no place where KVs are
> > > > > > > > sorted (except the client Result.sorted() method). So we can get
> > > > > > > > KeyValues from the store and from the memstore (in this case, 1 3 5
> > > > > > > > from the stores and 4 from the memstore) in incorrect order.
> > > > > > > >
> > > > > > > > Or am I missing something?
> > > > > > >
> > > > > >
> > > > > > > Data is sorted in hbase.  Scanning, we'll be running a scanner against
> > > > > > > each data store element -- memstore and one for each store file -- and
> > > > > > > we'll pop off the elements in order.  That's the general story.  There
> > > > > > > may once have been a legitimate reason for the client-side sort --
> > > > > > > perhaps when our Get and Scan code paths differed it was needed -- but
> > > > > > > as to whether it is still required, I'm not sure.  I'd have to dig.
> > > > > > > Anyone else?
> > > > > >
> > > > >
> > > > > > It would be very interesting to know whether HBase guarantees the
> > > > > > ordering of columns. If someone uses very wide rows, they are not very
> > > > > > useful without sorting (and of course one should know about the
> > > > > > partitioning problem for wide rows).
> > > > > > Suppose that we want to work with time data. In that case we can use
> > > > > > dates as qualifiers and expect the data in sorted order; we can't sort
> > > > > > it somewhere else, because we would lose most of HBase's advantage.
> > > > >
> > > > >
> > > > >
> > > > > >
> > > > > > >
> > > > > > > >> > The rest of the data needs to be accessed occasionally. We want to
> > > > > > > >> > avoid getting it shipped to the client as it makes our map reduce
> > > > > > > >> > job go out of memory.
> > > > > > >> >
> > > > > > >>
> > > > > > > >> You are not using incremental get on a row?  You should be able to
> > > > > > > >> get your big rows piecemeal.
> > > > > > >>
> > > > > > > > These scanner API changes (the intra-row scanner) were not included
> > > > > > > > in 0.20.4 :(
> > > > > > >
> > > > > >
> > > > > > Oh.
> > > > > >
> > > > > > > Sorry about that Andrey.  Somehow we missed your backport of
> > > > > > > HBASE-1537.  I just applied it.  It'll appear in the 0.20.5RC4 I'm
> > > > > > > rolling now.  Please excuse our bungling.
> > > > > >
> > > > >
> > > > > > Not a problem. I'll wait for 0.20.5. But I should warn that with this
> > > > > > patch 0.20.5 will not be wire compatible with 0.20.4 (because this patch
> > > > > > adds an additional field to Scan, and this makes Scan binary
> > > > > > incompatible).
> > > > >
> > > > > > Personally, I'm not using the intra-row scanner right now because of the
> > > > > > unknown ordering; I use compound keys.
> > > > > > Moreover, intra-row scanning should use a separate API (giving Result the
> > > > > > ability to fetch additional KVs for a given row) to be more usable and
> > > > > > easier to use.
> > > >
> >
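
The ordering question in the quoted thread (1 3 5 from the store files and 4
from the memstore) comes down to merging several sources that are each already
sorted. The following is only a minimal sketch of that idea in Python, not
HBase code: the list contents and the merged_scan helper are hypothetical
names chosen for illustration, and real HBase scanners work on KeyValues
rather than integers.

```python
import heapq

# Hypothetical data: each source is already sorted on its own,
# mirroring the "1 3 5 from stores and 4 from memstore" case.
store_file = [1, 3, 5]
memstore = [4]


def merged_scan(*sorted_sources):
    """Return an iterator over all items of the sorted sources in global order.

    heapq.merge repeatedly pops the smallest head element among the
    sources, so no full sort of the combined data is needed.
    """
    return heapq.merge(*sorted_sources)


result = list(merged_scan(store_file, memstore))
print(result)  # [1, 3, 4, 5]
```

Because each source only has to be sorted internally, the merged output is in
order without a separate client-side sort, which matches the general story
Stack describes for Scan.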
