2010/6/19 Jonathan Gray <[email protected]>

> So there is no confusion, everything is sorted in HBase.  All columns in
> each family are sorted, always.
>

That's good news, thanks! I had neither the time nor enough knowledge of
HBase to check this myself. Now it's clear (and I always use Scan for now).


>
> There are optimizations for Get queries (in 0.20 but gone in trunk) that
> make it so that what gets returned to the client is not completely sorted
> though it would be mostly sorted.

Is it true that if I use a Scan (even when the scan is really a get) in 0.20,
I'll get everything sorted?


> Are you returning millions of columns at once?  Otherwise it shouldn't be
> too expensive to do the sorted() call in the client.
>
I got an OOM when I tried to build an index (I have one index key that points
to 5 million other keys, so the OOM happened on the server). With intra-row
scanning I can scan these columns (mostly in MR jobs) to do some work.
After the OOM, I changed the schema to use compound keys. Building such keys
is a bit complicated (compared with a simple LongWritable and friends).
Maybe Avro can help, but I haven't tried it yet. With intra-row scanning the
Result handling gets slightly more complicated (I need to detect a real key
change), but that is still less complicated than compound keys.
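
To show what I mean by compound keys, here is a minimal sketch. It assumes
the row key is simply the index key followed by the target key, both longs;
the class name CompoundKeys and this layout are only illustrative, not my
real schema. HBase compares row keys as unsigned bytes, so the sign bit of
each long is flipped to keep negative values sorted before positive ones.

```java
import java.nio.ByteBuffer;

// Hypothetical helper: packs (indexKey, targetKey) into a 16-byte row key
// whose unsigned byte-wise order matches the numeric order of the pair.
final class CompoundKeys {
    private CompoundKeys() {}

    // Big-endian encoding; XOR with Long.MIN_VALUE flips the sign bit so
    // that negative longs sort before positive ones as unsigned bytes.
    static byte[] encode(long indexKey, long targetKey) {
        return ByteBuffer.allocate(16)
                .putLong(indexKey ^ Long.MIN_VALUE)
                .putLong(targetKey ^ Long.MIN_VALUE)
                .array();
    }

    // Recovers the first component of the compound key.
    static long indexKey(byte[] row) {
        return ByteBuffer.wrap(row).getLong(0) ^ Long.MIN_VALUE;
    }

    // Recovers the second component of the compound key.
    static long targetKey(byte[] row) {
        return ByteBuffer.wrap(row).getLong(8) ^ Long.MIN_VALUE;
    }

    // Unsigned lexicographic comparison, the order row keys are stored in.
    static int compare(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int x = a[i] & 0xff;
            int y = b[i] & 0xff;
            if (x != y) {
                return Integer.compare(x, y);
            }
        }
        return Integer.compare(a.length, b.length);
    }
}
```

With keys like these, all entries for one index key are adjacent rows, so a
scan starting at encode(indexKey, Long.MIN_VALUE) walks them in target-key
order without loading one huge row.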



>
> > -----Original Message-----
> > From: Andrey Stepachev [mailto:[email protected]]
> > Sent: Saturday, June 19, 2010 5:45 AM
> > To: [email protected]
> > Subject: Re: Sorting columns
> >
> > 2010/6/19 Stack <[email protected]>
> >
> > > On Thu, Jun 17, 2010 at 12:18 PM, Andrey Stepachev <[email protected]>
> > > wrote:
> > > > As far as I can see in the sources, there is no place where KVs are
> > > > sorted (except the client-side Result.sorted() method). So we can get
> > > > KeyValues from the stores and from the memstore (in which case we can
> > > > get 1 3 5 from the stores and 4 from the memstore) in incorrect order.
> > > >
> > > > Or am I missing something?
> > > >
> > >
> > > Data is sorted in HBase.  When scanning, we run a scanner against
> > > each data store element -- the memstore and one for each store file --
> > > and pop off the elements in order.  That's the general story.  There
> > > may once have been a legitimate reason for the client-side sort --
> > > perhaps it was needed when our Get and Scan code paths differed -- but
> > > as to whether it is still required, I'm not sure.  I'd have to dig.
> > > Anyone else?
> > >
> >
> > It would be very interesting to know whether HBase guarantees column
> > ordering. If someone uses very wide rows, they are not very useful
> > without sorting (and of course one should be aware of the partitioning
> > problem with wide rows).
> > Suppose we want to work with time data. In that case we can use dates
> > as qualifiers and expect the data in sorted order. We can't sort it
> > somewhere else, because we would lose most of HBase's advantages.
> >
> >
> >
> > >
> > > >
> > > >> > The rest of the data needs to be accessed occasionally. We want to
> > > >> > avoid getting it shipped to the client as it makes our map reduce
> > > >> > job go out of memory.
> > > >> >
> > > >>
> > > >> You are not using incremental get on a row?  You should be able to
> > > >> get your big rows piecemeal.
> > > >>
> > > > These scanner API changes were not included in 0.20.4 :( (the
> > > > intra-row scanner).
> > > >
> > >
> > > Oh.
> > >
> > > Sorry about that Andrey.  Somehow we missed your backport of
> > > HBASE-1537.  I just applied it.  It'll appear in the 0.20.5RC4 I'm
> > > rolling now.  Please excuse our bungling.
> > >
> >
> > Not a problem. I'll wait for 0.20.5. But I should warn that with this
> > patch 0.20.5 will not be wire compatible with 0.20.4 (because the patch
> > adds an additional field to Scan, which makes Scan binary incompatible).
> >
> > Personally, I'm not using the intra-row scanner right now because of the
> > unknown ordering; I use compound keys.
> > Moreover, intra-row scanning should use a separate API (giving Result
> > the ability to fetch additional KVs for a given row) to be more usable
> > and easier to use.
>
