Re: Configurable comparators per-CF?

Stack Fri, 18 Jun 2010 13:46:24 -0700

On Fri, Jun 18, 2010 at 11:43 AM, Jeff Hammerbacher <[email protected]> wrote:
> In Hadoop, it's possible to override how two keys are compared with
> WritableComparable (
> http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/io/WritableComparator.html),
> and the same thing is possible in Cassandra with CompareWith (
> http://wiki.apache.org/cassandra/StorageConfiguration).
>
> Would it be possible to do something similar for the unit of sorting in
> HBase, the ColumnFamily?
>


I think you are asking about row sort but you might be asking about
sort of columns inside a row.  Let me do the former first.

Regarding row key order, the answer is no, not without introducing a
raft of complexity.

All tables hosted by the cluster are sorted on their row key.  Tables
are made of regions.

All cluster regions are listed in the catalog .META. table.  This
table is like any other, also sorted.   Its key is effectively
tablename+startkey-for-the-region.

As is, everything is straightforward because all tables use the same
comparator.
Hunting the region that hosts a particular row is just a case of
comparing the sought-after key against the content of the catalog
table.  The client knows implicitly what comparator to use.
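To make the lookup concrete, here is a minimal sketch, not the actual
HBase client code.  It assumes catalog keys of the form
tablename + separator + startkey (the 0x00 separator, the class name,
and the server names are all made up for illustration) and a single
cluster-wide byte-order comparator.  Finding the hosting region is
then a "floor" lookup in the sorted catalog:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Comparator;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration only; names and key layout are assumptions.
class RegionLookupSketch {
    // The one ordering everybody shares: unsigned lexicographic bytes.
    static final Comparator<byte[]> BYTES =
        (a, b) -> Arrays.compareUnsigned(a, b);

    // Assumed catalog key layout: table name, 0x00 separator, region start key.
    static byte[] catalogKey(String table, String startKey) {
        return (table + '\0' + startKey).getBytes(StandardCharsets.UTF_8);
    }

    // A toy catalog: two regions for t1, one region for t2.
    static TreeMap<byte[], String> sampleCatalog() {
        TreeMap<byte[], String> catalog = new TreeMap<>(BYTES);
        catalog.put(catalogKey("t1", ""), "server-a");
        catalog.put(catalogKey("t1", "k"), "server-b");
        catalog.put(catalogKey("t2", ""), "server-c");
        return catalog;
    }

    // The hosting region is the entry with the greatest key <= the sought key.
    static String locate(TreeMap<byte[], String> catalog, String table, String row) {
        Map.Entry<byte[], String> e = catalog.floorEntry(catalogKey(table, row));
        return e == null ? null : e.getValue();
    }
}
```

The client needs no per-table knowledge here: the same comparator
orders the catalog and the rows inside every table.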

If we let each table have its own comparator, then the client needs
to get metadata on each table: the comparator to use when looking for
rows inside the table, and also the comparator to use when searching
the catalog table.  The comparator used in the catalog table would be
a bit odd: it would be a compound comparator that first orders entries
by table and then, when ordering the region entries of a particular
table in meta, uses that table's comparator to order its rows.

A client doing lookups would need to be up-to-date on the metadata and
would also need to work the compound comparator to find host regions.
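A rough sketch of what such a compound comparator might look like
follows.  Everything in it is hypothetical (the CatalogKey class, the
per-table comparator registry, the table and server names); it only
illustrates ordering by table first and then delegating to that
table's own comparator:

```java
import java.util.Comparator;
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration only; HBase has no such per-table registry.
class CompoundCatalogSketch {
    static final class CatalogKey {
        final String table;
        final String startKey;
        CatalogKey(String table, String startKey) {
            this.table = table;
            this.startKey = startKey;
        }
    }

    // Order by table name first, then by that table's own row comparator.
    static Comparator<CatalogKey> compound(Map<String, Comparator<String>> perTable) {
        return (x, y) -> {
            int byTable = x.table.compareTo(y.table);
            if (byTable != 0) {
                return byTable;
            }
            Comparator<String> rowOrder =
                perTable.getOrDefault(x.table, Comparator.<String>naturalOrder());
            return rowOrder.compare(x.startKey, y.startKey);
        };
    }

    // Toy catalog: "fwd" sorts rows ascending, "rev" sorts them descending.
    static TreeMap<CatalogKey, String> sampleCatalog() {
        Map<String, Comparator<String>> perTable = new HashMap<>();
        perTable.put("fwd", Comparator.<String>naturalOrder());
        perTable.put("rev", Comparator.<String>reverseOrder());
        TreeMap<CatalogKey, String> catalog = new TreeMap<>(compound(perTable));
        catalog.put(new CatalogKey("fwd", "a"), "server-1");
        catalog.put(new CatalogKey("fwd", "n"), "server-2");
        catalog.put(new CatalogKey("rev", "z"), "server-3");
        catalog.put(new CatalogKey("rev", "m"), "server-4");
        return catalog;
    }

    // Floor lookup under the compound ordering; null if the table is unknown.
    static String locate(TreeMap<CatalogKey, String> catalog, String table, String row) {
        Map.Entry<CatalogKey, String> e = catalog.floorEntry(new CatalogKey(table, row));
        return (e == null || !e.getKey().table.equals(table)) ? null : e.getValue();
    }
}
```

Note that the same row key lands in a different region depending on
which table's comparator applies, so a client with stale comparator
metadata would go to the wrong place.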

I suppose the above could be done if it were wanted.

Now, if you add into the mix a comparator per column family: scanning
a table, as we do now, you'd open a scanner per column family, only
now each scanner has its own comparator.  Let's take the case where
one CF's comparator sorts lexicographically from smallest to largest
but the next column family's comparator sorts in reverse.  A scan
that is to return all the content of a particular row -- i.e. across
all CFs -- is going to seek itself into the ground (I'd imagine).
Which comparator would we use to order the rows in the table: the
smallest-to-largest comparator, the reverse-order comparator, or
something else?
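A toy illustration of the problem, under the assumption that a
whole-row scan advances the per-CF scanners in lockstep (the class and
method names are invented, and this is not how the HBase scanner is
actually written): it counts how often two per-CF iterators sit on the
same row key when stepped together.

```java
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;
import java.util.TreeSet;

// Hypothetical illustration only; not HBase scanner code.
class PerFamilyOrderSketch {
    // Steps two per-CF "scanners" together and counts the positions
    // at which both are on the same row key.
    static int alignedPositions(List<String> rows,
                                Comparator<String> cf1Order,
                                Comparator<String> cf2Order) {
        TreeSet<String> cf1 = new TreeSet<>(cf1Order);
        TreeSet<String> cf2 = new TreeSet<>(cf2Order);
        cf1.addAll(rows);
        cf2.addAll(rows);
        Iterator<String> a = cf1.iterator();
        Iterator<String> b = cf2.iterator();
        int aligned = 0;
        while (a.hasNext() && b.hasNext()) {
            if (a.next().equals(b.next())) {
                aligned++;
            }
        }
        return aligned;
    }
}
```

With rows a, b, c, d and the same order in both CFs, every position
lines up.  With one CF ascending and the other reversed, none do, so
assembling each row would mean a seek in at least one CF.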

On the sort of columns within a row, which seems to be what the
Cassandra citation is about (the partitioner determines row placement
on the cluster; whether that is a per-keyspace setting or a one-time
setting for the cluster, I can't tell): this looks like a feature we
might want to pick up, though it seems most of these orderings can be
achieved through appropriate schema design (and we already have
sorting by timestamp).

St.Ack
