Reading all rows in a column family in parallel

Brent N. Chun Thu, 08 Jul 2010 00:22:26 -0700

Hello,

I'm running Cassandra 0.6.0 on a cluster and have an application that needs to read all rows from a column family using the Cassandra Thrift API. Ideally, I'd like to be able to do this by having all nodes in the cluster read in parallel (i.e., each node reads a disjoint set of rows that are stored locally). I should also mention that I'm using the RandomPartitioner.


Here's what I was thinking:

1. Have one node invoke describe_ring to find the token range on the ring that each node is responsible for.

2. For each token range, have the node that owns that portion of the ring read the rows in that range using a sequence of get_range_slices calls (using start/end tokens, not keys).
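
To make step 2 concrete, here is a rough sketch of the paging loop I have in mind. It does not talk to Cassandra at all: RangeFetcher is a hypothetical stand-in for a get_range_slices call restricted to the token range (start, end], inMemory is a toy implementation used only to exercise the loop, and tokenFor assumes the RandomPartitioner token is the absolute value of the key's MD5 digest. The toy fetcher also ignores ranges that wrap around the ring.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class TokenRangeScanSketch {
    /** Hypothetical stand-in for get_range_slices over the token range (start, end]. */
    interface RangeFetcher {
        List<String> fetchKeys(BigInteger startToken, BigInteger endToken, int count);
    }

    // Assumption: RandomPartitioner-style token = abs(MD5(key) read as a signed integer).
    static BigInteger tokenFor(String key) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5")
                    .digest(key.getBytes(StandardCharsets.UTF_8));
            return new BigInteger(digest).abs();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    /** Reads every key in (start, end] in batches, advancing the start token each time. */
    static List<String> scanRange(RangeFetcher fetcher, BigInteger start,
                                  BigInteger end, int batchSize) {
        List<String> seen = new ArrayList<>();
        BigInteger cursor = start;
        while (true) {
            List<String> batch = fetcher.fetchKeys(cursor, end, batchSize);
            seen.addAll(batch);
            if (batch.size() < batchSize) {
                break; // a short batch means the range is exhausted
            }
            // The next slice starts just after the last row returned.
            cursor = tokenFor(batch.get(batch.size() - 1));
            if (cursor.equals(end)) {
                break; // avoid a start == end query, which could be read as the whole ring
            }
        }
        return seen;
    }

    /** Toy in-memory fetcher (non-wrapping ranges only), used to exercise the loop. */
    static RangeFetcher inMemory(List<String> keys) {
        List<String> sorted = new ArrayList<>(keys);
        sorted.sort(Comparator.comparing(TokenRangeScanSketch::tokenFor));
        return (start, end, count) -> {
            List<String> out = new ArrayList<>();
            for (String k : sorted) {
                BigInteger t = tokenFor(k);
                if (out.size() < count && t.compareTo(start) > 0 && t.compareTo(end) <= 0) {
                    out.add(k);
                }
            }
            return out;
        };
    }
}
```

Each node would run scanRange once per token range it owns, so the sets of rows read by different nodes should not overlap as long as the ranges themselves are disjoint.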

This type of functionality seems to already be there in the tree with the recent Cassandra/Hadoop integration.


...
KeyRange keyRange = new KeyRange(batchRowCount)
        .setStart_token(startToken)
        .setEnd_token(split.getEndToken());
try
{
    rows = client.get_range_slices(new ColumnParent(cfName),
           predicate,
           keyRange,
           ConsistencyLevel.ONE);
     ...

    // prepare for the next slice to be read
    KeySlice lastRow = rows.get(rows.size() - 1);
    IPartitioner p = DatabaseDescriptor.getPartitioner();
    byte[] rowkey = lastRow.getKey();
    startToken = p.getTokenFactory().toString(p.getToken(rowkey));
...

The above snippet from ColumnFamilyRecordReader.java seems to suggest it is possible to scan an entire column family by reading disjoint sets of rows using token-based range queries (as opposed to key-based range queries). Is this possible in 0.6.0? (Note: for the next startToken, I was just planning on computing the MD5 digest of the last key directly since I'm accessing Cassandra through Thrift.)
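
For the client-side token computation, this is roughly what I have in mind. It is only a sketch: TokenSketch and tokenFor are names I made up, and it assumes that RandomPartitioner derives the token by reading the key's MD5 digest as a signed big-endian integer, taking the absolute value, and printing it in decimal. It also takes the key as raw bytes, so if keys are strings the byte encoding would have to match whatever the server uses.

```java
import java.math.BigInteger;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

class TokenSketch {
    /**
     * Computes a RandomPartitioner-style token string for a row key.
     * Assumption: token = abs(MD5(key) interpreted as a signed big-endian integer),
     * rendered as a decimal string.
     */
    static String tokenFor(byte[] rowKey) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5").digest(rowKey);
            return new BigInteger(digest).abs().toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is a standard JDK algorithm, so this should not happen in practice.
            throw new IllegalStateException("MD5 not available", e);
        }
    }
}
```

The result of tokenFor(lastKey) would then be passed as the start token of the next get_range_slices call, in place of the IPartitioner lookup in the snippet above.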


Thoughts?

bnc
