Hello,

I'm running Cassandra 0.6.0 on a cluster and have an application that needs to read all rows from a column family using the Cassandra Thrift API. Ideally, I'd like all nodes in the cluster to read in parallel (i.e., each node reads the disjoint set of rows that it stores locally). I should also mention that I'm using the RandomPartitioner.

Here's what I was thinking:

1. Have one node invoke describe_ring to find the token range on the ring that each node is responsible for.

2. For each token range, have the node that owns that portion of the ring read the rows in that range using a sequence of get_range_slices calls (using start/end tokens, not keys).
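In case it helps make step 2 concrete, here's a rough sketch of the paging loop I have in mind. The `PageFetcher` interface is a hypothetical stand-in for the actual Thrift `get_range_slices` call (the name, the `(key, token)` pair representation, and the exclusive-start/inclusive-end token contract are my own assumptions for illustration, not the real API):

```java
import java.util.ArrayList;
import java.util.List;

public class TokenRangeScan {
    // Hypothetical stand-in for a Thrift get_range_slices call using
    // start/end tokens. Returns at most `count` rows, each as a
    // {key, token} pair, for tokens in (startToken, endToken].
    interface PageFetcher {
        List<String[]> fetch(String startToken, String endToken, int count)
                throws Exception;
    }

    // Page through one node's token range, restarting each batch at the
    // token of the last row seen (as the Hadoop record reader appears to do).
    static List<String> scanRange(PageFetcher fetcher, String startToken,
                                  String endToken, int batchRowCount)
            throws Exception {
        List<String> keys = new ArrayList<String>();
        String start = startToken;
        while (true) {
            List<String[]> rows = fetcher.fetch(start, endToken, batchRowCount);
            if (rows.isEmpty()) break;
            for (String[] row : rows) keys.add(row[0]);  // row = {key, token}
            if (rows.size() < batchRowCount) break;      // short page: range exhausted
            start = rows.get(rows.size() - 1)[1];        // token of the last key
        }
        return keys;
    }
}
```

With a start-exclusive token range, restarting at the last row's token shouldn't re-fetch that row, though rows whose keys hash to the same token could still need de-duplication.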

This type of functionality seems to already be there in the tree with the recent Cassandra/Hadoop integration.

...
KeyRange keyRange = new KeyRange(batchRowCount)
                    .setStart_token(startToken)
                    .setEnd_token(split.getEndToken());
try
{
    rows = client.get_range_slices(new ColumnParent(cfName),
                                   predicate,
                                   keyRange,
                                   ConsistencyLevel.ONE);
    ...

    // prepare for the next slice to be read
    KeySlice lastRow = rows.get(rows.size() - 1);
    IPartitioner p = DatabaseDescriptor.getPartitioner();
    byte[] rowkey = lastRow.getKey();
    startToken = p.getTokenFactory().toString(p.getToken(rowkey));
...

The above snippet from ColumnFamilyRecordReader.java seems to suggest it is possible to scan an entire column family by reading disjoint sets of rows using token-based range queries (as opposed to key-based range queries). Is this possible in 0.6.0? (Note: for the next startToken, I was planning to compute the MD5 digest of the last key directly, since I'm accessing Cassandra through Thrift and don't have the server-side IPartitioner available.)
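Computing that token client-side might look like the sketch below. I'm assuming the RandomPartitioner token is the MD5 digest of the key interpreted as a non-negative BigInteger (based on my reading of the 0.6 source; worth double-checking against FBUtilities.hash in your tree):

```java
import java.math.BigInteger;
import java.security.MessageDigest;

public class TokenOfKey {
    // Assumption: RandomPartitioner derives its token from the MD5 digest
    // of the row key, taken as a non-negative BigInteger. Verify against
    // the partitioner code in your Cassandra version before relying on this.
    static String tokenFor(byte[] key) throws Exception {
        MessageDigest md5 = MessageDigest.getInstance("MD5");
        return new BigInteger(md5.digest(key)).abs().toString();
    }

    public static void main(String[] args) throws Exception {
        // e.g., derive the next startToken from the last row key seen
        System.out.println(tokenFor("lastRowKey".getBytes("UTF-8")));
    }
}
```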

Thoughts?

bnc
