Hello,
I'm running Cassandra 0.6.0 on a cluster and have an application that
needs to read all rows from a column family using the Cassandra Thrift
API. Ideally, I'd like to be able to do this by having all nodes in the
cluster read in parallel (i.e., each node reads a disjoint set of rows
that are stored locally). I should also mention that I'm using the
RandomPartitioner.
Here's what I was thinking:
1. Have one node invoke describe_ring to find the token range on the
ring that each node is responsible for.
2. For each token range, have the node that owns that portion of the
ring read the rows in that range using a sequence of get_range_slices
calls (using start/end tokens, not keys).
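To make step 2 concrete, here is a rough sketch of the paging loop I have in mind. It does not talk to Cassandra: `RangeFetcher` is a hypothetical stand-in for a get_range_slices call over a token range, and I'm assuming the range is treated as (start, end] (start exclusive, end inclusive) and does not wrap around the ring. The in-memory fetcher and the toy token function exist only so the loop can be run on its own.

```java
import java.math.BigInteger;
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;
import java.util.function.Function;

final class TokenRangeScanSketch {
    /** Hypothetical stand-in for one get_range_slices call over (start, end]. */
    interface RangeFetcher {
        List<String> fetch(BigInteger startExclusive, BigInteger endInclusive, int maxRows);
    }

    /** Pages through (start, end], restarting each page at the token of the last key seen. */
    static List<String> scan(RangeFetcher fetcher, Function<String, BigInteger> tokenOf,
                             BigInteger start, BigInteger end, int batchSize) {
        List<String> all = new ArrayList<>();
        BigInteger cursor = start;
        while (true) {
            List<String> batch = fetcher.fetch(cursor, end, batchSize);
            all.addAll(batch);
            if (batch.size() < batchSize) {
                break; // a short (or empty) page means the range is exhausted
            }
            // The start token is exclusive, so the last row is not read twice.
            cursor = tokenOf.apply(batch.get(batch.size() - 1));
        }
        return all;
    }

    /** In-memory fetcher over rows sorted by token, for demonstration only. */
    static RangeFetcher inMemory(NavigableMap<BigInteger, String> rowsByToken) {
        return (start, end, maxRows) -> {
            List<String> out = new ArrayList<>();
            for (String key : rowsByToken.subMap(start, false, end, true).values()) {
                if (out.size() == maxRows) {
                    break;
                }
                out.add(key);
            }
            return out;
        };
    }

    /** Toy data: key "kN" has token N. Scans (2, 6] in pages of 2. */
    static List<String> demo() {
        NavigableMap<BigInteger, String> rows = new TreeMap<>();
        for (int i = 1; i <= 7; i++) {
            rows.put(BigInteger.valueOf(i), "k" + i);
        }
        return scan(inMemory(rows), k -> new BigInteger(k.substring(1)),
                    BigInteger.valueOf(2), BigInteger.valueOf(6), 2);
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

With one such loop per token range, each node would read only the rows in its own range, and the ranges together would cover the ring exactly once.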
This type of functionality seems to already exist in the source tree as part
of the recent Cassandra/Hadoop integration:
...
    KeyRange keyRange = new KeyRange(batchRowCount)
                        .setStart_token(startToken)
                        .setEnd_token(split.getEndToken());
    try
    {
        rows = client.get_range_slices(new ColumnParent(cfName),
                                       predicate,
                                       keyRange,
                                       ConsistencyLevel.ONE);
        ...
        // prepare for the next slice to be read
        KeySlice lastRow = rows.get(rows.size() - 1);
        IPartitioner p = DatabaseDescriptor.getPartitioner();
        byte[] rowkey = lastRow.getKey();
        startToken = p.getTokenFactory().toString(p.getToken(rowkey));
...
The above snippet from ColumnFamilyRecordReader.java seems to suggest that it
is possible to scan an entire column family by reading disjoint sets of
rows using token-based range queries (as opposed to key-based range
queries). Is this possible in 0.6.0? (Note: since I'm accessing Cassandra
through Thrift and don't have IPartitioner available on the client, I was
planning to compute the next startToken directly as the MD5 digest of the
last key returned.)
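For reference, this is roughly how I'd compute the token on the client side. It rests on my understanding of RandomPartitioner, which I'd want confirmed against the source for the version in use: the token is the MD5 digest of the key, read as a signed big-endian integer, made non-negative with abs(), and written in decimal. The class and method names are made up for illustration, and encoding the key as UTF-8 is also an assumption.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

final class Md5TokenSketch {
    /**
     * Returns the decimal token string for a row key, ASSUMING the
     * partitioner uses abs(MD5(key)) with the digest read as a signed
     * big-endian integer, and ASSUMING keys are encoded as UTF-8.
     */
    static String tokenFor(String rowKey) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] digest = md5.digest(rowKey.getBytes(StandardCharsets.UTF_8));
            // BigInteger(byte[]) is two's-complement, so the value may be
            // negative before abs() is applied.
            return new BigInteger(digest).abs().toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 is not available", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(tokenFor("some-row-key"));
    }
}
```

The resulting string would then be passed as the start token of the next get_range_slices call.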
Thoughts?
bnc