Ok, I get that, I'll have to find another way to sort out new rows.
Your description makes me think that if new rows are added during the
paging (i.e. between one select with token()'s and another), they might
show up in the query results, right? (because the hash of the new row
keys might fall sequentially after token(last_processed_row))
On 08/06/2013 08:18 AM, Richard Low wrote:
On 6 August 2013 15:12, Keith Freeman <[email protected]> wrote:
I've seen in several places the advice to use queries like this to
page through lots of rows:
select id from mytable where token(id) > token(last_id)
But it's hard to find detailed information about how this works
(at least that I can understand -- the description in the
Cassandra manual is pretty brief).
One thing I'd like to know is if new rows are always guaranteed to
have token(new_id) > token(ids-of-all-previous-rows)? E.g. if I
have one process that adds rows to a table, and another that
processes rows from the table, can the "processor" save the id of
the last row processed and when he wakes up use:
select * from mytable where token(id) > token(last_processed_id)
to process only new rows? Will this always work to get only new rows?
No, unfortunately not. The tokens are generated by the partitioner -
they are the hash of the row key. New tokens could be anywhere in the
range of tokens so you can't use token ordering to find new rows.
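To illustrate the point about hashing, here is a small Python sketch. It is not Cassandra code: the token() function below is a hypothetical stand-in that derives a signed 64-bit integer from an MD5 hash, and the "row-N" ids are made up. The real token values depend on which partitioner the cluster uses, but the effect on ordering is the same.

```python
import hashlib

def token(key: str) -> int:
    """Hypothetical stand-in for the partitioner: a signed 64-bit
    integer taken from the MD5 hash of the row key."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big", signed=True)

# Rows written before the processor last ran, and rows written afterwards.
old_ids = [f"row-{i}" for i in range(1000)]
new_ids = [f"row-{i}" for i in range(1000, 2000)]

last_processed_id = old_ids[-1]
cutoff = token(last_processed_id)

# What "WHERE token(id) > token(last_processed_id)" would match:
found_new = [k for k in new_ids if token(k) > cutoff]      # new rows returned
missed_new = [k for k in new_ids if token(k) <= cutoff]    # new rows skipped
repeated_old = [k for k in old_ids if token(k) > cutoff]   # old rows returned again

print(f"new rows returned:  {len(found_new)}")
print(f"new rows missed:    {len(missed_new)}")
print(f"old rows re-read:   {len(repeated_old)}")
```

Insertion order and token order are unrelated here, so the token comparison splits the new rows (and the old ones) more or less arbitrarily.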
The query you suggest works to page through all the data in your
column family. Rows will be returned regardless of when they were
added (as long as they were added before the query started). Finding
rows that have been added since a certain time is hard in Cassandra
since they are stored in token order. In general you have to read
through all the data and work out from e.g. a date field if they
should be treated as new.
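As a rough sketch of that "read everything and filter" approach, the Python below emulates the paging query against an in-memory list. Everything in it is an assumption for illustration: token() is a hypothetical MD5-based stand-in for the partitioner, fetch_page() mimics a token-range SELECT with a LIMIT rather than calling a real driver, and the "created" date field is a made-up column name.

```python
import hashlib
from datetime import datetime, timedelta

def token(key: str) -> int:
    """Hypothetical stand-in for the partitioner (signed 64-bit from MD5)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big", signed=True)

def fetch_page(table, last_id, page_size):
    """Emulates: SELECT * FROM mytable WHERE token(id) > token(last_id)
    LIMIT page_size (the first page has no WHERE clause)."""
    rows = sorted(table, key=lambda r: token(r["id"]))
    if last_id is not None:
        floor = token(last_id)
        rows = [r for r in rows if token(r["id"]) > floor]
    return rows[:page_size]

def rows_added_since(table, since, page_size=100):
    """Page through the whole table in token order and keep only the
    rows whose (assumed) 'created' field is later than 'since'."""
    new_rows, last_id = [], None
    while True:
        page = fetch_page(table, last_id, page_size)
        if not page:
            return new_rows
        new_rows.extend(r for r in page if r["created"] > since)
        last_id = page[-1]["id"]  # resume after the last row of this page

# Made-up data: 500 rows, one written per hour.
start = datetime(2013, 8, 1)
table = [{"id": f"row-{i}", "created": start + timedelta(hours=i)}
         for i in range(500)]

cutoff = start + timedelta(hours=399)
recent = rows_added_since(table, cutoff)
print(len(recent), "rows created after", cutoff)
```

Note that every row is still read on each pass; the date field only decides which rows get processed.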
Richard.