Hi there,
We run a 3-node Cassandra 0.7.8 cluster with replication factor 3 for all
keyspaces.
We store external->internal key mappings in a column family with one row
per customer. The largest row contains about 200k columns.
When we import external data we load the whole row and map external keys
to internal ones. Loading is done like this:
SliceQuery<String, Key, Mapping> q = createSliceQuery(
        keyspace,
        getNewStringSerializer(),
        KeySerializer.get(),
        MappingSerializer.get());
q.setColumnFamily(CF_MAPPING);
q.setKey(key);

final int chunkSize = 1000;
Key start = null;
do {
    // Fetch the next chunk, starting at the last column name we saw.
    // (The slice start is inclusive, so each subsequent chunk returns
    // the previous chunk's last column again as its first column.)
    q.setRange(start, null, false, chunkSize);
    QueryResult<ColumnSlice<Key, Mapping>> r = q.execute();
    final List<HColumn<Key, Mapping>> columns = r.get().getColumns();
    for (final HColumn<Key, Mapping> c : columns) {
        ... (add to list)
    }
    // A full chunk may mean there are more columns, so continue from its
    // last column name; a short chunk means we reached the end of the row.
    if (columns.size() == chunkSize) {
        start = columns.get(columns.size() - 1).getName();
    } else {
        start = null;
    }
} while (start != null);
The code ran fine for several months. A few days ago the code above
started returning far fewer columns than expected (e.g. 1010 instead of
198k, or 14k instead of 44k).
Is there something wrong with the code?
As a result, we created and stored new mappings, and now everything is
fine again.
We realized that we had trouble with one node right before this
behaviour appeared, so we think that's the cause.
The node went down because of an OOM, and during restart another OOM
killed the node again. One or two OOMs later the node started without
any trouble and all seemed fine. A few hours later the next import
process ran, and that's when we could no longer read all the expected
data.
Since this happened two days ago, at least a minor compaction has taken
place in the meantime, so all sstables written after the node crash have
been merged.
Is this a known issue, or can somebody imagine what the cause might be?
If we are lucky we have a backup from after the crash and before the
"repair", but if not I am out of ideas for figuring out what happened.
So any idea about how to dig deeper into this is very welcome.
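One thing I thought of myself is to compare per-replica column counts for
the affected row. This is only a rough, untested sketch: the host names,
the "diag-" cluster names and "MyKeyspace" are placeholders, while
CF_MAPPING, key, Key and KeySerializer are from the code above.

import me.prettyprint.cassandra.model.ConfigurableConsistencyLevel;
import me.prettyprint.cassandra.serializers.StringSerializer;
import me.prettyprint.hector.api.Cluster;
import me.prettyprint.hector.api.HConsistencyLevel;
import me.prettyprint.hector.api.Keyspace;
import me.prettyprint.hector.api.factory.HFactory;
import me.prettyprint.hector.api.query.CountQuery;

// Read at CL.ONE so each count reflects (roughly) one replica's view.
ConfigurableConsistencyLevel one = new ConfigurableConsistencyLevel();
one.setDefaultReadConsistencyLevel(HConsistencyLevel.ONE);

for (String host : new String[] { "node1:9160", "node2:9160", "node3:9160" }) {
    // Separate Cluster per host so each count is coordinated by that node.
    Cluster c = HFactory.getOrCreateCluster("diag-" + host, host);
    Keyspace ks = HFactory.createKeyspace("MyKeyspace", c, one);
    CountQuery<String, Key> cq = HFactory.createCountQuery(
            ks, StringSerializer.get(), KeySerializer.get());
    cq.setColumnFamily(CF_MAPPING);
    cq.setKey(key);
    cq.setRange(null, null, Integer.MAX_VALUE);
    System.out.println(host + ": " + cq.execute().get() + " columns");
}

At CL.ONE the coordinator picks the closest replica rather than
necessarily itself, so this doesn't strictly pin each read to one host,
but with RF=3 on 3 nodes the local replica should usually be chosen, and
diverging counts would at least confirm that the replicas disagree.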
Best,
Thomas