Hi there,

We run a 3-node cluster on 0.7.8 with replication factor 3 for all keyspaces.

We store external->internal key mappings in a column family with one row per customer. The largest row contains about 200k columns. When we import external data, we load the whole row and map external keys to internal keys. Loading is done like this:

SliceQuery<String, Key, Mapping> q = createSliceQuery(
    keyspace,
    getNewStringSerializer(),
    KeySerializer.get(),
    MappingSerializer.get());
q.setColumnFamily(CF_MAPPING);
q.setKey(key);
final int chunkSize = 1000;
Key start = null;
do {
    q.setRange(start, null, false, chunkSize);
    QueryResult<ColumnSlice<Key, Mapping>> r = q.execute();
    final List<HColumn<Key, Mapping>> columns = r.get().getColumns();
    for (final HColumn<Key, Mapping> c : columns) {
        ... (add to list)
    }
    if (columns.size() == chunkSize) {
        start = columns.get(columns.size() - 1).getName();
    } else {
        start = null;
    }
} while (start != null);
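
One detail of this loop that may be worth ruling out: the start column of a slice range is inclusive, so every page after the first begins with the last column of the previous page. That would produce duplicates in the list rather than missing columns, so it probably does not explain the problem, but it is easy to check in isolation. Below is a minimal, self-contained sketch of the same paging pattern. It does not use Hector at all: the TreeMap row, slice(), sampleRow() and readAllColumnNames() are hypothetical stand-ins that only assume an inclusive start and a column limit.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

public class PagingSketch {

    // Hypothetical stand-in for one slice query: returns up to `count` column
    // names in sorted order, beginning at `start` INCLUSIVE (null = row start).
    static List<String> slice(NavigableMap<String, String> row, String start, int count) {
        NavigableMap<String, String> tail = (start == null) ? row : row.tailMap(start, true);
        List<String> page = new ArrayList<>();
        for (String name : tail.keySet()) {
            if (page.size() == count) {
                break;
            }
            page.add(name);
        }
        return page;
    }

    // Pages through the whole row and drops the column that overlaps with the
    // previous page, so each column name ends up in the result exactly once.
    static List<String> readAllColumnNames(NavigableMap<String, String> row, int chunkSize) {
        if (chunkSize < 2) {
            // With chunkSize 1 the overlapping column would fill every page.
            throw new IllegalArgumentException("chunkSize must be at least 2");
        }
        List<String> result = new ArrayList<>();
        String start = null;
        while (true) {
            List<String> page = slice(row, start, chunkSize);
            int from = (start == null) ? 0 : 1; // skip the repeated first column
            result.addAll(page.subList(Math.min(from, page.size()), page.size()));
            if (page.size() < chunkSize) {
                break; // short page: end of row reached
            }
            start = page.get(page.size() - 1);
        }
        return result;
    }

    // Builds an in-memory "row" with n sorted external->internal mappings.
    static NavigableMap<String, String> sampleRow(int n) {
        NavigableMap<String, String> row = new TreeMap<>();
        for (int i = 0; i < n; i++) {
            row.put(String.format("ext-%06d", i), "int-" + i);
        }
        return row;
    }

    public static void main(String[] args) {
        List<String> names = readAllColumnNames(sampleRow(2500), 1000);
        System.out.println("columns read: " + names.size());
    }
}
```

Comparing the size of the result with the number of distinct names in it shows whether a paging loop double-counts the page boundaries.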

The code ran fine for several months. A few days ago it returned far fewer columns than expected (e.g. 1010 instead of 198k, or 14k instead of 44k). Because the existing mappings were not found, we created and stored new mappings, and now everything is fine again.
Is there something wrong with the code?

We realized that we had trouble with one node right before this behaviour started, so we think that is the cause.

The node went down because of OOM, and during restart another OOM killed the node again. One or two OOMs later the node started without any trouble and all seemed fine. Some hours later the next import process ran and then we could not read all the expected data.

Since this happened two days ago, at least one minor compaction has taken place in the meantime, so all sstables written after the node crash have been merged.

Is this a known issue, or can somebody imagine what the cause might be? If we are lucky we have a backup taken after the crash and before the "repair", but if not I have no ideas left for figuring out what happened.

So any idea about how to dig deeper into this is very welcome.

Best,

Thomas
