Node OOM, Slice query - missing data?

Thomas Richter Wed, 02 Nov 2011 16:13:30 -0700

Hi there,

We run a 3-node cluster on 0.7.8 with a replication factor of 3 for all keyspaces.

We store external->internal key mappings in a column family with one row for each customer. The largest row contains about 200k columns. When we import external data, we load the whole row and map external to internal keys. Loading is done like this:


SliceQuery<String, Key, Mapping> q =
createSliceQuery(
                keyspace,
                getNewStringSerializer(),
                KeySerializer.get(),
                MappingSerializer.get());
q.setColumnFamily(CF_MAPPING);
q.setKey(key);
final int chunkSize = 1000;
Key start = null;
do {
        q.setRange(start, null, false, chunkSize);
        QueryResult<ColumnSlice<Key, Mapping>> r = q.execute();
        final List<HColumn<Key, Mapping>> columns = r.get().getColumns();
        for (final HColumn<Key, Mapping> c : columns) {
                ... (add to list)
        }
        if (columns.size() == chunkSize) {
                start = columns.get(columns.size() - 1).getName();
        } else {
                start = null;
        }
} while (start != null);
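
For illustration only, the paging pattern of the loop above can be sketched against an in-memory sorted map. This is not the Hector API: PagingSketch, slice, readAll and sampleRow are hypothetical names, and the sketch assumes the range start is inclusive (as it is for Thrift slice ranges), so each page after the first begins with the last column of the previous page.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

public class PagingSketch {

    // Hypothetical stand-in for one slice call: returns up to `count` column
    // names >= start. The start is inclusive; null means "from the beginning".
    static List<Integer> slice(NavigableMap<Integer, String> row, Integer start, int count) {
        List<Integer> page = new ArrayList<>();
        for (Integer name : (start == null ? row : row.tailMap(start, true)).keySet()) {
            if (page.size() == count) {
                break;
            }
            page.add(name);
        }
        return page;
    }

    // Same loop shape as the original, but it drops the first entry of every
    // page after the first, because that entry repeats the previous page's
    // last column. chunkSize must be at least 2, otherwise a full page never
    // advances past its own start column.
    static List<Integer> readAll(NavigableMap<Integer, String> row, int chunkSize) {
        List<Integer> all = new ArrayList<>();
        Integer start = null;
        do {
            List<Integer> page = slice(row, start, chunkSize);
            int from = (start == null) ? 0 : 1;
            all.addAll(page.subList(Math.min(from, page.size()), page.size()));
            // A full page means there may be more columns; a short page ends the scan.
            start = (page.size() == chunkSize) ? page.get(page.size() - 1) : null;
        } while (start != null);
        return all;
    }

    // Builds a test row with n columns named 0..n-1.
    static NavigableMap<Integer, String> sampleRow(int n) {
        NavigableMap<Integer, String> row = new TreeMap<>();
        for (int i = 0; i < n; i++) {
            row.put(i, "v" + i);
        }
        return row;
    }

    public static void main(String[] args) {
        System.out.println(readAll(sampleRow(2500), 1000).size());
    }
}
```

In this sketch a short page always ends the scan, so the total is only complete if every page before the last one comes back full.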

The code ran fine for several months. Some days ago the SliceQuery loop above returned far fewer columns than expected (e.g. 1010 instead of 198k, or 14k instead of 44k).

Is there something wrong with the code?

As a result, we created and stored new mappings, and now everything is fine again.

We realized that we had trouble with one node right before that behaviour, so we think that's the cause.

The node went down because of an OOM, and during restart another OOM killed the node again. One or two OOMs later the node started without any trouble and all seemed fine. Some hours later the next import process ran, and then we could not read all the expected data.

As this happened two days ago, at least one minor compaction has taken place, so all sstables written after the node crash have been merged.

Is this a known issue, or can somebody imagine what the cause is? If we are lucky we have a backup taken after the crash and before the "repair", but if not I don't have any ideas left on how to figure out what happened.


So any idea about how to dig deeper into this is very welcome.

Best,

Thomas
