We are confident that we are doing everything right in both cases (no
bugs), yet the results are baffling. Tests in smaller, single-node
environments result in consistent counts between the two methods, but
those environments don't have the same amount of data or the same topology.
Hi--
We are iterating rows in a column family two different ways and are
seeing radically different row counts. We are using 1.0.8 and
RandomPartitioner on a 3-node cluster.
In the first case, we have a trivial Hadoop job that counts 29M rows
using the standard MR counting pattern.
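The counting pattern referred to above can be sketched roughly as follows. This is illustrative only: the actual job presumably reads Cassandra via ColumnFamilyInputFormat in Java, while this sketch shows the same map/reduce shape in Hadoop Streaming style with Python (the function names and sample input are assumptions, not the poster's code):

```python
# Sketch of the standard MR counting pattern: the mapper emits one
# ("rows", 1) pair per input row, and the reducer sums the counts.
# The real job would read Cassandra rows via ColumnFamilyInputFormat;
# here each input line stands in for one row.

def mapper(rows):
    """Emit a single counter key-value pair per row seen."""
    for _row in rows:
        yield ("rows", 1)

def reducer(pairs):
    """Sum the per-row counts into one total."""
    total = 0
    for _key, count in pairs:
        total += count
    return total

if __name__ == "__main__":
    # Tiny in-process demo: three stand-in rows.
    sample = ["row1", "row2", "row3"]
    print(reducer(mapper(sample)))
```

The point of the pattern is that the final count only reflects whatever rows the input format actually hands to the mapper, which is why input-side filtering (or the lack of it, as with tombstones) changes the total.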
Are there any deletions in your data? The Hadoop support doesn't filter
out tombstones, though you may not be filtering them out in your code
either. I've used the Hadoop support for a lot of data validation in the
past, and as long as you're sure that the code is sound, I'm pretty