I have a MapReduce job that reads rfiles as Accumulo key/value pairs using a FileSKVIterator inside a RecordReader, partitions and shuffles them based on the byte string of the key, and writes them out as new rfiles using AccumuloFileOutputFormat. The objective is to create larger rfiles for bulk ingesting and to minimize the number of tservers each rfile is assigned to after it is bulk ingested.
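For reference, here is a simplified sketch of the kind of RecordReader I'm describing (the class name is just a placeholder, and the FileOperations call that opens and seeks the rfile is omitted because it varies by Accumulo version):

```java
import java.io.IOException;

import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.file.FileSKVIterator;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;

public class RFileRecordReader extends RecordReader<Key, Value> {

  private FileSKVIterator reader; // opened and seeked in initialize(), omitted here
  private Key currentKey;
  private Value currentValue;

  @Override
  public void initialize(InputSplit split, TaskAttemptContext context) {
    // Open the rfile for the split and seek over the whole range.
    // The exact FileOperations/reader-builder call is version specific, so it is left out.
  }

  @Override
  public boolean nextKeyValue() throws IOException {
    if (reader == null || !reader.hasTop()) {
      return false;
    }
    // Copy the top entry before advancing the iterator.
    currentKey = new Key(reader.getTopKey());
    currentValue = new Value(reader.getTopValue());
    reader.next();
    return true;
  }

  @Override
  public Key getCurrentKey() {
    return currentKey;
  }

  @Override
  public Value getCurrentValue() {
    return currentValue;
  }

  @Override
  public float getProgress() {
    return 0f; // progress reporting omitted for brevity
  }

  @Override
  public void close() throws IOException {
    if (reader != null) {
      reader.close();
    }
  }
}
```

The reducer side just writes the shuffled pairs out through AccumuloFileOutputFormat, keyed by Key and Value.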
For tables with a simple schema this works fine, but for tables with a complex schema the new rfiles cause the tservers to throw a NullPointerException during compaction. Is there more to an rfile than the key/value pairs that I am missing? If I compute an order-independent checksum of the bytes of the key/value pairs in the original rfiles and in the new rfiles, shouldn't the two checksums be the same?
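This is roughly how I compute that order-independent checksum (a sketch; the class name is made up, and it assumes hashing each entry's key fields and value bytes with SHA-256, then combining the per-entry digests with addition so entry order doesn't matter):

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Map;

import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Value;

public class OrderIndependentChecksum {

  public static BigInteger checksum(Iterable<Map.Entry<Key, Value>> entries)
      throws NoSuchAlgorithmException {
    BigInteger total = BigInteger.ZERO;
    for (Map.Entry<Key, Value> e : entries) {
      MessageDigest md = MessageDigest.getInstance("SHA-256");
      Key k = e.getKey();
      // Hash every component of the key plus the value bytes.
      md.update(k.getRow().getBytes(), 0, k.getRow().getLength());
      md.update(k.getColumnFamily().getBytes(), 0, k.getColumnFamily().getLength());
      md.update(k.getColumnQualifier().getBytes(), 0, k.getColumnQualifier().getLength());
      md.update(k.getColumnVisibility().getBytes(), 0, k.getColumnVisibility().getLength());
      md.update(Long.toString(k.getTimestamp()).getBytes(StandardCharsets.UTF_8));
      md.update(e.getValue().get());
      // Addition is commutative, so the result is independent of entry order.
      total = total.add(new BigInteger(1, md.digest()));
    }
    return total;
  }
}
```

Running this over the entries of the original rfiles and the new rfiles produces matching totals, which is why I expected the two sets of files to be equivalent.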