For comparison, I posted this some time ago: http://tinyurl.com/k28bkbg
I was surprised that RFile was smaller than a gzip'd CSV file, too.

On Tue, Oct 29, 2013 at 6:35 PM, Keith Turner <[email protected]> wrote:
>
> On Tue, Oct 29, 2013 at 5:50 PM, Slater, David M. <[email protected]> wrote:
>>
>> Hello,
>>
>> I’m seeing about an order of magnitude difference between the number of
>> bytes returned by mutation.numBytes() and the size of the rfiles on disk
>> (Accumulo 1.4.2). Note that all of my mutations are new entries, and there
>> are no combiners running.
>>
>> While I understand that there is some compression on the rfile, I would be
>> really surprised if it was 10:1.
>>
>> My entries are composed of a row ID (most of which is equivalent to the
>> previous row ID), an empty column family, a nonempty column qualifier (which
>> likely shares a lot with the previous qualifier), and an empty value. An
>> example of the rowID and column qualifier might be:
>
> In 1.4, if a field (row, col fam, etc.) in a key is the same as in the
> previous key, then it's not written again. So if the row is the same in 10
> consecutive keys, it's only written once. Maybe this explains the
> difference. Scan the table to make sure all of the data you expect to be
> there is there.
>
>> (forward table)
>>
>> 0000000000000|9|fa19 IP|127.000.000.001
>> 0000000000000|9|fa19 PORT|00080
>> …
>> 0000000000000|9|fa22 IP|128.032.144.139
>> …
>>
>> <timeblock>|<hash>|<uid> <index>|<textual value>
>>
>> OR
>>
>> (reverse table)
>>
>> 0000000000000|IP|127.000.000.001 fa19
>> 0000000000000|IP|127.000.000.001 fd02
>> 0000000000000|IP|127.000.000.002 123
>> …
>> 0000000000000|PORT|00080 fa19
>>
>> The numBytes() method appears to return a number of bytes equal to the
>> string length of the row ID and column qualifiers, plus 26 * # of column
>> qualifiers.
>>
>> Is there something else that I’m missing, or would this possibly compress
>> by that much?
>>
>> Thanks,
>>
>> David
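
To make the two effects discussed above concrete, here is a rough Python sketch using the forward-table sample keys. It is a toy model, not Accumulo's actual RFile encoding: the one flag byte per key in `elided_size` is an assumption for illustration, and `observed_num_bytes` only encodes the formula David observed empirically (row length plus qualifier lengths plus 26 per qualifier), not the documented behavior of numBytes(). All function names are hypothetical.

```python
OBSERVED_OVERHEAD = 26  # per-qualifier overhead reported in the thread (empirical)


def raw_size(entries):
    """Bytes needed if every field of every key is written in full."""
    return sum(len(field) for key in entries for field in key)


def elided_size(entries):
    """Bytes needed if a field equal to the same field of the previous key is skipped.

    Toy model: charges one hypothetical flag byte per key to record which
    fields were skipped.
    """
    total = 0
    prev = None
    for key in entries:
        total += 1  # assumed per-key flag byte, not the real on-disk layout
        for i, field in enumerate(key):
            if prev is None or prev[i] != field:
                total += len(field)
        prev = key
    return total


def observed_num_bytes(row, qualifiers):
    """David's observed estimate for one mutation on `row` with these qualifiers."""
    return len(row) + sum(len(q) + OBSERVED_OVERHEAD for q in qualifiers)


# Keys as (row, column family, column qualifier, value), from the forward table.
entries = [
    (b"0000000000000|9|fa19", b"", b"IP|127.000.000.001", b""),
    (b"0000000000000|9|fa19", b"", b"PORT|00080", b""),
    (b"0000000000000|9|fa22", b"", b"IP|128.032.144.139", b""),
]

print("raw key bytes:        ", raw_size(entries))
print("with repeats elided:  ", elided_size(entries))
print("observed numBytes est:",
      observed_num_bytes(entries[0][0], [entries[0][2], entries[1][2]]))
```

With keys this short, the fixed per-entry overhead in the numBytes() estimate is comparable to the key data itself, and skipping repeated rows shrinks the on-disk side further before any block compression is applied. Whether that adds up to 10:1 on real data would still need to be checked against an actual table.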
