Comparing the rfiles with compressed CSV files, the results do make sense now.
Thanks,
David

-----Original Message-----
From: Eric Newton [mailto:[email protected]]
Sent: Tuesday, October 29, 2013 11:05 PM
To: [email protected]
Subject: Re: sum of mutation.numBytes() significantly different from rfile size

For comparison, I posted this some time ago:

http://tinyurl.com/k28bkbg

I was surprised that RFile was smaller than a gzip'd CSV file, too.

On Tue, Oct 29, 2013 at 6:35 PM, Keith Turner <[email protected]> wrote:
>
> On Tue, Oct 29, 2013 at 5:50 PM, Slater, David M. <[email protected]> wrote:
>>
>> Hello,
>>
>> I'm seeing about an order of magnitude difference between the number
>> of bytes returned by mutation.numBytes() and the size of the rfiles
>> on disk (Accumulo 1.4.2). Note that all of my mutations are new
>> entries, and there are no combiners running.
>>
>> While I understand that there is some compression on the rfile, I
>> would be really surprised if it was 10:1.
>>
>> My entries are composed of a row ID (most of which is equivalent to
>> the previous row ID), an empty column family, a nonempty column
>> qualifier (which likely shares a lot with the previous qualifier),
>> and an empty value. An example of the rowID and column qualifier might be:
>
> In 1.4 if a field (row, col fam, etc) in key is the same as the
> previous, then its not written again. So if the row is the same in 10
> consecutive keys, its only written once. Maybe this explains the
> difference. Scan the table to make sure all of the data you expect to
> be there is there.
>
>>
>> (forward table)
>>
>> 0000000000000|9|fa19 IP|127.000.000.001
>> 0000000000000|9|fa19 PORT|00080
>> ...
>> 0000000000000|9|fa22 IP|128.032.144.139
>> ...
>>
>> <timeblock>|<hash>|<uid> <index>|<textual value>
>>
>> OR
>>
>> (reverse table)
>>
>> 0000000000000|IP|127.000.000.001 fa19
>> 0000000000000|IP|127.000.000.001 fd02
>> 0000000000000|IP|127.000.000.002 123
>> ...
>> 0000000000000|PORT|00080 fa19
>>
>> The numBytes() method appears to return a number of bytes equal to
>> the string length of the row ID and column qualifiers, plus 26 * # of
>> column qualifiers.
>>
>> Is there something else that I'm missing, or would this possibly
>> compress by that much?
>>
>> Thanks,
>>
>> David
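
To put rough numbers on the two observations in this thread, here is a minimal Python sketch. It is only a model under stated assumptions, not Accumulo code: `estimate_num_bytes` applies the formula David describes (row ID length plus qualifier lengths plus 26 bytes per column qualifier), and `bytes_after_field_elision` applies Keith's description that a key field identical to the same field in the previous key is not written again. Both function names are hypothetical, and the model ignores timestamps, per-key flags, index data, and block compression, so the real rfile size will differ.

```python
# Rough model only: neither function reproduces Accumulo's actual implementation.
OVERHEAD_PER_COLUMN = 26  # per-column-qualifier overhead reported in the thread


def estimate_num_bytes(row, qualifiers):
    """Approximate mutation.numBytes() with the formula observed in the thread."""
    size = len(row.encode("utf-8"))
    for cq in qualifiers:
        size += len(cq.encode("utf-8")) + OVERHEAD_PER_COLUMN
    return size


def bytes_after_field_elision(keys):
    """Sum key field sizes, skipping any field equal to the same field
    in the previous key. Each key is a (row, column family, qualifier) tuple."""
    total = 0
    previous = (None, None, None)
    for key in keys:
        for field, prev_field in zip(key, previous):
            if field != prev_field:
                total += len(field.encode("utf-8"))
        previous = key
    return total


if __name__ == "__main__":
    # Sample entries taken from the forward table example above.
    row = "0000000000000|9|fa19"
    qualifiers = ["IP|127.000.000.001", "PORT|00080"]
    keys = [(row, "", cq) for cq in qualifiers]
    print("numBytes estimate:", estimate_num_bytes(row, qualifiers))
    print("bytes with repeated fields skipped:", bytes_after_field_elision(keys))
```

Even before any gzip-style block compression, the second figure comes out well below the first for entries that share a row ID, which is consistent with the gap reported above.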
