On Mon, Jan 12, 2015 at 4:10 PM, Josh Elser <josh.el...@gmail.com> wrote:
> seek()'ing doesn't always imply an increase in performance -- remember that
> RFiles (the files that back Accumulo tables) are composed of multiple
> blocks/sections with an index of them. A seek consists of using that
> index to find the block/section of the RFile and then a linear scan forward
> to find the first key for the range you seek()'ed to.
>
> Thus, if you're repeatedly re-seek()'ing within the same block, you'll waste
> a lot of time re-reading the same data. In your situation, it sounds like the
> cost of re-reading the data after a seek is about the same as naively
> consuming the records.
>
> You can try altering table.file.compress.blocksize (and then compacting your
> table) to see how this changes.
>
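
For anyone following along, the seek cost described in the quoted text (index lookup, then a linear scan from the start of the block) can be sketched with a toy model. This is not Accumulo's RFile code: `ToyBlockFile`, `seek`, and `entriesRead` are names made up for illustration, keys are plain ints rather than Accumulo Keys, and the model deliberately restarts every seek at the block boundary.

```java
import java.util.Arrays;

/** Toy model of a block-indexed sorted file. Hypothetical; NOT Accumulo's RFile. */
class ToyBlockFile {
    private final int[] keys;        // sorted keys, laid out block after block
    private final int[] blockFirst;  // the "index": first key of each block
    private final int blockSize;     // entries per block
    long entriesRead = 0;            // entries skipped over by linear scans

    ToyBlockFile(int numKeys, int blockSize) {
        this.blockSize = blockSize;
        keys = new int[numKeys];
        for (int i = 0; i < numKeys; i++) keys[i] = i * 2;  // even keys 0, 2, 4, ...
        int numBlocks = (numKeys + blockSize - 1) / blockSize;
        blockFirst = new int[numBlocks];
        for (int b = 0; b < numBlocks; b++) blockFirst[b] = keys[b * blockSize];
    }

    /** Returns the position of the first key >= target, or the key count if none. */
    int seek(int target) {
        // Step 1: use the index to find the last block whose first key <= target.
        int b = Arrays.binarySearch(blockFirst, target);
        if (b < 0) b = Math.max(0, -b - 2);  // not an exact hit: take the preceding block
        // Step 2: scan forward linearly from the start of that block.
        int pos = b * blockSize;
        while (pos < keys.length && keys[pos] < target) {
            entriesRead++;
            pos++;
        }
        return pos;
    }

    public static void main(String[] args) {
        ToyBlockFile f = new ToyBlockFile(1000, 100);
        f.seek(250);
        System.out.println("entries read after first seek:  " + f.entriesRead);
        f.seek(252);  // same block, one key further along
        System.out.println("entries read after second seek: " + f.entriesRead);
    }
}
```

In this naive model the second seek lands one key past the first, yet it re-scans all 25 entries the first seek already skipped. A smaller block size shrinks that scan, which is the effect that changing table.file.compress.blocksize would probe.
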

There is actually some fairly well-optimized code in the RFile seek that minimizes the re-reading of RFile data and index blocks. Seeking forward by one key adds a couple of key comparisons and function calls, but that's about it. Incidentally, key comparisons are pretty high up on my list of things that could use some performance optimization.

Adam