Definitely. This is pretty cool. Thanks for sharing.

I'll have to try to make some time to poke around with this :)

William Slacum wrote:
Thanks, Jonathan! I've wondered about specific numbers on this topic
when dealing with geohashes, so this is a very useful tool.

On Sun, Oct 25, 2015 at 11:22 AM, Jonathan Wonders <[email protected]> wrote:

    I have been able to put some more thought into this over the weekend
    and make some initial observations on tables I currently have
    populated.  Looking at the rfile-info output for a few different
    tables, I noticed that one which has particularly small
    lexicographical deltas between keys costs an average of ~2.5 bits per
    key to store on disk.  All of the data is stored in the row component
    of the key, and a full row is typically about 36 bytes.  I wrote a
    little utility to recreate ScanResult objects for batches of
    sequential key-value pairs returned from a scanner and then used the
    TCompactProtocol to write the ScanResult to a byte array.  Each
    key-value pair costs roughly 48 bytes, which makes sense given that
    every row is different and there will be some space required for the
    timestamps, visibilities, and other bookkeeping info.
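
    For reference, the measurement boils down to something like the
    sketch below (assuming the 1.6.x/1.7.x Thrift-generated classes under
    org.apache.accumulo.core.data.thrift and libthrift's TSerializer; the
    class and method names here are placeholders of mine, not part of the
    tool):

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map.Entry;

    import org.apache.accumulo.core.client.Scanner;
    import org.apache.accumulo.core.data.Key;
    import org.apache.accumulo.core.data.KeyValue;
    import org.apache.accumulo.core.data.Value;
    import org.apache.accumulo.core.data.thrift.ScanResult;
    import org.apache.accumulo.core.data.thrift.TKeyValue;
    import org.apache.thrift.TException;
    import org.apache.thrift.TSerializer;
    import org.apache.thrift.protocol.TCompactProtocol;

    public class ScanResultSizer {

      // Serialize one batch of key-value pairs the way the tablet server would
      // and return the number of bytes it occupies on the wire.
      static int serializedSize(List<KeyValue> batch) throws TException {
        // Key.compress applies the default relative-key encoding: it only drops
        // key fields that are byte-for-byte equal to the previous key's fields.
        List<TKeyValue> tkvs = Key.compress(batch);
        ScanResult result = new ScanResult(tkvs, false /* more */);
        TSerializer serializer = new TSerializer(new TCompactProtocol.Factory());
        return serializer.serialize(result).length;
      }

      // Drain a scanner in fixed-size batches and report average bytes per entry.
      static void report(Scanner scanner, int batchSize) throws TException {
        List<KeyValue> batch = new ArrayList<>();
        long bytes = 0, entries = 0;
        for (Entry<Key,Value> entry : scanner) {
          batch.add(new KeyValue(entry.getKey(), entry.getValue().get()));
          if (batch.size() == batchSize) {
            bytes += serializedSize(batch);
            entries += batch.size();
            batch.clear();
          }
        }
        if (!batch.isEmpty()) {
          bytes += serializedSize(batch);
          entries += batch.size();
        }
        System.out.printf("average bytes per key-value over the wire: %.2f%n",
            entries == 0 ? 0.0 : (double) bytes / entries);
      }
    }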

    Another table I looked at has larger lexicographical deltas between
    keys and costs roughly 5 bytes per key to store on disk.  This table
    is a reverse index with very large rows; each column within a row
    identifies data that resides in another table.  Each column is
    roughly 12 bytes uncompressed.  When encoded in a ScanResult, each
    key-value pair costs roughly 25 bytes, which makes sense since the
    row cost should be negligible for large batch sizes and the overhead
    from timestamp, visibility, and other bookkeeping info is roughly the
    same as for the other table.

    Since compression depends heavily on both table design and the actual
    data, it seemed the next logical step would be to create a tool that
    the community could use to easily measure the compression ratio for
    ScanResult objects.  So, I threw together a shell extension to wrap
    the utility that I previously described.  It measures the compression
    ratio for the default strategy (Key.compress) as well as a few other
    simple strategies that seemed reasonable to test.  The usage is
    almost the same as the scan command; it just prints out compression
    statistics rather than the data.

    It lives at https://github.com/jwonders/accumulo-experiments with
    branches for Accumulo 1.6.x, 1.7.x, and 1.8.x.

    Any feedback is welcome.  I hope others find this useful for
    understanding this particular aspect of scan performance.

    V/R
    Jonathan


    On Thu, Oct 22, 2015 at 4:37 PM, Jonathan Wonders <[email protected]> wrote:

        Josh,

        Thanks for the information.  I did read through the discussion
        about compression of visibility expressions and columns within
        RFiles a while back, which got me thinking about some of this.
        It makes sense that gzip or lzo/snappy compression would have a
        very noticeable impact when there are columns or visibility
        expressions that are not compressed with RLE, even if neighboring
        rows have very small lexicographical deltas.

        I will put some thought into designing an experiment to evaluate
        whether or not there is any benefit to applying RLE during
        key-value transport from server to client.  Even if it proves to
        be situationally beneficial, I think it could be implemented as
        a common iterator similar to the WholeRowIterator.
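
        To illustrate the kind of encoding I have in mind, here is a
        rough sketch of prefix encoding over consecutive rows (just the
        idea, not a working iterator; the real thing would batch
        key-value pairs the way the WholeRowIterator does):

        import java.io.ByteArrayOutputStream;
        import java.io.DataOutputStream;
        import java.io.IOException;
        import java.util.List;

        public class RowPrefixEncoder {

          // Write each row as the length of the prefix it shares with the previous
          // row, the length of the remaining suffix, and then the suffix bytes.
          static byte[] encode(List<byte[]> rows) throws IOException {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            byte[] prev = new byte[0];
            for (byte[] row : rows) {
              int common = commonPrefixLength(prev, row);
              out.writeInt(common);                // shared prefix length
              out.writeInt(row.length - common);   // suffix length
              out.write(row, common, row.length - common);
              prev = row;
            }
            out.flush();
            return bytes.toByteArray();
          }

          static int commonPrefixLength(byte[] a, byte[] b) {
            int n = Math.min(a.length, b.length);
            int i = 0;
            while (i < n && a[i] == b[i])
              i++;
            return i;
          }
        }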

        Given the current compression strategy, I would expect better
        server-to-client transport compression when retrieving a single
        row with many columns

        <key><value>        : <id>    []

        compared to many lexicographically close rows.

        <key><value><id>    :         []

        with the understanding that very large rows can lead to poor
        load balancing.
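
        To make the two layouts concrete, here is roughly how each entry
        would be written (a sketch; the delimiters between row components
        are omitted and the column layout is a placeholder I picked for
        illustration):

        import org.apache.accumulo.core.data.Mutation;
        import org.apache.accumulo.core.data.Value;
        import org.apache.hadoop.io.Text;

        public class IndexLayouts {

          static final Value EMPTY = new Value(new byte[0]);

          // Layout A: one row per <key><value>, ids stored as column qualifiers.
          // Key.compress can drop the repeated row for every entry after the
          // first in a batch, so the row cost is paid roughly once.
          static Mutation columnPerId(String key, String value, String id) {
            Mutation m = new Mutation(new Text(key + value));   // delimiter omitted
            m.put(new Text(""), new Text(id), EMPTY);
            return m;
          }

          // Layout B: one row per <key><value><id>.
          // Every entry carries a distinct row, so Key.compress cannot drop it.
          static Mutation rowPerId(String key, String value, String id) {
            Mutation m = new Mutation(new Text(key + value + id));  // delimiter omitted
            m.put(new Text(""), new Text(""), EMPTY);
            return m;
          }
        }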

        V/R
        Jonathan

        On Thu, Oct 22, 2015 at 11:54 AM, Josh Elser <[email protected]> wrote:

            Jonathan Wonders wrote:

                I have been digging into some details of Accumulo to
                model the disk and
                network costs associated with various types of scan
                patterns and I have
                a few questions regarding compression.

                Assuming an inverted index table with rows following the
                pattern of

                <key><value><id>

                and a scan that specifies an exact key and value so as
                to constrain the
                range, it seems that the dominant factor in network
                utilization would
                be sending key-value pairs from the tablet server to the
                client and a
                secondary factor would be transmitting data from
                non-local RFiles
                (assuming no caching).
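
                A minimal sketch of how that range could be constrained,
                assuming the row is the concatenation <key><value><id>
                and the components are self-delimiting (the helper below
                is illustrative, not from Accumulo):

                import org.apache.accumulo.core.client.Connector;
                import org.apache.accumulo.core.client.Scanner;
                import org.apache.accumulo.core.data.Range;
                import org.apache.accumulo.core.security.Authorizations;
                import org.apache.hadoop.io.Text;

                public class InvertedIndexScan {

                  // All ids for an exact <key>,<value> pair share the row prefix
                  // <key><value>, so a prefix range covers exactly those entries.
                  static Scanner scanFor(Connector conn, String table, String key,
                      String value) throws Exception {
                    Scanner scanner = conn.createScanner(table, Authorizations.EMPTY);
                    scanner.setRange(Range.prefix(new Text(key + value)));
                    return scanner;
                  }
                }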


            Sounds about right to me.

                Is my understanding correct that the on-disk compression
                of this type of
                table is predominantly a function of the average number
                of differing
                bits between adjacent ids?  Or, has anyone observed a
                significant
                improvement with gz or lzo vs no additional
                compression?  I'm
                considering running some experiments to measure the
                difference for a few
                types of ids (uuid, snowflake-like, content based
                hashes), but I'm
                curious if anyone else has done similar experiments.


            There is definitely a noticeable difference between gzip,
            lzo/snappy (they're about on par in speed and size), and no
            compression.  I'm not sure what the deltas are off the top of
            my head, but I would expect that you would see noticeable
            differences.

            You can also notice differences in "densities" as the
            lexicographical delta between two keys changes.  Sequential
            keys that differ very little will result in very dense RFiles
            (lots of keys), whereas keys that vary greatly will result in
            fewer keys per file.  We've had some discussions about the
            run-length encoding in RFile lately -- have you stumbled on
            that?
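
            If you want to measure it on your own data, switching the
            codec per table and forcing a compaction is enough to rewrite
            the RFiles (a sketch; which codecs actually work depends on
            your build and native libraries):

            import org.apache.accumulo.core.client.Connector;

            public class CompressionSwitch {

              // table.file.compress.type accepts values like gz, lzo, snappy, or
              // none (codec availability varies); a full compaction rewrites the
              // existing RFiles with the new codec so on-disk sizes can be compared.
              static void useCodec(Connector conn, String table, String codec)
                  throws Exception {
                conn.tableOperations().setProperty(table, "table.file.compress.type", codec);
                conn.tableOperations().compact(table, null, null, true, true);
              }
            }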

                Given a scan that specifies a range for an exact key and
                value, is there
                any transport compression performed for tablet server to
                client
                communication beyond the Key.compress method which
                appears to only
                compress equivalent rows, columns, etc as opposed to
                those that share a
                common prefix?


            No, there are no compression algorithms applied to the data
            before sending it. By default we use the TCompactProtocol
            from Thrift. If you're interested in the specifics, Thrift
            has some documentation on the subject.

            I do recall some experiments I tried previously in which a
            logical "object" in Accumulo was comprised of multiple
            key-value pairs which were returned in one key-value pair
            (very similar to how the WholeRowIterator serializes
            things). In this case, I experimented with compressing the
            serialized Key-Values before returning from the Iterator
            stack. Sadly, I don't recall what the takeaway was, but I
            don't believe it was outright "bad" :)
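
            The shape of it was roughly the following: pack a row's
            Key-Values the way WholeRowIterator does, then gzip the
            result before handing it back as a single Value (a sketch of
            the approach, not the original code):

            import java.io.ByteArrayOutputStream;
            import java.io.IOException;
            import java.util.List;
            import java.util.zip.GZIPOutputStream;

            import org.apache.accumulo.core.data.Key;
            import org.apache.accumulo.core.data.Value;
            import org.apache.accumulo.core.iterators.user.WholeRowIterator;

            public class CompressedRowEncoder {

              // Pack all Key-Values for one row into a single Value, then gzip the
              // serialized bytes before they leave the iterator stack.
              static Value encode(List<Key> keys, List<Value> values) throws IOException {
                Value packed = WholeRowIterator.encodeRow(keys, values);
                ByteArrayOutputStream bytes = new ByteArrayOutputStream();
                try (GZIPOutputStream gzip = new GZIPOutputStream(bytes)) {
                  gzip.write(packed.get());
                }
                return new Value(bytes.toByteArray());
              }
            }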

                It seems possible to implement a more specialized
                compression algorithm
                with the iterator framework, performing the
                decompression on the client
                side, but I'm curious if it could lead to general scan
                performance
                improvements if the default compression also involved
                run-length encoding.


            This would be really neat to test.

            Some rigor in designing an experiment would be the first
            step, IMO. If you want to spear-head this, I'm sure there
            are many who would be happy to give some feedback. It would
            be prudent to figure out what the variables in the
            "equation" would be, specifically the distribution of
            Key-Value pairs -- amount stored, compression on disk, cache
            sizes, query workload.  Figure out what you want to test, what
            you will tweak, and define a hypothesis you want to test.

            You could consider using the continuous ingest framework for
            data-generation and query workloads. You could also look at
            YCSB [1]
            for some inspiration. Either of these would make your life
            easier in not having to generate the read/write workloads.


            [1] https://github.com/brianfrankcooper/YCSB


                Any insight on this subject is much appreciated.

                V/R
                Jonathan



