RFile... with gzip? Or did you use another compressor?

On 5/15/13 10:58 AM, Eric Newton wrote:
I ingested the 2-gram data on a 10 node cluster.  It took just under 7
hours.  For most of the job, accumulo ingested at about 200K k-v/s per server.

$ hadoop fs -dus /accumulo/tables/2 /data/n-grams/2-grams
/accumulo/tables/2       74632273653
/data/n-grams/2-grams    154271541304

That's a very nice result.  RFile compressed the same data to about half
the size of the gzip'd CSV format.

There are 37,582,158,107 entries in the 2-gram set, which means that
accumulo is using only about 2 bytes for each entry.
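
As a rough check of that arithmetic, here is a minimal Python sketch. It
takes the byte counts from the -dus output above as 74632273653 for
/accumulo/tables/2 and 154271541304 for /data/n-grams/2-grams; the variable
names are only illustrative.

```python
# Sizes in bytes, as reported by `hadoop fs -dus` in the message above.
rfile_bytes = 74_632_273_653       # /accumulo/tables/2
gzip_csv_bytes = 154_271_541_304   # /data/n-grams/2-grams
entries = 37_582_158_107           # entries in the 2-gram set

# Fraction of the gzip'd CSV size that the RFiles occupy.
ratio = rfile_bytes / gzip_csv_bytes

# Average on-disk bytes per key/value entry.
bytes_per_entry = rfile_bytes / entries

print(f"RFile / gzip'd CSV: {ratio:.2f}")
print(f"bytes per entry:    {bytes_per_entry:.2f}")
```

Both figures are averages over the whole table, so they say nothing about
how individual entries compress.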

-Eric Newton, which appeared 62 times in 37 books in 2008.


On Fri, May 3, 2013 at 7:20 PM, Eric Newton <[email protected]
<mailto:[email protected]>> wrote:

    ngram == row
    year == column family
    count == column qualifier (prepended with zeros for sort)
    book count == value

    I used ascii text for the counts, even.
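
A minimal sketch of that mapping in Python, assuming the tab-separated
input format quoted further down the thread. The function name and the
padding width of 12 are hypothetical (the thread does not say how many
zeros were used), and the sample line is built from the sign-off above.

```python
def ngram_line_to_entry(line, width=12):
    """Map 'ngram TAB year TAB match_count TAB volume_count' to a
    (row, column family, column qualifier, value) tuple of ASCII strings."""
    ngram, year, match_count, volume_count = line.rstrip("\n").split("\t")
    # Zero-pad the count so that lexicographic order matches numeric order.
    # The width is an assumption, not taken from the thread.
    qualifier = match_count.zfill(width)
    return (ngram, year, qualifier, volume_count)


entry = ngram_line_to_entry("Eric Newton\t2008\t62\t37\n")
print(entry)
```

Without the padding, "9" would sort after "10" as text, which is why the
qualifier is padded before it is written.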

    I'm not sure if the google entries are sorted, so the sort would
    help compression.

    The RFile format does not repeat identical data from key to key, so
    in most cases, the row is not repeated.  That gives gzip other
    things to work on.
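
A toy illustration of the idea in Python. This is not RFile's actual
on-disk encoding, only a sketch of "do not repeat a field that equals the
same field in the previous key". The function name is hypothetical and the
sample keys are made up for illustration.

```python
def relative_encode(keys):
    """Replace each field that equals the same field in the previous key
    with None, so only the fields that changed need to be stored."""
    prev = (None, None, None)  # nothing precedes the first key
    encoded = []
    for key in keys:
        encoded.append(tuple(None if f == p else f for f, p in zip(key, prev)))
        prev = key
    return encoded


# Made-up (row, column family, column qualifier) keys in sorted order.
keys = [
    ("Eric Newton", "2007", "000000000050"),
    ("Eric Newton", "2008", "000000000062"),
    ("Eric Nichols", "2008", "000000000007"),
]
encoded = relative_encode(keys)
for key in encoded:
    print(key)
```

With sorted n-gram data the row repeats once per year, so most keys would
drop their row under a scheme like this.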

    I'll have to do more analysis to figure out why RFile did so well.
    Perhaps google used less aggressive settings for their compression.

    I'm more interested in 2-grams to test our partial-row compression
    in 1.5.

    -Eric


    On Fri, May 3, 2013 at 4:09 PM, Jared Winick <[email protected]
    <mailto:[email protected]>> wrote:

        That is very interesting and sounds like a fun friday project!
        Could you please elaborate on how you mapped the original format of

        ngram TAB year TAB match_count TAB volume_count NEWLINE

        into Accumulo key/values? Could you briefly explain what feature
        in Accumulo is responsible for this improvement in storage
        efficiency. This could be a helpful illustration for users to
        know how key/value design can take advantage of these Accumulo
        features. Thanks a lot!

        Jared


        On Fri, May 3, 2013 at 1:24 PM, Eric Newton
        <[email protected] <mailto:[email protected]>> wrote:

            I think David Medinets suggested some publicly available
            data sources that could be used to compare the storage
            requirements of different key/value stores.

            Today I tried it out.

            I took the google 1-gram word lists and ingested them into
            accumulo.

            http://storage.googleapis.com/books/ngrams/books/datasetsv2.html

            It took about 15 minutes to ingest on a 10 node cluster (4
            drives each).

            $ hadoop fs -du -s -h /data/googlebooks/ngrams/1-grams
            running...
            5.2 G  /data/googlebooks/ngrams/1-grams

            $ hadoop fs -du -s -h /accumulo/tables/4
            running...
            4.1 G  /accumulo/tables/4

            The storage format in accumulo is about 20% more efficient
            than gzip'd csv files.

            I'll post the 2-gram results sometime next month when it's
            done downloading. :-)

            -Eric, which occurred 221K times in 34K books in 2008.



