Re: interesting

Josh Elser Wed, 15 May 2013 14:20:36 -0700

Definitely, with a note on the ingest job duration, too.


On 05/15/2013 04:27 PM, Christopher wrote:

I'd be very curious how something faster, like Snappy, compared.

--
Christopher L Tubbs II
http://gravatar.com/ctubbsii


On Wed, May 15, 2013 at 2:52 PM, Eric Newton <[email protected]> wrote:

I don't intend to do that.


On Wed, May 15, 2013 at 12:11 PM, Josh Elser <[email protected]> wrote:

Just kidding, re-read the rest of this. Let me try again:

Any intent to retry this with different compression codecs?


On 5/15/13 12:00 PM, Josh Elser wrote:

RFile... with gzip? Or did you use another compressor?

On 5/15/13 10:58 AM, Eric Newton wrote:

I ingested the 2-gram data on a 10 node cluster.  It took just under 7
hours.  For most of the job, accumulo ingested at about 200K k-v/server.

$ hadoop fs -dus /accumulo/tables/2 /data/n-grams/2-grams
/accumulo/tables/2       74632273653
/data/n-grams/2-grams    154271541304

That's a very nice result.  RFile compressed the same data to half the
gzip'd CSV format.

There are 37,582,158,107 entries in the 2-gram set, which means that
accumulo is using only 2 bytes for each entry.
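
As a rough check of those two claims, here is a minimal sketch that redoes the arithmetic from the byte counts and entry count quoted above. The variable names are illustrative only, and the byte counts are read as the sizes reported for each path by the `hadoop fs -dus` command.

```python
# Figures quoted in the message above (sizes in bytes, as reported by hadoop fs -dus).
rfile_bytes = 74_632_273_653       # /accumulo/tables/2
gzip_csv_bytes = 154_271_541_304   # /data/n-grams/2-grams
entries = 37_582_158_107           # entries in the 2-gram set

# Average on-disk cost of one key/value entry in the RFiles.
bytes_per_entry = rfile_bytes / entries

# Size of the RFile data relative to the gzip'd CSV input.
size_ratio = rfile_bytes / gzip_csv_bytes

print(f"{bytes_per_entry:.2f} bytes per entry")
print(f"RFile data is {size_ratio:.0%} of the gzip'd CSV size")
```

Both results land close to the rounded figures in the message: just under 2 bytes per entry, and a little under half the size of the gzip'd CSV.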

-Eric Newton, which appeared 62 times in 37 books in 2008.


On Fri, May 3, 2013 at 7:20 PM, Eric Newton <[email protected]> wrote:

     ngram == row
     year == column family
     count == column qualifier (prepended with zeros for sort)
     book count == value

     I used ascii text for the counts, even.
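
A minimal sketch of that mapping, applied to one line of the tab-separated n-gram format, might look like the following. It uses plain Python tuples rather than any Accumulo client API. The names `Entry`, `to_entry`, and `PAD_WIDTH` are hypothetical, and the 10-digit pad width is an assumption, since the thread does not say how many zeros were prepended.

```python
from typing import NamedTuple


class Entry(NamedTuple):
    """One key/value pair, following the mapping described above."""
    row: str               # the n-gram
    column_family: str     # the year
    column_qualifier: str  # the match count, zero-padded so it sorts numerically
    value: str             # the book (volume) count, kept as ASCII text


PAD_WIDTH = 10  # assumed width; the thread does not state it


def to_entry(line: str) -> Entry:
    """Split 'ngram TAB year TAB match_count TAB volume_count' into an Entry."""
    ngram, year, match_count, volume_count = line.rstrip("\n").split("\t")
    return Entry(ngram, year, match_count.zfill(PAD_WIDTH), volume_count)


entry = to_entry("Eric Newton\t2008\t62\t37\n")
print(entry)
```

Zero-padding matters because keys are compared as bytes: without it, a count of "9" would sort after "10".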

     I'm not sure if the google entries are sorted, so the sort would
     help compression.

     The RFile format does not repeat identical data from key to key, so
     in most cases, the row is not repeated.  That gives gzip other
     things to work on.
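
The idea of not repeating identical data from key to key can be illustrated with a simplified sketch. This is not the actual RFile on-disk encoding, only a toy version of the principle: in a sorted run of keys, any field that matches the same field in the previous key is replaced with a marker (`None` here) instead of being written again. The name `elide_repeats` and the sample keys are hypothetical.

```python
from typing import Iterable, Iterator, Optional, Tuple

Key = Tuple[str, str, str]  # (row, column family, column qualifier)


def elide_repeats(keys: Iterable[Key]) -> Iterator[Tuple[Optional[str], ...]]:
    """Yield each key with fields equal to the previous key's replaced by None."""
    previous: Optional[Key] = None
    for key in keys:
        if previous is None:
            # The first key has nothing to be relative to, so emit it in full.
            yield key
        else:
            yield tuple(
                None if current == prior else current
                for current, prior in zip(key, previous)
            )
        previous = key


sorted_keys = [
    ("the cat", "2007", "0000000012"),
    ("the cat", "2008", "0000000015"),
    ("the dog", "2008", "0000000015"),
]
encoded = list(elide_repeats(sorted_keys))
for item in encoded:
    print(item)
```

With one row per n-gram and one entry per year, the row is repeated across many consecutive keys, so under this scheme it is stored once per run.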

     I'll have to do more analysis to figure out why RFile did so well.
     Perhaps google used less aggressive settings for their compression.

     I'm more interested in 2-grams to test our partial-row compression
     in 1.5.

     -Eric


     On Fri, May 3, 2013 at 4:09 PM, Jared Winick <[email protected]> wrote:

         That is very interesting and sounds like a fun friday project!
         Could you please elaborate on how you mapped the original
         format of

         ngram TAB year TAB match_count TAB volume_count NEWLINE

         into Accumulo key/values? Could you briefly explain what feature
         in Accumulo is responsible for this improvement in storage
         efficiency? This could be a helpful illustration for users of
         how key/value design can take advantage of these Accumulo
         features. Thanks a lot!

         Jared


         On Fri, May 3, 2013 at 1:24 PM, Eric Newton <[email protected]> wrote:

             I think David Medinets suggested some publicly available
             data sources that could be used to compare the storage
             requirements of different key/value stores.

             Today I tried it out.

             I took the google 1-gram word lists and ingested them into
             accumulo.


             http://storage.googleapis.com/books/ngrams/books/datasetsv2.html

             It took about 15 minutes to ingest on a 10 node cluster (4
             drives each).

             $ hadoop fs -du -s -h /data/googlebooks/ngrams/1-grams
             running...
             5.2 G  /data/googlebooks/ngrams/1-grams

             $ hadoop fs -du -s -h /accumulo/tables/4
             running...
             4.1 G  /accumulo/tables/4

             The storage format in accumulo is about 20% more efficient
             than gzip'd csv files.

             I'll post the 2-gram results sometime next month when it's
             done downloading. :-)

             -Eric, which occurred 221K times in 34K books in 2008.
