I'd be very curious how something faster, like Snappy, compared.

--
Christopher L Tubbs II
http://gravatar.com/ctubbsii

On Wed, May 15, 2013 at 2:52 PM, Eric Newton <[email protected]> wrote:
> I don't intend to do that.
>
>
> On Wed, May 15, 2013 at 12:11 PM, Josh Elser <[email protected]> wrote:
>>
>> Just kidding, re-read the rest of this. Let me try again:
>>
>> Any intents to retry this with different compression codecs?
>>
>>
>> On 5/15/13 12:00 PM, Josh Elser wrote:
>>>
>>> RFile... with gzip? Or did you use another compressor?
>>>
>>> On 5/15/13 10:58 AM, Eric Newton wrote:
>>>>
>>>> I ingested the 2-gram data on a 10 node cluster. It took just under 7
>>>> hours. For most of the job, accumulo ingested at about 200K k-v/server.
>>>>
>>>> $ hadoop fs -dus /accumulo/tables/2 /data/n-grams/2-grams
>>>> /accumulo/tables/2       74632273653
>>>> /data/n-grams/2-grams    154271541304
>>>>
>>>> That's a very nice result. RFile compressed the same data to half the
>>>> gzip'd CSV format.
>>>>
>>>> There are 37,582,158,107 entries in the 2-gram set, which means that
>>>> accumulo is using only 2 bytes for each entry.
>>>>
>>>> -Eric Newton, which appeared 62 times in 37 books in 2008.
>>>>
>>>>
>>>> On Fri, May 3, 2013 at 7:20 PM, Eric Newton <[email protected]> wrote:
>>>>
>>>>     ngram == row
>>>>     year == column family
>>>>     count == column qualifier (prepended with zeros for sort)
>>>>     book count == value
>>>>
>>>>     I used ascii text for the counts, even.
>>>>
>>>>     I'm not sure if the google entries are sorted, so the sort would
>>>>     help compression.
>>>>
>>>>     The RFile format does not repeat identical data from key to key, so
>>>>     in most cases, the row is not repeated. That gives gzip other
>>>>     things to work on.
>>>>
>>>>     I'll have to do more analysis to figure out why RFile did so well.
>>>>     Perhaps google used less aggressive settings for their
>>>>     compression.
>>>>
>>>>     I'm more interested in 2-grams to test our partial-row compression
>>>>     in 1.5.
>>>>
>>>>     -Eric
>>>>
>>>>
>>>>     On Fri, May 3, 2013 at 4:09 PM, Jared Winick <[email protected]> wrote:
>>>>
>>>>         That is very interesting and sounds like a fun friday project!
>>>>         Could you please elaborate on how you mapped the original
>>>>         format of
>>>>
>>>>         ngram TAB year TAB match_count TAB volume_count NEWLINE
>>>>
>>>>         into Accumulo key/values? Could you briefly explain what feature
>>>>         in Accumulo is responsible for this improvement in storage
>>>>         efficiency? This could be a helpful illustration for users to
>>>>         know how key/value design can take advantage of these Accumulo
>>>>         features. Thanks a lot!
>>>>
>>>>         Jared
>>>>
>>>>
>>>>         On Fri, May 3, 2013 at 1:24 PM, Eric Newton <[email protected]> wrote:
>>>>
>>>>             I think David Medinets suggested some publicly available
>>>>             data sources that could be used to compare the storage
>>>>             requirements of different key/value stores.
>>>>
>>>>             Today I tried it out.
>>>>
>>>>             I took the google 1-gram word lists and ingested them into
>>>>             accumulo.
>>>>
>>>>             http://storage.googleapis.com/books/ngrams/books/datasetsv2.html
>>>>
>>>>             It took about 15 minutes to ingest on a 10 node cluster (4
>>>>             drives each).
>>>>
>>>>             $ hadoop fs -du -s -h /data/googlebooks/ngrams/1-grams
>>>>             running...
>>>>             5.2 G  /data/googlebooks/ngrams/1-grams
>>>>
>>>>             $ hadoop fs -du -s -h /accumulo/tables/4
>>>>             running...
>>>>             4.1 G  /accumulo/tables/4
>>>>
>>>>             The storage format in accumulo is about 20% more efficient
>>>>             than gzip'd csv files.
>>>>
>>>>             I'll post the 2-gram results sometime next month when it's
>>>>             done downloading. :-)
>>>>
>>>>             -Eric, which occurred 221K times in 34K books in 2008.
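
For anyone wanting to reproduce the numbers in this thread, here is a rough Python sketch of the key mapping Eric describes (ngram as row, year as column family, zero-padded match count as column qualifier, book count as value), along with the size arithmetic. It does not use the Accumulo client API; `Entry`, `ngram_line_to_entry`, `bytes_per_entry`, and the pad width `COUNT_WIDTH` are names and values assumed here for illustration only, since the thread does not say how many zeros were prepended. The byte counts and entry count are the ones quoted above.

```python
from typing import NamedTuple

# Hypothetical pad width: the thread only says counts were
# "prepended with zeros for sort", not how many digits were used.
COUNT_WIDTH = 12


class Entry(NamedTuple):
    """Plain stand-in for an Accumulo key/value pair (not the real client API)."""
    row: str
    column_family: str
    column_qualifier: str
    value: str


def ngram_line_to_entry(line: str) -> Entry:
    """Map 'ngram TAB year TAB match_count TAB volume_count' to an Entry."""
    ngram, year, match_count, volume_count = line.rstrip("\n").split("\t")
    # Zero-padding makes the ASCII counts sort in numeric order.
    return Entry(ngram, year, match_count.zfill(COUNT_WIDTH), volume_count)


def bytes_per_entry(total_bytes: int, entries: int) -> float:
    """Average on-disk bytes used per key/value entry."""
    return total_bytes / entries


# Figures quoted in the thread for the 2-gram data set.
ACCUMULO_2GRAM_BYTES = 74_632_273_653
GZIP_2GRAM_BYTES = 154_271_541_304
ENTRIES_2GRAM = 37_582_158_107

if __name__ == "__main__":
    print(ngram_line_to_entry("Eric Newton\t2008\t62\t37\n"))
    print(f"bytes/entry: {bytes_per_entry(ACCUMULO_2GRAM_BYTES, ENTRIES_2GRAM):.2f}")
    print(f"RFile / gzip'd CSV: {ACCUMULO_2GRAM_BYTES / GZIP_2GRAM_BYTES:.2f}")
    # 1-gram sizes were reported as 4.1 G (Accumulo) vs 5.2 G (gzip'd CSV).
    print(f"1-gram size reduction: {1 - 4.1 / 5.2:.0%}")
```

The ratios come out close to what was reported: just under 2 bytes per entry and roughly half the size of the gzip'd CSV for the 2-grams, and about a fifth smaller for the 1-grams.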
