I don't intend to do that.
On Wed, May 15, 2013 at 12:11 PM, Josh Elser <[email protected]> wrote:

> Just kidding, re-read the rest of this. Let me try again:
>
> Any intents to retry this with different compression codecs?
>
>
> On 5/15/13 12:00 PM, Josh Elser wrote:
>
>> RFile... with gzip? Or did you use another compressor?
>>
>> On 5/15/13 10:58 AM, Eric Newton wrote:
>>
>>> I ingested the 2-gram data on a 10 node cluster. It took just under 7
>>> hours. For most of the job, accumulo ingested at about 200K k-v/server.
>>>
>>> $ hadoop fs -dus /accumulo/tables/2 /data/n-grams/2-grams
>>> /accumulo/tables/2       74632273653
>>> /data/n-grams/2-grams    154271541304
>>>
>>> That's a very nice result. RFile compressed the same data to half the
>>> gzip'd CSV format.
>>>
>>> There are 37,582,158,107 entries in the 2-gram set, which means that
>>> accumulo is using only 2 bytes for each entry.
>>>
>>> -Eric Newton, which appeared 62 times in 37 books in 2008.
>>>
>>>
>>> On Fri, May 3, 2013 at 7:20 PM, Eric Newton <[email protected]> wrote:
>>>
>>>     ngram == row
>>>     year == column family
>>>     count == column qualifier (prepended with zeros for sort)
>>>     book count == value
>>>
>>>     I used ascii text for the counts, even.
>>>
>>>     I'm not sure if the google entries are sorted, so the sort would
>>>     help compression.
>>>
>>>     The RFile format does not repeat identical data from key to key, so
>>>     in most cases, the row is not repeated. That gives gzip other
>>>     things to work on.
>>>
>>>     I'll have to do more analysis to figure out why RFile did so well.
>>>     Perhaps google used less aggressive settings for their compression.
>>>
>>>     I'm more interested in 2-grams to test our partial-row compression
>>>     in 1.5.
>>>
>>>     -Eric
>>>
>>>
>>>     On Fri, May 3, 2013 at 4:09 PM, Jared Winick <[email protected]> wrote:
>>>
>>>         That is very interesting and sounds like a fun friday project!
>>>         Could you please elaborate on how you mapped the original
>>>         format of
>>>
>>>         ngram TAB year TAB match_count TAB volume_count NEWLINE
>>>
>>>         into Accumulo key/values? Could you briefly explain what feature
>>>         in Accumulo is responsible for this improvement in storage
>>>         efficiency. This could be a helpful illustration for users to
>>>         know how key/value design can take advantage of these Accumulo
>>>         features. Thanks a lot!
>>>
>>>         Jared
>>>
>>>
>>>         On Fri, May 3, 2013 at 1:24 PM, Eric Newton
>>>         <[email protected]> wrote:
>>>
>>>             I think David Medinets suggested some publicly available
>>>             data sources that could be used to compare the storage
>>>             requirements of different key/value stores.
>>>
>>>             Today I tried it out.
>>>
>>>             I took the google 1-gram word lists and ingested them into
>>>             accumulo.
>>>
>>>             http://storage.googleapis.com/books/ngrams/books/datasetsv2.html
>>>
>>>             It took about 15 minutes to ingest on a 10 node cluster (4
>>>             drives each).
>>>
>>>             $ hadoop fs -du -s -h /data/googlebooks/ngrams/1-grams
>>>             running...
>>>             5.2 G /data/googlebooks/ngrams/1-grams
>>>
>>>             $ hadoop fs -du -s -h /accumulo/tables/4
>>>             running...
>>>             4.1 G /accumulo/tables/4
>>>
>>>             The storage format in accumulo is about 20% more efficient
>>>             than gzip'd csv files.
>>>
>>>             I'll post the 2-gram results sometime next month when its
>>>             done downloading. :-)
>>>
>>>             -Eric, which occurred 221K times in 34K books in 2008.
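
For anyone skimming the thread, here is a rough Python sketch of the key mapping described in the quoted messages (ngram as row, year as column family, zero-padded match count as column qualifier, book count as value), plus the bytes-per-entry arithmetic from the 2-gram run. The function name and the padding width of 12 are my own assumptions for illustration; the thread does not say how many zeros were prepended, and the actual ingest code was not posted.

```python
# Sketch only: mirrors the mapping described in the thread, not the real ingest code.

def ngram_line_to_entry(line, width=12):
    """Map 'ngram TAB year TAB match_count TAB volume_count' to
    (row, column family, column qualifier, value).

    width is a hypothetical padding length; the thread only says the
    count was "prepended with zeros for sort".
    """
    ngram, year, match_count, volume_count = line.rstrip("\n").split("\t")
    # Zero-padding makes the ASCII counts sort in numeric order.
    return (ngram, year, match_count.zfill(width), volume_count)


# Sizes reported by `hadoop fs -dus` in the thread, in bytes.
ACCUMULO_BYTES = 74_632_273_653   # /accumulo/tables/2
GZIP_CSV_BYTES = 154_271_541_304  # /data/n-grams/2-grams
ENTRY_COUNT = 37_582_158_107      # entries in the 2-gram set

bytes_per_entry = ACCUMULO_BYTES / ENTRY_COUNT
size_ratio = ACCUMULO_BYTES / GZIP_CSV_BYTES

if __name__ == "__main__":
    print(ngram_line_to_entry("Eric Newton\t2008\t62\t37\n"))
    print(f"{bytes_per_entry:.2f} bytes/entry, {size_ratio:.2f} of gzip'd CSV size")
```

The two ratios come out just under 2 bytes per entry and a little under half the gzip'd CSV size, which matches the figures quoted in the thread.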
