That is very interesting and sounds like a fun Friday project! Could you please elaborate on how you mapped the original format of

ngram TAB year TAB match_count TAB volume_count NEWLINE

into Accumulo key/values? Could you also briefly explain which feature in Accumulo is responsible for this improvement in storage efficiency? This would be a helpful illustration of how key/value design can take advantage of these Accumulo features.

Thanks a lot!

Jared

On Fri, May 3, 2013 at 1:24 PM, Eric Newton <[email protected]> wrote:

> I think David Medinets suggested some publicly available data sources that
> could be used to compare the storage requirements of different key/value
> stores.
>
> Today I tried it out.
>
> I took the google 1-gram word lists and ingested them into accumulo.
>
> http://storage.googleapis.com/books/ngrams/books/datasetsv2.html
>
> It took about 15 minutes to ingest on a 10 node cluster (4 drives each).
>
> $ hadoop fs -du -s -h /data/googlebooks/ngrams/1-grams
> running...
> 5.2 G /data/googlebooks/ngrams/1-grams
>
> $ hadoop fs -du -s -h /accumulo/tables/4
> running...
> 4.1 G /accumulo/tables/4
>
> The storage format in accumulo is about 20% more efficient than gzip'd csv
> files.
>
> I'll post the 2-gram results sometime next month when it's done
> downloading. :-)
>
> -Eric, which occurred 221K times in 34K books in 2008.
>
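
To make the question concrete, here is a minimal Python sketch of one possible mapping from the tab-separated record format to a key/value layout. It is purely illustrative: the thread does not say how the data was actually laid out, so the choice of row = ngram, column family = year, empty qualifier, and value = "match_count,volume_count" is an assumption, as are the names `Entry`, `to_entry`, and `savings`. The sample input line is made up. The last function just reproduces the size comparison from the two `hadoop fs -du` figures quoted above.

```python
from typing import NamedTuple


class Entry(NamedTuple):
    """Hypothetical key/value layout; not necessarily what was used in the thread."""
    row: str        # assumed: the ngram itself
    family: str     # assumed: the year
    qualifier: str  # assumed: left empty
    value: str      # assumed: "match_count,volume_count"


def to_entry(line: str) -> Entry:
    """Split one 'ngram TAB year TAB match_count TAB volume_count NEWLINE' record."""
    ngram, year, match_count, volume_count = line.rstrip("\n").split("\t")
    return Entry(row=ngram, family=year, qualifier="",
                 value=f"{match_count},{volume_count}")


def savings(original_gb: float, stored_gb: float) -> float:
    """Fraction of space saved relative to the original size."""
    return 1 - stored_gb / original_gb


if __name__ == "__main__":
    # Made-up sample record, for illustration only.
    print(to_entry("example\t2008\t100\t10\n"))
    # Sizes quoted in the thread: 5.2 G of source files vs. 4.1 G in Accumulo.
    print(f"{savings(5.2, 4.1):.1%}")
```

With a layout like this, all years for a given ngram sort next to each other under the same row, which is the kind of key design choice the question is asking about.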
