I think David Medinets suggested some publicly available data sources that could be used to compare the storage requirements of different key/value stores.
Today I tried it out. I took the Google 1-gram word lists and ingested them into Accumulo.

http://storage.googleapis.com/books/ngrams/books/datasetsv2.html

It took about 15 minutes to ingest on a 10-node cluster (4 drives each).

$ hadoop fs -du -s -h /data/googlebooks/ngrams/1-grams
running...
5.2 G /data/googlebooks/ngrams/1-grams
$ hadoop fs -du -s -h /accumulo/tables/4
running...
4.1 G /accumulo/tables/4

The storage format in Accumulo is about 20% smaller than the gzip'd csv files (4.1 G vs. 5.2 G).

I'll post the 2-gram results sometime next month when it's done downloading. :-)

-Eric, which occurred 221K times in 34K books in 2008.
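
P.S. For anyone who wants to check the "about 20%" figure, here is a minimal sketch of the arithmetic in Python. It uses only the two rounded sizes reported by `hadoop fs -du` above; the variable names are just for illustration, and the result is only as precise as those rounded sizes.

```python
# Sizes as reported (rounded) by `hadoop fs -du -s -h` in the message above.
raw_gzip_gb = 5.2   # /data/googlebooks/ngrams/1-grams (gzip'd input files)
accumulo_gb = 4.1   # /accumulo/tables/4 (same data after ingest)

# Fraction of space saved relative to the gzip'd input.
savings = 1 - accumulo_gb / raw_gzip_gb

print(f"Accumulo table is {savings:.1%} smaller than the gzip'd input")
```

With the rounded inputs this comes out to roughly 21%, which is consistent with the "about 20%" estimate.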
