I ingested the 2-gram data on a 10-node cluster. It took just under 7 hours. For most of the job, accumulo ingested at about 200K k-v/s per server.

$ hadoop fs -dus /accumulo/tables/2 /data/n-grams/2-grams
/accumulo/tables/2 74632273653
/data/n-grams/2-grams 154271541304

That's a very nice result. RFile compressed the same data to about half the size of the gzip'd CSV files.

There are 37,582,158,107 entries in the 2-gram set, which means that accumulo is using only about 2 bytes for each entry (74,632,273,653 bytes / 37,582,158,107 entries = 1.99).

-Eric Newton, which appeared 62 times in 37 books in 2008.

On Fri, May 3, 2013 at 7:20 PM, Eric Newton <[email protected]> wrote:

> ngram == row
> year == column family
> count == column qualifier (prepended with zeros for sort)
> book count == value
>
> I used ascii text for the counts, even.
>
> I'm not sure if the google entries are sorted, so the sort would help
> compression.
>
> The RFile format does not repeat identical data from key to key, so in
> most cases, the row is not repeated. That gives gzip other things to work
> on.
>
> I'll have to do more analysis to figure out why RFile did so well.
> Perhaps google used less aggressive settings for their compression.
>
> I'm more interested in 2-grams to test our partial-row compression in 1.5.
>
> -Eric
>
>
> On Fri, May 3, 2013 at 4:09 PM, Jared Winick <[email protected]> wrote:
>
>> That is very interesting and sounds like a fun friday project! Could you
>> please elaborate on how you mapped the original format of
>>
>> ngram TAB year TAB match_count TAB volume_count NEWLINE
>>
>> into Accumulo key/values? Could you briefly explain what feature in
>> Accumulo is responsible for this improvement in storage efficiency? This
>> could be a helpful illustration for users to know how key/value design can
>> take advantage of these Accumulo features. Thanks a lot!
>>
>> Jared
>>
>>
>> On Fri, May 3, 2013 at 1:24 PM, Eric Newton <[email protected]> wrote:
>>
>>> I think David Medinets suggested some publicly available data sources
>>> that could be used to compare the storage requirements of different
>>> key/value stores.
>>>
>>> Today I tried it out.
>>>
>>> I took the google 1-gram word lists and ingested them into accumulo.
>>>
>>> http://storage.googleapis.com/books/ngrams/books/datasetsv2.html
>>>
>>> It took about 15 minutes to ingest on a 10 node cluster (4 drives each).
>>>
>>> $ hadoop fs -du -s -h /data/googlebooks/ngrams/1-grams
>>> running...
>>> 5.2 G /data/googlebooks/ngrams/1-grams
>>>
>>> $ hadoop fs -du -s -h /accumulo/tables/4
>>> running...
>>> 4.1 G /accumulo/tables/4
>>>
>>> The storage format in accumulo is about 20% more efficient than gzip'd
>>> csv files.
>>>
>>> I'll post the 2-gram results sometime next month when it's done
>>> downloading. :-)
>>>
>>> -Eric, which occurred 221K times in 34K books in 2008.
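
P.S. For anyone who wants to try the same layout, here is a rough sketch of the mapping described in the thread (ngram as row, year as column family, zero-padded count as column qualifier, book count as value). It only splits an input line into the four fields and does not talk to Accumulo at all. The names Entry, to_entry and COUNT_WIDTH are made up for illustration, and the padding width of 12 digits is an assumption: the thread says the count was "prepended with zeros for sort" but not how many.

```python
from typing import NamedTuple

# Hypothetical padding width. The thread only says the count was
# "prepended with zeros for sort", not how many digits were used.
COUNT_WIDTH = 12


class Entry(NamedTuple):
    row: str        # the ngram
    family: str     # the year
    qualifier: str  # match_count, zero-padded so text order matches numeric order
    value: str      # volume_count (book count), kept as ASCII text


def to_entry(line: str) -> Entry:
    """Split one 'ngram TAB year TAB match_count TAB volume_count' line."""
    ngram, year, match_count, volume_count = line.rstrip("\n").split("\t")
    return Entry(ngram, year, match_count.zfill(COUNT_WIDTH), volume_count)


if __name__ == "__main__":
    # Numbers taken from the signature line earlier in the thread.
    print(to_entry("Eric Newton\t2008\t62\t37\n"))
```

The zero padding matters because keys are compared as bytes: without it, a count of "9" would sort after "10".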
