gzip. In fact, everything was basically done w/the default settings.
On Wed, May 15, 2013 at 12:00 PM, Josh Elser <[email protected]> wrote:

> RFile... with gzip? Or did you use another compressor?
>
> On 5/15/13 10:58 AM, Eric Newton wrote:
>
>> I ingested the 2-gram data on a 10 node cluster. It took just under 7
>> hours. For most of the job, accumulo ingested at about 200K k-v/server.
>>
>> $ hadoop fs -dus /accumulo/tables/2 /data/n-grams/2-grams
>> /accumulo/tables/2 74632273653
>> /data/n-grams/2-grams 154271541304
>>
>> That's a very nice result. RFile compressed the same data to half the
>> size of the gzip'd CSV format.
>>
>> There are 37,582,158,107 entries in the 2-gram set, which means that
>> accumulo is using only 2 bytes for each entry.
>>
>> -Eric Newton, which appeared 62 times in 37 books in 2008.
>>
>> On Fri, May 3, 2013 at 7:20 PM, Eric Newton <[email protected]> wrote:
>>
>>     ngram == row
>>     year == column family
>>     count == column qualifier (prepended with zeros for sort)
>>     book count == value
>>
>>     I used ascii text for the counts, even.
>>
>>     I'm not sure if the google entries are sorted, so the sort would
>>     help compression.
>>
>>     The RFile format does not repeat identical data from key to key, so
>>     in most cases, the row is not repeated. That gives gzip other
>>     things to work on.
>>
>>     I'll have to do more analysis to figure out why RFile did so well.
>>     Perhaps google used less aggressive settings for their compression.
>>
>>     I'm more interested in 2-grams to test our partial-row compression
>>     in 1.5.
>>
>>     -Eric
>>
>>     On Fri, May 3, 2013 at 4:09 PM, Jared Winick <[email protected]> wrote:
>>
>>         That is very interesting and sounds like a fun Friday project!
>>         Could you please elaborate on how you mapped the original format
>>         of
>>
>>         ngram TAB year TAB match_count TAB volume_count NEWLINE
>>
>>         into Accumulo key/values? Could you briefly explain what feature
>>         in Accumulo is responsible for this improvement in storage
>>         efficiency? This could be a helpful illustration for users to
>>         know how key/value design can take advantage of these Accumulo
>>         features. Thanks a lot!
>>
>>         Jared
>>
>>         On Fri, May 3, 2013 at 1:24 PM, Eric Newton <[email protected]> wrote:
>>
>>             I think David Medinets suggested some publicly available
>>             data sources that could be used to compare the storage
>>             requirements of different key/value stores.
>>
>>             Today I tried it out.
>>
>>             I took the google 1-gram word lists and ingested them into
>>             accumulo.
>>
>>             http://storage.googleapis.com/books/ngrams/books/datasetsv2.html
>>
>>             It took about 15 minutes to ingest on a 10 node cluster (4
>>             drives each).
>>
>>             $ hadoop fs -du -s -h /data/googlebooks/ngrams/1-grams
>>             running...
>>             5.2 G /data/googlebooks/ngrams/1-grams
>>
>>             $ hadoop fs -du -s -h /accumulo/tables/4
>>             running...
>>             4.1 G /accumulo/tables/4
>>
>>             The storage format in accumulo is about 20% more efficient
>>             than gzip'd csv files.
>>
>>             I'll post the 2-gram results sometime next month when it's
>>             done downloading. :-)
>>
>>             -Eric, which occurred 221K times in 34K books in 2008.
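
For readers following the thread, here is a minimal Python sketch of the key layout Eric describes (ngram as row, year as column family, zero-padded count as column qualifier, book count as value), together with the size arithmetic behind the "2 bytes for each entry" figure. It is not Accumulo client code and is not taken from the ingest job itself: the names Entry, to_entry, bytes_per_entry and COUNT_WIDTH are hypothetical, and the pad width of 10 digits is an assumption, since the thread does not say how many zeros were prepended.

```python
from typing import NamedTuple


class Entry(NamedTuple):
    """One key/value pair, laid out as described in the thread."""
    row: str        # the n-gram text
    family: str     # the year
    qualifier: str  # match_count, zero-padded so text order equals numeric order
    value: str      # volume_count ("book count"), kept as ASCII text


# Assumed pad width; the thread only says the count was "prepended with zeros".
COUNT_WIDTH = 10


def to_entry(line: str) -> Entry:
    """Split one 'ngram TAB year TAB match_count TAB volume_count' record."""
    ngram, year, match_count, volume_count = line.rstrip("\n").split("\t")
    return Entry(ngram, year, match_count.zfill(COUNT_WIDTH), volume_count)


def bytes_per_entry(total_bytes: int, entries: int) -> float:
    """Average on-disk bytes per key/value pair."""
    return total_bytes / entries


if __name__ == "__main__":
    # Record built from the signature line: 62 matches in 37 books in 2008.
    print(to_entry("Eric Newton\t2008\t62\t37\n"))

    # Sizes reported by 'hadoop fs -dus' for the 2-gram data.
    rfile_bytes = 74_632_273_653
    gzip_csv_bytes = 154_271_541_304
    entries = 37_582_158_107
    print(bytes_per_entry(rfile_bytes, entries))  # a little under 2 bytes
    print(rfile_bytes / gzip_csv_bytes)           # a little under one half
```

Because every record for the same ngram shares a row, sorted keys place those records next to each other, which is the situation where RFile's suppression of repeated key data helps most.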
