Eric, what version of Accumulo did you use? I'm assuming 1.5.0.
On Wed, May 15, 2013 at 5:20 PM, Josh Elser <[email protected]> wrote:
> Definitely, with a note on the ingest job duration, too.
>
> On 05/15/2013 04:27 PM, Christopher wrote:
>
>> I'd be very curious how something faster, like Snappy, compared.
>>
>> --
>> Christopher L Tubbs II
>> http://gravatar.com/ctubbsii
>>
>> On Wed, May 15, 2013 at 2:52 PM, Eric Newton <[email protected]> wrote:
>>
>>> I don't intend to do that.
>>>
>>> On Wed, May 15, 2013 at 12:11 PM, Josh Elser <[email protected]> wrote:
>>>
>>>> Just kidding, re-read the rest of this. Let me try again:
>>>>
>>>> Any intents to retry this with different compression codecs?
>>>>
>>>> On 5/15/13 12:00 PM, Josh Elser wrote:
>>>>
>>>>> RFile... with gzip? Or did you use another compressor?
>>>>>
>>>>> On 5/15/13 10:58 AM, Eric Newton wrote:
>>>>>
>>>>>> I ingested the 2-gram data on a 10 node cluster. It took just under 7
>>>>>> hours. For most of the job, accumulo ingested at about 200K
>>>>>> k-v/server.
>>>>>>
>>>>>> $ hadoop fs -dus /accumulo/tables/2 /data/n-grams/2-grams
>>>>>> /accumulo/tables/2 74632273653
>>>>>> /data/n-grams/2-grams 154271541304
>>>>>>
>>>>>> That's a very nice result. RFile compressed the same data to half the
>>>>>> gzip'd CSV format.
>>>>>>
>>>>>> There are 37,582,158,107 entries in the 2-gram set, which means that
>>>>>> accumulo is using only 2 bytes for each entry.
>>>>>>
>>>>>> -Eric Newton, which appeared 62 times in 37 books in 2008.
>>>>>>
>>>>>> On Fri, May 3, 2013 at 7:20 PM, Eric Newton <[email protected]> wrote:
>>>>>>
>>>>>> ngram == row
>>>>>> year == column family
>>>>>> count == column qualifier (prepended with zeros for sort)
>>>>>> book count == value
>>>>>>
>>>>>> I used ascii text for the counts, even.
>>>>>>
>>>>>> I'm not sure if the google entries are sorted, so the sort would
>>>>>> help compression.
>>>>>>
>>>>>> The RFile format does not repeat identical data from key to key, so
>>>>>> in most cases, the row is not repeated. That gives gzip other
>>>>>> things to work on.
>>>>>>
>>>>>> I'll have to do more analysis to figure out why RFile did so well.
>>>>>> Perhaps google used less aggressive settings for their compression.
>>>>>>
>>>>>> I'm more interested in 2-grams to test our partial-row compression
>>>>>> in 1.5.
>>>>>>
>>>>>> -Eric
>>>>>>
>>>>>> On Fri, May 3, 2013 at 4:09 PM, Jared Winick <[email protected]> wrote:
>>>>>>
>>>>>> That is very interesting and sounds like a fun friday project!
>>>>>> Could you please elaborate on how you mapped the original format of
>>>>>>
>>>>>> ngram TAB year TAB match_count TAB volume_count NEWLINE
>>>>>>
>>>>>> into Accumulo key/values? Could you briefly explain what feature
>>>>>> in Accumulo is responsible for this improvement in storage
>>>>>> efficiency. This could be a helpful illustration for users to
>>>>>> know how key/value design can take advantage of these Accumulo
>>>>>> features. Thanks a lot!
>>>>>>
>>>>>> Jared
>>>>>>
>>>>>> On Fri, May 3, 2013 at 1:24 PM, Eric Newton <[email protected]> wrote:
>>>>>>
>>>>>> I think David Medinets suggested some publicly available
>>>>>> data sources that could be used to compare the storage
>>>>>> requirements of different key/value stores.
>>>>>>
>>>>>> Today I tried it out.
>>>>>>
>>>>>> I took the google 1-gram word lists and ingested them into
>>>>>> accumulo.
>>>>>>
>>>>>> http://storage.googleapis.com/books/ngrams/books/datasetsv2.html
>>>>>>
>>>>>> It took about 15 minutes to ingest on a 10 node cluster (4
>>>>>> drives each).
>>>>>>
>>>>>> $ hadoop fs -du -s -h /data/googlebooks/ngrams/1-grams
>>>>>> running...
>>>>>> 5.2 G /data/googlebooks/ngrams/1-grams
>>>>>>
>>>>>> $ hadoop fs -du -s -h /accumulo/tables/4
>>>>>> running...
>>>>>> 4.1 G /accumulo/tables/4
>>>>>>
>>>>>> The storage format in accumulo is about 20% more efficient
>>>>>> than gzip'd csv files.
>>>>>>
>>>>>> I'll post the 2-gram results sometime next month when it's
>>>>>> done downloading. :-)
>>>>>>
>>>>>> -Eric, which occurred 221K times in 34K books in 2008.
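
For anyone following along, the key layout Eric describes in the thread (ngram as row, year as column family, zero-padded match count as column qualifier, book count as value) could be sketched roughly as below. This is only an illustration in plain Python, not the actual ingest code: the Entry type, the function names, and the padding width of 12 digits are assumptions made up for the example, and nothing here calls the Accumulo client API. The second function just redoes the bytes-per-entry arithmetic from the figures quoted in the thread.

```python
from typing import NamedTuple


class Entry(NamedTuple):
    """Hypothetical stand-in for an Accumulo key/value pair."""
    row: str
    column_family: str
    column_qualifier: str
    value: str


# Assumed padding width; the thread only says the count was
# "prepended with zeros for sort", not how many digits were used.
COUNT_WIDTH = 12


def ngram_line_to_entry(line: str) -> Entry:
    """Map 'ngram TAB year TAB match_count TAB volume_count' to an entry."""
    ngram, year, match_count, volume_count = line.rstrip("\n").split("\t")
    return Entry(
        row=ngram,
        column_family=year,
        # Zero-padding makes ASCII (lexicographic) order match numeric order.
        column_qualifier=match_count.zfill(COUNT_WIDTH),
        value=volume_count,
    )


def bytes_per_entry(total_bytes: int, entries: int) -> float:
    """Average on-disk bytes per key/value entry."""
    return total_bytes / entries


if __name__ == "__main__":
    # Sample line built from Eric's signature in the thread.
    print(ngram_line_to_entry("Eric Newton\t2008\t62\t37\n"))
    # Table size and entry count quoted in the thread for the 2-gram set.
    print(bytes_per_entry(74632273653, 37582158107))
```

Because consecutive keys for the same ngram share the row, a layout like this gives RFile many identical rows in sequence that it does not have to repeat, which is the effect Eric points to when explaining the result.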
