The version that will be 1.5.0... more-or-less.
On Sun, May 19, 2013 at 10:13 PM, Jim Klucar <[email protected]> wrote:

> Eric, what version of Accumulo did you use? I'm assuming 1.5.0
>
>
> On Wed, May 15, 2013 at 5:20 PM, Josh Elser <[email protected]> wrote:
>
>> Definitely, with a note on the ingest job duration, too.
>>
>>
>> On 05/15/2013 04:27 PM, Christopher wrote:
>>
>>> I'd be very curious how something faster, like Snappy, compared.
>>>
>>> --
>>> Christopher L Tubbs II
>>> http://gravatar.com/ctubbsii
>>>
>>>
>>> On Wed, May 15, 2013 at 2:52 PM, Eric Newton <[email protected]>
>>> wrote:
>>>
>>>> I don't intend to do that.
>>>>
>>>>
>>>> On Wed, May 15, 2013 at 12:11 PM, Josh Elser <[email protected]>
>>>> wrote:
>>>>
>>>>> Just kidding, re-read the rest of this. Let me try again:
>>>>>
>>>>> Any intents to retry this with different compression codecs?
>>>>>
>>>>>
>>>>> On 5/15/13 12:00 PM, Josh Elser wrote:
>>>>>
>>>>>> RFile... with gzip? Or did you use another compressor?
>>>>>>
>>>>>> On 5/15/13 10:58 AM, Eric Newton wrote:
>>>>>>
>>>>>>> I ingested the 2-gram data on a 10 node cluster. It took just under
>>>>>>> 7 hours. For most of the job, accumulo ingested at about 200K
>>>>>>> k-v/server.
>>>>>>>
>>>>>>> $ hadoop fs -dus /accumulo/tables/2 /data/n-grams/2-grams
>>>>>>> /accumulo/tables/2 74632273653
>>>>>>> /data/n-grams/2-grams 154271541304
>>>>>>>
>>>>>>> That's a very nice result. RFile compressed the same data to half
>>>>>>> the gzip'd CSV format.
>>>>>>>
>>>>>>> There are 37,582,158,107 entries in the 2-gram set, which means that
>>>>>>> accumulo is using only 2 bytes for each entry.
>>>>>>>
>>>>>>> -Eric Newton, which appeared 62 times in 37 books in 2008.
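The "half the size" and "2 bytes for each entry" figures in the 2-gram message can be checked with a few lines of Python. This is only a sketch of the arithmetic. It reads the -dus output as 74632273653 bytes for /accumulo/tables/2 and 154271541304 bytes for the gzip'd input, and that split of path and size is an assumption. The variable names are made up for illustration.

```python
# Figures as quoted in the 2-gram message; the split between path and
# byte count in the -dus output is an assumption.
rfile_bytes = 74_632_273_653      # /accumulo/tables/2
gzip_csv_bytes = 154_271_541_304  # /data/n-grams/2-grams
entries = 37_582_158_107          # entry count quoted in the message

bytes_per_entry = rfile_bytes / entries
size_ratio = rfile_bytes / gzip_csv_bytes

print(f"{bytes_per_entry:.2f} bytes per entry")
print(f"RFile is {size_ratio:.0%} of the gzip'd CSV size")
```

Both numbers come out slightly under the rounded figures in the message, so "half" and "2 bytes" are fair summaries.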
>>>>>>>
>>>>>>>
>>>>>>> On Fri, May 3, 2013 at 7:20 PM, Eric Newton <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>> ngram == row
>>>>>>> year == column family
>>>>>>> count == column qualifier (prepended with zeros for sort)
>>>>>>> book count == value
>>>>>>>
>>>>>>> I used ascii text for the counts, even.
>>>>>>>
>>>>>>> I'm not sure if the google entries are sorted, so the sort would
>>>>>>> help compression.
>>>>>>>
>>>>>>> The RFile format does not repeat identical data from key to key, so
>>>>>>> in most cases, the row is not repeated. That gives gzip other
>>>>>>> things to work on.
>>>>>>>
>>>>>>> I'll have to do more analysis to figure out why RFile did so well.
>>>>>>> Perhaps google used less aggressive settings for their
>>>>>>> compression.
>>>>>>>
>>>>>>> I'm more interested in 2-grams to test our partial-row compression
>>>>>>> in 1.5.
>>>>>>>
>>>>>>> -Eric
>>>>>>>
>>>>>>>
>>>>>>> On Fri, May 3, 2013 at 4:09 PM, Jared Winick <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>> That is very interesting and sounds like a fun Friday project!
>>>>>>> Could you please elaborate on how you mapped the original
>>>>>>> format of
>>>>>>>
>>>>>>> ngram TAB year TAB match_count TAB volume_count NEWLINE
>>>>>>>
>>>>>>> into Accumulo key/values? Could you briefly explain what feature
>>>>>>> in Accumulo is responsible for this improvement in storage
>>>>>>> efficiency? This could be a helpful illustration for users to
>>>>>>> know how key/value design can take advantage of these Accumulo
>>>>>>> features. Thanks a lot!
>>>>>>>
>>>>>>> Jared
>>>>>>>
>>>>>>>
>>>>>>> On Fri, May 3, 2013 at 1:24 PM, Eric Newton <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>> I think David Medinets suggested some publicly available
>>>>>>> data sources that could be used to compare the storage
>>>>>>> requirements of different key/value stores.
>>>>>>>
>>>>>>> Today I tried it out.
>>>>>>>
>>>>>>> I took the google 1-gram word lists and ingested them into
>>>>>>> accumulo.
>>>>>>>
>>>>>>> http://storage.googleapis.com/books/ngrams/books/datasetsv2.html
>>>>>>>
>>>>>>> It took about 15 minutes to ingest on a 10 node cluster (4
>>>>>>> drives each).
>>>>>>>
>>>>>>> $ hadoop fs -du -s -h /data/googlebooks/ngrams/1-grams
>>>>>>> running...
>>>>>>> 5.2 G /data/googlebooks/ngrams/1-grams
>>>>>>>
>>>>>>> $ hadoop fs -du -s -h /accumulo/tables/4
>>>>>>> running...
>>>>>>> 4.1 G /accumulo/tables/4
>>>>>>>
>>>>>>> The storage format in accumulo is about 20% more efficient
>>>>>>> than gzip'd csv files.
>>>>>>>
>>>>>>> I'll post the 2-gram results sometime next month when it's
>>>>>>> done downloading. :-)
>>>>>>>
>>>>>>> -Eric, which occurred 221K times in 34K books in 2008.
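For anyone wanting to try the key layout described in the thread (ngram as row, year as column family, zero-padded match count as column qualifier, book count as value), here is a rough Python sketch of the mapping. It is not the ingest code used in the thread and does not call the Accumulo API. The Entry type, the to_entry function, the padding width of 10, and the sample lines are all made up for illustration.

```python
from typing import NamedTuple

COUNT_WIDTH = 10  # hypothetical padding width; the thread does not state one


class Entry(NamedTuple):
    row: str        # the ngram
    family: str     # the year
    qualifier: str  # match count, zero-padded so text order equals numeric order
    value: str      # book (volume) count, kept as ASCII text


def to_entry(line: str) -> Entry:
    """Map one 'ngram TAB year TAB match_count TAB volume_count' line to an Entry."""
    ngram, year, match_count, volume_count = line.rstrip("\n").split("\t")
    return Entry(ngram, year, match_count.zfill(COUNT_WIDTH), volume_count)


# Made-up sample lines in the n-gram file layout, not real counts.
lines = [
    "banana\t2008\t120\t30\n",
    "apple\t2008\t9\t4\n",
    "apple\t2007\t12\t5\n",
]

# Sorting by (row, family, qualifier) mimics the order a sorted store keeps.
entries = sorted(to_entry(line) for line in lines)

# Count adjacent keys that share a row. The thread says RFile does not
# repeat identical data from key to key, so these are the rows it could skip.
repeated_rows = sum(
    1 for prev, cur in zip(entries, entries[1:]) if prev.row == cur.row
)

for entry in entries:
    print(entry)
print("adjacent keys sharing a row:", repeated_rows)
```

Zero-padding matters because the qualifier is compared as text: without it, "9" would sort after "120".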
