Re: interesting

Eric Newton Wed, 15 May 2013 11:58:47 -0700

I don't intend to do that.


On Wed, May 15, 2013 at 12:11 PM, Josh Elser <[email protected]> wrote:

> Just kidding, re-read the rest of this. Let me try again:
>
> Any plans to retry this with different compression codecs?
>
>
> On 5/15/13 12:00 PM, Josh Elser wrote:
>
>> RFile... with gzip? Or did you use another compressor?
>>
>> On 5/15/13 10:58 AM, Eric Newton wrote:
>>
>>> I ingested the 2-gram data on a 10 node cluster.  It took just under 7
>>> hours.  For most of the job, accumulo ingested at about 200K k-v/s per server.
>>>
>>> $ hadoop fs -dus /accumulo/tables/2 /data/n-grams/2-grams
>>> /accumulo/tables/2      74632273653
>>> /data/n-grams/2-grams   154271541304
>>>
>>> That's a very nice result.  RFile compressed the same data to half the
>>> gzip'd CSV format.
>>>
>>> There are 37,582,158,107 entries in the 2-gram set, which means that
>>> accumulo is using only 2 bytes for each entry.
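
The "half" and "2 bytes" figures above follow from the byte counts reported by `hadoop fs -dus`. A minimal Python sketch of that arithmetic, using only the numbers quoted in this message (the variable names are illustrative, not from the original):

```python
# Sizes in bytes, as reported by `hadoop fs -dus` in the message above.
rfile_bytes = 74_632_273_653      # /accumulo/tables/2
gzip_csv_bytes = 154_271_541_304  # /data/n-grams/2-grams
entries = 37_582_158_107          # entries in the 2-gram set

# RFile size relative to the gzip'd CSV files.
ratio = rfile_bytes / gzip_csv_bytes
# Average on-disk bytes per key/value entry.
bytes_per_entry = rfile_bytes / entries

print(f"RFile / gzip'd CSV: {ratio:.2f}")
print(f"bytes per entry:    {bytes_per_entry:.2f}")
```

The ratio comes out a little under one half, and the per-entry cost a little under 2 bytes.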
>>>
>>> -Eric Newton, which appeared 62 times in 37 books in 2008.
>>>
>>>
>>> On Fri, May 3, 2013 at 7:20 PM, Eric Newton <[email protected]> wrote:
>>>
>>>     ngram == row
>>>     year == column family
>>>     count == column qualifier (prepended with zeros for sort)
>>>     book count == value
>>>
>>>     I used ascii text for the counts, even.
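
As a rough illustration of the mapping described above, here is a small Python sketch that turns one line of the ngram format into a (row, column family, column qualifier, value) tuple. The function name and the padding width of 10 are assumptions made for the example; the message does not say how many zeros were prepended. The sample line uses the figures from the signature elsewhere in this thread.

```python
def ngram_line_to_entry(line, width=10):
    """Map 'ngram TAB year TAB match_count TAB volume_count' to key/value parts.

    `width` is a hypothetical padding width, not a value from the thread.
    """
    ngram, year, match_count, volume_count = line.rstrip("\n").split("\t")
    row = ngram
    column_family = year
    # Zero-pad the ASCII count so lexicographic order matches numeric order.
    column_qualifier = match_count.zfill(width)
    value = volume_count
    return row, column_family, column_qualifier, value


entry = ngram_line_to_entry("Eric Newton\t2008\t62\t37\n")
print(entry)
```

Padding matters because keys sort as bytes: without it, "100" would sort before "62".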
>>>
>>>     I'm not sure if the google entries are sorted, so the sort would
>>>     help compression.
>>>
>>>     The RFile format does not repeat identical data from key to key, so
>>>     in most cases, the row is not repeated.  That gives gzip other
>>>     things to work on.
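
To make the idea of not repeating identical data from key to key concrete, here is a toy Python sketch. It is not the actual RFile encoding: the `SAME` marker, the function name, and the sample keys are all made up for illustration. Each field of a sorted key is replaced by a marker when it equals the same field of the previous key.

```python
SAME = None  # hypothetical marker meaning "same as the previous key's field"


def elide_repeats(sorted_keys):
    """Replace each field that equals the previous key's field with SAME."""
    encoded = []
    prev = None
    for key in sorted_keys:
        if prev is None:
            # The first key has nothing to compare against, so store it whole.
            encoded.append(tuple(key))
        else:
            encoded.append(
                tuple(SAME if field == old else field
                      for field, old in zip(key, prev))
            )
        prev = key
    return encoded


# Made-up (row, column family, column qualifier) keys, already sorted.
keys = [
    ("a b", "2007", "0000000003"),
    ("a b", "2008", "0000000005"),
    ("a c", "2008", "0000000005"),
]
encoded = elide_repeats(keys)
for item in encoded:
    print(item)
```

With sorted ngram data, consecutive entries usually share a row, so the row is stored once and the general-purpose compressor sees less redundant input.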
>>>
>>>     I'll have to do more analysis to figure out why RFile did so well.
>>>     Perhaps google used less aggressive settings for their compression.
>>>
>>>     I'm more interested in 2-grams to test our partial-row compression
>>>     in 1.5.
>>>
>>>     -Eric
>>>
>>>
>>>     On Fri, May 3, 2013 at 4:09 PM, Jared Winick <[email protected]> wrote:
>>>
>>>         That is very interesting and sounds like a fun friday project!
>>>         Could you please elaborate on how you mapped the original
>>>         format of
>>>
>>>         ngram TAB year TAB match_count TAB volume_count NEWLINE
>>>
>>>         into Accumulo key/values? Could you briefly explain what feature
>>>         in Accumulo is responsible for this improvement in storage
>>>         efficiency? This could be a helpful illustration for users to
>>>         know how key/value design can take advantage of these Accumulo
>>>         features. Thanks a lot!
>>>
>>>         Jared
>>>
>>>
>>>         On Fri, May 3, 2013 at 1:24 PM, Eric Newton
>>>         <[email protected]> wrote:
>>>
>>>             I think David Medinets suggested some publicly available
>>>             data sources that could be used to compare the storage
>>>             requirements of different key/value stores.
>>>
>>>             Today I tried it out.
>>>
>>>             I took the google 1-gram word lists and ingested them into
>>>             accumulo.
>>>
>>>
>>>             http://storage.googleapis.com/books/ngrams/books/datasetsv2.html
>>>
>>>             It took about 15 minutes to ingest on a 10 node cluster (4
>>>             drives each).
>>>
>>>             $ hadoop fs -du -s -h /data/googlebooks/ngrams/1-grams
>>>             running...
>>>             5.2 G  /data/googlebooks/ngrams/1-grams
>>>
>>>             $ hadoop fs -du -s -h /accumulo/tables/4
>>>             running...
>>>             4.1 G  /accumulo/tables/4
>>>
>>>             The storage format in accumulo is about 20% more efficient
>>>             than gzip'd csv files.
>>>
>>>             I'll post the 2-gram results sometime next month when it's
>>>             done downloading. :-)
>>>
>>>             -Eric, which occurred 221K times in 34K books in 2008.
>>>
>>>
>>>
>>>
>>>
