Re: interesting

Eric Newton Wed, 15 May 2013 07:58:43 -0700

I ingested the 2-gram data on a 10 node cluster.  It took just under 7
hours.  For most of the job, accumulo ingested at about 200K k-v/sec per
server.


$ hadoop fs -dus /accumulo/tables/2 /data/n-grams/2-grams
/accumulo/tables/2  74632273653
/data/n-grams/2-grams 154271541304

That's a very nice result.  RFile compressed the same data to about half
the size of the gzip'd CSV format.

There are 37,582,158,107 entries in the 2-gram set, which means that
accumulo is using only about 2 bytes for each entry.
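
As a quick check of the arithmetic above, here is a minimal sketch that recomputes the size ratio and the bytes per entry from the figures quoted in this message. The variable names are just for illustration.

```python
# Sizes in bytes, copied from the `hadoop fs -dus` output above.
accumulo_bytes = 74_632_273_653    # /accumulo/tables/2
gzip_csv_bytes = 154_271_541_304   # /data/n-grams/2-grams

# Entry count quoted above for the 2-gram set.
entries = 37_582_158_107

# Fraction of the gzip'd CSV size that the RFile data occupies.
size_ratio = accumulo_bytes / gzip_csv_bytes

# Average on-disk bytes per key/value entry.
bytes_per_entry = accumulo_bytes / entries

print(f"RFile size / gzip'd CSV size: {size_ratio:.3f}")
print(f"bytes per entry: {bytes_per_entry:.2f}")
```

Both numbers are averages over the whole table, so they say nothing about how individual entries compress.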

-Eric Newton, which appeared 62 times in 37 books in 2008.


On Fri, May 3, 2013 at 7:20 PM, Eric Newton <[email protected]> wrote:

> ngram == row
> year == column family
> count == column qualifier (prepended with zeros for sort)
> book count == value
>
> I used ascii text for the counts, even.
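
The mapping described just above can be sketched roughly as follows. This is plain Python, not the Accumulo client API, and not the actual ingest code. The function name and the padding width of 12 are assumptions made for illustration; the thread does not say how many zeros were used. The sample line is built from the "Eric Newton" signature in this thread (62 occurrences in 37 books in 2008).

```python
def ngram_line_to_entry(line, width=12):
    """Split one 'ngram TAB year TAB match_count TAB volume_count' line
    into the four key/value parts described in the message.

    `width` is a hypothetical padding width, not taken from the thread.
    """
    ngram, year, match_count, volume_count = line.rstrip("\n").split("\t")
    return {
        "row": ngram,
        "column_family": year,
        # Zero-pad the ASCII count so that lexicographic order
        # matches numeric order.
        "column_qualifier": match_count.zfill(width),
        "value": volume_count,
    }


entry = ngram_line_to_entry("Eric Newton\t2008\t62\t37\n")
print(entry)
```

Zero-padding matters because the keys are compared as bytes: without it, "100" would sort before "62".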
>
> I'm not sure if the google entries are sorted; if they are not, the sort
> would help compression.
>
> The RFile format does not repeat identical data from key to key, so in
> most cases, the row is not repeated.  That gives gzip other things to work
> on.
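
A toy illustration of the idea of not repeating identical data from key to key. This is not RFile's on-disk format; it only shows how a field that matches the previous key can be replaced by a marker (here `None`) before a general-purpose compressor sees the data. The function name and the sample keys are invented for the example.

```python
def relative_encode(keys):
    """Replace each field that equals the same field of the previous key
    with None. `keys` is a list of (row, family, qualifier) tuples,
    assumed to be already sorted."""
    encoded = []
    prev = (None, None, None)
    for key in keys:
        encoded.append(
            tuple(None if field == old else field
                  for field, old in zip(key, prev))
        )
        prev = key
    return encoded


# Made-up sample keys: two years for one row, then a second row.
sample_keys = [
    ("a b", "2007", "01"),
    ("a b", "2008", "03"),
    ("a c", "2008", "03"),
]
encoded = relative_encode(sample_keys)
for original, relative in zip(sample_keys, encoded):
    print(original, "->", relative)
```

With sorted n-gram data, the row repeats for every year of the same n-gram, so in this toy scheme most rows would collapse to a marker.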
>
> I'll have to do more analysis to figure out why RFile did so well.
>  Perhaps google used less aggressive settings for their compression.
>
> I'm more interested in 2-grams to test our partial-row compression in 1.5.
>
> -Eric
>
>
> On Fri, May 3, 2013 at 4:09 PM, Jared Winick <[email protected]> wrote:
>
>> That is very interesting and sounds like a fun Friday project! Could you
>> please elaborate on how you mapped the original format of
>>
>> ngram TAB year TAB match_count TAB volume_count NEWLINE
>>
>> into Accumulo key/values? Could you briefly explain what feature in
>> Accumulo is responsible for this improvement in storage efficiency? This
>> could be a helpful illustration for users to know how key/value design can
>> take advantage of these Accumulo features. Thanks a lot!
>>
>> Jared
>>
>>
>> On Fri, May 3, 2013 at 1:24 PM, Eric Newton <[email protected]> wrote:
>>
>>> I think David Medinets suggested some publicly available data sources
>>> that could be used to compare the storage requirements of different
>>> key/value stores.
>>>
>>> Today I tried it out.
>>>
>>> I took the google 1-gram word lists and ingested them into accumulo.
>>>
>>> http://storage.googleapis.com/books/ngrams/books/datasetsv2.html
>>>
>>> It took about 15 minutes to ingest on a 10 node cluster (4 drives each).
>>>
>>> $ hadoop fs -du -s -h /data/googlebooks/ngrams/1-grams
>>> running...
>>> 5.2 G  /data/googlebooks/ngrams/1-grams
>>>
>>> $ hadoop fs -du -s -h /accumulo/tables/4
>>> running...
>>> 4.1 G  /accumulo/tables/4
>>>
>>> The storage format in accumulo is about 20% more efficient than gzip'd
>>> csv files.
>>>
>>> I'll post the 2-gram results sometime next month when it's done
>>> downloading. :-)
>>>
>>> -Eric, which occurred 221K times in 34K books in 2008.
>>>
>>
>>
>
