Re: interesting

Eric Newton Mon, 20 May 2013 09:35:20 -0700

The version that will be 1.5.0... more-or-less.


On Sun, May 19, 2013 at 10:13 PM, Jim Klucar <[email protected]> wrote:

> Eric, what version of Accumulo did you use? I'm assuming 1.5.0
>
>
> On Wed, May 15, 2013 at 5:20 PM, Josh Elser <[email protected]> wrote:
>
>> Definitely, with a note on the ingest job duration, too.
>>
>>
>> On 05/15/2013 04:27 PM, Christopher wrote:
>>
>>> I'd be very curious how something faster, like Snappy, compared.
>>>
>>> --
>>> Christopher L Tubbs II
>>> http://gravatar.com/ctubbsii
>>>
>>>
>>> On Wed, May 15, 2013 at 2:52 PM, Eric Newton <[email protected]> wrote:
>>>
>>>> I don't intend to do that.
>>>>
>>>>
>>>> On Wed, May 15, 2013 at 12:11 PM, Josh Elser <[email protected]> wrote:
>>>>
>>>>> Just kidding, re-read the rest of this. Let me try again:
>>>>>
>>>>> Any intents to retry this with different compression codecs?
>>>>>
>>>>>
>>>>> On 5/15/13 12:00 PM, Josh Elser wrote:
>>>>>
>>>>>> RFile... with gzip? Or did you use another compressor?
>>>>>>
>>>>>> On 5/15/13 10:58 AM, Eric Newton wrote:
>>>>>>
>>>>>>> I ingested the 2-gram data on a 10 node cluster.  It took just under
>>>>>>> 7 hours.  For most of the job, accumulo ingested at about 200K
>>>>>>> k-v/server.
>>>>>>>
>>>>>>> $ hadoop fs -dus /accumulo/tables/2 /data/n-grams/2-grams
>>>>>>> /accumulo/tables/2       74632273653
>>>>>>> /data/n-grams/2-grams    154271541304
>>>>>>>
>>>>>>> That's a very nice result.  RFile compressed the same data to half
>>>>>>> the size of the gzip'd CSV format.
>>>>>>>
>>>>>>> There are 37,582,158,107 entries in the 2-gram set, which means that
>>>>>>> accumulo is using only 2 bytes for each entry.
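The per-entry figure can be checked from the numbers quoted in this message. This is a minimal sketch in Python. It assumes the `-dus` output reads as 74,632,273,653 bytes for /accumulo/tables/2 and 154,271,541,304 bytes for the gzip'd source, a path/size split that is inferred here because it matches the ratios stated in the message. The variable names are illustrative only.

```python
# Sizes in bytes, as read from the `hadoop fs -dus` output in the message.
# The split between path and size is an assumption (see the note above).
rfile_bytes = 74_632_273_653       # /accumulo/tables/2
gzip_csv_bytes = 154_271_541_304   # /data/n-grams/2-grams
entries = 37_582_158_107           # entries in the 2-gram set

# Average on-disk cost of one key/value entry in the Accumulo table.
bytes_per_entry = rfile_bytes / entries

# Size of the Accumulo table relative to the gzip'd CSV source.
size_ratio = rfile_bytes / gzip_csv_bytes

print(f"{bytes_per_entry:.2f} bytes per entry")
print(f"table is {size_ratio:.0%} of the gzip'd CSV size")
```

Under that reading, the result comes out just under 2 bytes per entry and a little under half the size of the gzip'd source, which agrees with both claims in the message.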
>>>>>>>
>>>>>>> -Eric Newton, which appeared 62 times in 37 books in 2008.
>>>>>>>
>>>>>>>
>>>>>>> On Fri, May 3, 2013 at 7:20 PM, Eric Newton <[email protected]> wrote:
>>>>>>>
>>>>>>>      ngram == row
>>>>>>>      year == column family
>>>>>>>      count == column qualifier (prepended with zeros for sort)
>>>>>>>      book count == value
>>>>>>>
>>>>>>>      I used ascii text for the counts, even.
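The mapping listed above can be sketched as a small Python function that turns one line of the original tab-separated format into the four key/value parts. This is only an illustration of the layout, not the ingest code that was actually used, and it does not call any Accumulo API. The `Entry` type, the `to_entry` helper, and the padding width `COUNT_WIDTH` are hypothetical; the thread does not say how many zeros were prepended.

```python
from typing import NamedTuple


class Entry(NamedTuple):
    """One key/value pair in the layout described above (hypothetical type)."""
    row: str        # the ngram
    family: str     # the year
    qualifier: str  # the match count, zero-padded so it sorts numerically
    value: str      # the book (volume) count, kept as ASCII text


# Assumed padding width; the thread does not state the real one.
COUNT_WIDTH = 12


def to_entry(line: str) -> Entry:
    """Map 'ngram TAB year TAB match_count TAB volume_count' to an Entry."""
    ngram, year, match_count, volume_count = line.rstrip("\n").split("\t")
    return Entry(ngram, year, match_count.zfill(COUNT_WIDTH), volume_count)


# Sample values taken from the signature line elsewhere in the thread.
entry = to_entry("Eric Newton\t2008\t62\t37\n")
print(entry)
```

Zero-padding matters because the qualifier is compared as text: without it, "9" would sort after "62".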
>>>>>>>
>>>>>>>      I'm not sure if the google entries are sorted, so the sort would
>>>>>>>      help compression.
>>>>>>>
>>>>>>>      The RFile format does not repeat identical data from key to key,
>>>>>>>      so in most cases, the row is not repeated.  That gives gzip other
>>>>>>>      things to work on.
>>>>>>>
>>>>>>>      I'll have to do more analysis to figure out why RFile did so well.
>>>>>>>      Perhaps google used less aggressive settings for their compression.
>>>>>>>
>>>>>>>      I'm more interested in 2-grams to test our partial-row compression
>>>>>>>      in 1.5.
>>>>>>>
>>>>>>>      -Eric
>>>>>>>
>>>>>>>
>>>>>>>      On Fri, May 3, 2013 at 4:09 PM, Jared Winick <[email protected]> wrote:
>>>>>>>
>>>>>>>          That is very interesting and sounds like a fun Friday project!
>>>>>>>          Could you please elaborate on how you mapped the original
>>>>>>>          format of
>>>>>>>
>>>>>>>          ngram TAB year TAB match_count TAB volume_count NEWLINE
>>>>>>>
>>>>>>>          into Accumulo key/values? Could you briefly explain what feature
>>>>>>>          in Accumulo is responsible for this improvement in storage
>>>>>>>          efficiency? This could be a helpful illustration for users to
>>>>>>>          know how key/value design can take advantage of these Accumulo
>>>>>>>          features. Thanks a lot!
>>>>>>>
>>>>>>>          Jared
>>>>>>>
>>>>>>>
>>>>>>>          On Fri, May 3, 2013 at 1:24 PM, Eric Newton <[email protected]> wrote:
>>>>>>>
>>>>>>>              I think David Medinets suggested some publicly available
>>>>>>>              data sources that could be used to compare the storage
>>>>>>>              requirements of different key/value stores.
>>>>>>>
>>>>>>>              Today I tried it out.
>>>>>>>
>>>>>>>              I took the google 1-gram word lists and ingested them into
>>>>>>>              accumulo.
>>>>>>>
>>>>>>>              http://storage.googleapis.com/books/ngrams/books/datasetsv2.html
>>>>>>>
>>>>>>>              It took about 15 minutes to ingest on a 10 node cluster
>>>>>>>              (4 drives each).
>>>>>>>
>>>>>>>              $ hadoop fs -du -s -h /data/googlebooks/ngrams/1-grams
>>>>>>>              running...
>>>>>>>              5.2 G  /data/googlebooks/ngrams/1-grams
>>>>>>>
>>>>>>>              $ hadoop fs -du -s -h /accumulo/tables/4
>>>>>>>              running...
>>>>>>>              4.1 G  /accumulo/tables/4
>>>>>>>
>>>>>>>              The storage format in accumulo is about 20% more efficient
>>>>>>>              than gzip'd csv files.
>>>>>>>
>>>>>>>              I'll post the 2-gram results sometime next month when it's
>>>>>>>              done downloading. :-)
>>>>>>>
>>>>>>>              -Eric, which occurred 221K times in 34K books in 2008.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>
>
