So it is pretty brute force at ingest time to enable queries to be fast and
efficient. For each tweet it builds all 1-, 2-, and 3-grams from the message
in the tweet. So an example message of "i can has cheezburger" would be
translated into the following n-grams: "i", "can", "has", "cheezburger",
"i can", "can has", "has cheezburger", "i can has", "can has cheezburger".
Then for each n-gram, it keeps a daily and an hourly counter using a
SummingCombiner. The data model looks like:

rowId: n-gram
cf:    DAY or HOUR
cq:    date (ex. 20120425)
value: counter

So a single tweet turns into many key-values, one per n-gram/time period. I
would have to verify, but on average I think it works out to about 1 tweet
to 60 key-values. I see anywhere from a few hundred entries/sec inserted in
the middle of the night to about 2000 entries/sec during peak evening times.

I am not exactly sure how to answer the question about storage size per
tweet, as I am not actually storing the original tweet, and if a counter
already exists for an n-gram/time period, incrementing that counter doesn't
increase the storage size. I can follow up with the current storage I am
using, though.

Aaron, I am using EBS now and I haven't seen any problems; that said, my
load is obviously not extreme. When I initially moved things from my home
workstation to EC2, I had a few months of tweets to ingest. For that
initial ingest I did run with local instance storage, as I saw extremely
variable performance when I first tried EBS. The instance storage was
better, though not as good as what I see on bare metal.

Jared

On Wed, Apr 25, 2012 at 7:43 AM, Aaron Cordova <[email protected]> wrote:
> Speaking of storage - are you using EBS or local instance storage?
>
> On Apr 25, 2012, at 8:52 AM, Eric Newton wrote:
>
> How many key-values does a single tweet become, on average? What's the
> storage size per tweet?
>
> On Wed, Apr 25, 2012 at 12:17 AM, Jared winick <[email protected]> wrote:
>
>> Thanks for the kind words, I appreciate it. Keith, my ingest process
>> was down on Mar 19-20, so that is why I am missing data for that
>> period.
>>
>> For those who are curious, I am receiving about 1.2 million tweets a
>> day and have about 3 billion entries in my main table. I am actually
>> getting by with everything running on an EC2 medium instance, which is
>> obviously very far from ideal but I am trying to stay on a budget.
>>
>> I hope to add new features as time allows, things like near real-time
>> trending and geospatial analytics. If anyone has any ideas for
>> features they think would be interesting, just let me know or add them
>> as issues on the github page.
>>
>> On Tue, Apr 24, 2012 at 11:40 AM, Billie J Rinaldi
>> <[email protected]> wrote:
>> > That's so cool that I'm creating a new section for it on our page of
>> > links:
>> > http://accumulo.apache.org/papers.html
>> >
>> > Billie
>> >
>> > On Tuesday, April 24, 2012 9:35:31 AM, "Jared winick"
>> > <[email protected]> wrote:
>> >> I gave an Introduction to Apache Accumulo presentation last month at
>> >> the Boulder/Denver Meetup where I demoed an application that used
>> >> Accumulo to provide real-time and historical access to words/phrases
>> >> seen in Twitter messages as well as daily trend analysis. I finally
>> >> got the demo polished up a bit and running on Amazon EC2 where it can
>> >> be found at http://trendulo.com .
>> >>
>> >> Trendulo is still pretty Alpha at this point so please feel free to
>> >> add to the existing documented issues at
>> >> https://github.com/jaredwinick/trendulo where you can also obviously
>> >> find the source.
>> >>
>> >> As an example, the following link will show the launch of Instagram's
>> >> Android client, followed by Facebook's purchase and then a small
>> >> increase in general "chatter" about the product: http://goo.gl/XcCG8
>> >>
>> >> Let me know if anyone has any questions or comments. Feel free to
>> >> tweet @trendulo any interesting searches and I can retweet them out.
>> >>
>> >> Jared
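For readers who want the mechanics of the ingest expansion spelled out, here is a minimal Python sketch of what the thread describes. The function names, the string-encoded date buckets, and the client-side `summing_combine` helper are assumptions for illustration only; Trendulo's actual ingest runs against the Accumulo Java API, and the summing happens server-side via the table's SummingCombiner iterator, not in the client.

```python
from collections import Counter
from datetime import datetime

def ngrams(message, max_n=3):
    """Build all 1- through max_n-grams from a tweet message.
    (Real ingest would also normalize case, punctuation, etc.)"""
    words = message.split()
    return [" ".join(words[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(words) - n + 1)]

def tweet_entries(message, when):
    """Expand one tweet into (rowId, cf, cq) counter increments:
    rowId = n-gram, cf = DAY or HOUR, cq = the time bucket
    (e.g. 20120425 for DAY, 2012042507 for HOUR). Each entry
    carries an implicit value of 1."""
    day = when.strftime("%Y%m%d")
    hour = when.strftime("%Y%m%d%H")
    entries = []
    for gram in ngrams(message):
        entries.append((gram, "DAY", day))
        entries.append((gram, "HOUR", hour))
    return entries

def summing_combine(entries):
    """What the SummingCombiner effectively does at scan/compaction
    time: collapse duplicate keys by summing their counters."""
    return Counter(entries)
```

For the "i can has cheezburger" example, `ngrams` yields the nine n-grams listed in the mail, and the tweet becomes 18 key-values (9 n-grams x 2 time granularities); a typical longer tweet has around 30 n-grams, which is roughly consistent with the ~60 key-values per tweet quoted above.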
