So it is pretty brute force at ingest time to enable queries to be fast and
efficient. For each tweet it builds all 1-, 2-, and 3-grams from the message
in the tweet. So an example message of "i can has cheezburger" would be
translated into the following n-grams: "i", "can", "has", "cheezburger",
"i can", "can has", "has cheezburger", "i can has", "can has cheezburger".
Then for each n-gram, it keeps a daily and an hourly counter using a
SummingCombiner. The data model looks like:

rowId: n-gram
cf:    DAY or HOUR
cq:    date (ex. 20120425)
value: counter

So a single tweet turns into many key-values, one per n-gram/time period. I
would have to verify, but on average I think it works out to about 1 tweet
to 60 key-values. I see anywhere from a few hundred entries/sec inserted in
the middle of the night to about 2000 entries/sec during peak evening times.

I am not exactly sure how to answer the question about storage size per
tweet, as I am not actually storing the original tweet, and if a counter
already exists for an n-gram/time period, incrementing that counter doesn't
increase the storage size. I can follow up with the current storage I am
using, though.

Aaron, I am using EBS now and I haven't seen any problems; that said, my
load is obviously not extreme. When I initially moved things from my home
workstation to EC2, I had a few months of tweets to ingest. For that
initial ingest I did run with local instance storage, as I saw extremely
variable performance when I first tried EBS. The instance storage was
better, though not as good as what I see on bare metal.

Jared

On Wed, Apr 25, 2012 at 7:43 AM, Aaron Cordova <[email protected]> wrote:
> Speaking of storage - are you using EBS or local instance storage?
>
> On Apr 25, 2012, at 8:52 AM, Eric Newton wrote:
>
> How many key-values does a single tweet become, on average? What's the
> storage size per tweet?
>
> On Wed, Apr 25, 2012 at 12:17 AM, Jared winick <[email protected]> wrote:
>
>> Thanks for the kind words, I appreciate it. Keith, my ingest process
>> was down on Mar 19-20, so that is why I am missing data for that
>> period.
>>
>> For those who are curious, I am receiving about 1.2 million tweets a
>> day and have about 3 billion entries in my main table. I am actually
>> getting by with everything running on an EC2 medium instance, which is
>> obviously very far from ideal but I am trying to stay on a budget.
>>
>> I hope to add new features as time allows, things like near real-time
>> trending and geospatial analytics. If anyone has any ideas for
>> features they think would be interesting, just let me know or add them
>> as issues on the github page.
>>
>> On Tue, Apr 24, 2012 at 11:40 AM, Billie J Rinaldi
>> <[email protected]> wrote:
>> > That's so cool that I'm creating a new section for it on our page of
>> > links:
>> > http://accumulo.apache.org/papers.html
>> >
>> > Billie
>> >
>> > On Tuesday, April 24, 2012 9:35:31 AM, "Jared winick"
>> > <[email protected]> wrote:
>> >> I gave an Introduction to Apache Accumulo presentation last month at
>> >> the Boulder/Denver Meetup where I demoed an application that used
>> >> Accumulo to provide real-time and historical access to words/phrases
>> >> seen in Twitter messages as well as daily trend analysis. I finally
>> >> got the demo polished up a bit and running on Amazon EC2 where it can
>> >> be found at http://trendulo.com .
>> >>
>> >> Trendulo is still pretty Alpha at this point so please feel free to
>> >> add to the existing documented issues at
>> >> https://github.com/jaredwinick/trendulo where you can also obviously
>> >> find the source.
>> >>
>> >> As an example, the following link will show the launch of Instagram's
>> >> Android client, followed by Facebook's purchase and then a small
>> >> increase in general "chatter" about the product: http://goo.gl/XcCG8
>> >>
>> >> Let me know if anyone has any questions or comments. Feel free to
>> >> tweet @trendulo any interesting searches and I can retweet them out.
>> >>
>> >> Jared
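For readers who want the mechanics of the ingest expansion spelled out, here is a minimal Python sketch of what the thread describes. The function names, the string-encoded date buckets, and the client-side `summing_combine` helper are assumptions for illustration only; Trendulo's actual ingest runs against the Accumulo Java API, and the summing happens server-side via the table's SummingCombiner iterator, not in the client.

```python
from collections import Counter
from datetime import datetime

def ngrams(message, max_n=3):
    """Build all 1- through max_n-grams from a tweet message.
    (Real ingest would also normalize case, punctuation, etc.)"""
    words = message.split()
    return [" ".join(words[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(words) - n + 1)]

def tweet_entries(message, when):
    """Expand one tweet into (rowId, cf, cq) counter increments:
    rowId = n-gram, cf = DAY or HOUR, cq = the time bucket
    (e.g. 20120425 for DAY, 2012042507 for HOUR). Each entry
    carries an implicit value of 1."""
    day = when.strftime("%Y%m%d")
    hour = when.strftime("%Y%m%d%H")
    entries = []
    for gram in ngrams(message):
        entries.append((gram, "DAY", day))
        entries.append((gram, "HOUR", hour))
    return entries

def summing_combine(entries):
    """What the SummingCombiner effectively does at scan/compaction
    time: collapse duplicate keys by summing their counters."""
    return Counter(entries)
```

For the "i can has cheezburger" example, `ngrams` yields the nine n-grams listed in the mail, and the tweet becomes 18 key-values (9 n-grams x 2 time granularities); a typical longer tweet has around 30 n-grams, which is roughly consistent with the ~60 key-values per tweet quoted above.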
