Here is an up-to-date estimate. I had naively reported disk usage as the "Disk Used" field under the Accumulo Master section of the monitor. It currently appears I am only actually using ~26 GB of storage for my Accumulo tables. This is based on the "% Used" * "Unreplicated Capacity" fields in the NameNode section of the monitor, which is also corroborated by looking at the file system usage for the HDFS data directories. I have no other data in HDFS.
Dec 24 - Apr 30 = 128 days
3.0 billion entries / 128 days = 23.4 million entries/day
23.4 million entries/day / 1.2 million tweets/day ~ 20 entries/tweet
(Not sure if I misrepresented the number of tweets per day as 3 million before, but it is about 1.2 million.)
26 GB / (128 * 1.2e6) ~ 182 bytes/tweet (interpreting 26 GB as 26 * 1024**3 bytes, the same convention used below)

I am using the VARLEN encoding for the SummingCombiner, which probably saves a lot of space, as I would imagine there are a lot of entries with a very small count since the language used on Twitter is far from normal. (Sketches of the arithmetic and of the combiner setup are at the bottom of this message, below the quoted thread.)

On Fri, Apr 27, 2012 at 1:09 PM, Eric Newton <[email protected]> wrote:
>
> On Wed, Apr 25, 2012 at 3:10 PM, Jared winick <[email protected]> wrote:
>
>> I am not exactly sure how to answer the question about storage size per
>> tweet, as I am not actually storing the original tweet, and if a counter
>> already exists for an n-gram/time period, then incrementing that counter
>> doesn't increase the storage size. I can follow up with the current
>> storage I am using, though.
>>
>
> I see I can make some estimates based on the information in your talk. The
> slides are awesome, btw.
>
> Using the information you provided: Dec 24 - March 12... that's 88 days.
> 2.6e9 entries, 3 million-ish tweets per day:
>
> 2.6e9 / (3e6 * 88)
>
> ~10 entries per tweet.
>
> Also, you report disk usage of 72G, which I will interpret as 72 * (1024
> ** 3) bytes.
>
> So, each tweet, on average, occupies: 72G / (88 * 3e6), or ~300 bytes.
>
> -Eric
>
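In case anyone wants to check the math, here is the arithmetic above as a self-contained Java snippet (the class name is just for illustration; all of the figures are the ones quoted in this thread):

public class TweetStorageEstimate {
    public static void main(String[] args) {
        // Figures from the thread above.
        double days = 128;                            // Dec 24 - Apr 30 (2012 is a leap year)
        double entries = 3.0e9;                       // total entries in the table
        double tweetsPerDay = 1.2e6;
        double usedBytes = 26.0 * 1024 * 1024 * 1024; // 26 GB, same 1024**3 convention as the 72G figure

        System.out.printf("entries/day:   %.1f million%n", entries / days / 1e6);      // ~23.4
        System.out.printf("entries/tweet: %.1f%n", entries / (days * tweetsPerDay));   // ~19.5
        System.out.printf("bytes/tweet:   %.0f%n", usedBytes / (days * tweetsPerDay)); // ~182
    }
}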

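And for anyone curious about the VARLEN setup, here is a minimal sketch of attaching a SummingCombiner with VARLEN encoding to a table using the standard Accumulo client API. The instance name, ZooKeeper host, credentials, table name, and iterator name/priority are all hypothetical:

import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.IteratorSetting;
import org.apache.accumulo.core.client.ZooKeeperInstance;
import org.apache.accumulo.core.iterators.Combiner;
import org.apache.accumulo.core.iterators.LongCombiner;
import org.apache.accumulo.core.iterators.user.SummingCombiner;

public class AttachVarlenSummingCombiner {
    public static void main(String[] args) throws Exception {
        // Hypothetical instance name, ZooKeeper host, and credentials.
        Connector connector = new ZooKeeperInstance("instance", "zkhost:2181")
                .getConnector("user", "secret".getBytes());

        // Iterator name and priority are arbitrary choices.
        IteratorSetting setting = new IteratorSetting(10, "sum", SummingCombiner.class);

        // VARLEN writes each count as a variable-length long, so the many
        // n-grams with tiny counts cost a byte or two instead of a fixed 8.
        SummingCombiner.setEncodingType(setting, LongCombiner.Type.VARLEN);

        // Sum across all columns; a real setup could instead restrict this
        // to the count columns with SummingCombiner.setColumns(...).
        Combiner.setCombineAllColumns(setting, true);

        // "ngrams" is a hypothetical table name.
        connector.tableOperations().attachIterator("ngrams", setting);
    }
}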