[twitter-dev] Re: Help estimating tweets per day...

Scott Haneda Wed, 14 Oct 2009 14:06:29 -0700


On Oct 14, 2009, at 8:38 AM, Kyle B wrote:

I am creating a mathematical model based on some results from
Twitter's API, but I am missing one critical number in the model.  I
need to estimate the number of total tweets in the USA each day. The
better an estimate I get and the less assumptions I make, the more
useful the model will be (it will be published for the public to
use).  I have been told that this type of information is important and
usually kept secret by internet start ups.  Understanding this, I have
come up with a work around that is not yet accurate enough so I am
looking for your advice.

I am very interested in this data, as it is a metric I will need with a service I am working on. I had no idea how to get this accurately, so I was going to look for services that are making educated guesses.

Idea:

I gather data from Twitter's search API at least once an hour.

I do not think this is your best approach. The search API seems to have a lot of flux to it. The public timeline would be better, but still not perfect. I think you want the streaming API, which I will elaborate on below.

My
idea is to store the first tweet ID I see each day, and subtract it
from the ID of the previous day to estimate the number of tweets per
day.  I have three problems here:

1. How are tweet IDs incremented?  Do they increase by a factor of 1,
2, 5, 10...?

It does not matter. I would guess they are incremented by one. Just given the 32/64 bit counter issue, it would appear it has to be 1, or they would have breached the 32 bit limits a long time ago.

This does not take into account any number of technical things Twitter may do on their end such as distributed databases, completely de-normalized data in order to deal with the massive volume they have, and most importantly, users deleting tweets.

2. I need an estimate for the number of private/protected users
assuming each private user's tweet gets an ID number.  This is
required because I am sampling the public tweets.

I am not sure you need this in your calculation. If you only read public tweets, and have a way to count them accurately, there should not be a need to subtract out any private ones. I do not believe you can ever arrive at an accurate count by subtracting first and last tweet. You have time zones to contend with as well.

3. I need to estimate the number of tweets coming from overseas.  I am
modeling the USA.  This is less of a problem than the previous two.

This will be hard, as there is no mandate stating you must set your location, let alone set it to something accurate. It has been suggested that changing your location to a false one is a good way to keep some politically charged countries from preventing conversational discourse.
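For reference, the ID-subtraction idea from the quoted message can be sketched as below. This is only a rough illustration: it assumes IDs increase by exactly one per tweet (my guess above, not something Twitter has confirmed), that you captured a first ID on consecutive days, and it ignores deleted and protected tweets. The function name and the sample IDs are made up for the example.

```python
def estimate_daily_tweets(first_ids):
    """Estimate tweets per day from the first tweet ID seen each day.

    first_ids maps an ISO date string (YYYY-MM-DD) to the first tweet ID
    observed on that day. Assumes IDs increase by one per tweet, which is
    a guess and not a documented guarantee.
    """
    days = sorted(first_ids)  # ISO date strings sort chronologically
    estimates = {}
    for prev_day, next_day in zip(days, days[1:]):
        # Tweets attributed to prev_day = gap to the next day's first ID
        estimates[prev_day] = first_ids[next_day] - first_ids[prev_day]
    return estimates


# Hypothetical IDs, for illustration only
sample = {"2009-10-12": 1000, "2009-10-13": 1600, "2009-10-14": 2500}
daily = estimate_daily_tweets(sample)
print(daily)
```

The result is a count of all IDs issued worldwide in that window, so it still needs the private and overseas adjustments discussed above.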

If you look at the streaming API, you can define a set of parameters that will allow a stream of data to come in. It is a lot of data, in my opinion, a boatload of data. It also sounds just like what you need. Hit the streaming API, open a socket, and start reading in the data. Of course, you will not want to read it all, but determine some batch you want to grab, and some schedule you want to grab it on.

You can then extrapolate your numbers from there. Perhaps a 1 minute read of the data every 5 minutes would get you where you need to go. Then you could determine patterns in usage and adjust accordingly.
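As a minimal sketch of that extrapolation, assuming each sample is just a pair of (seconds observed, tweets counted) and that the samples are spread evenly over the day, the arithmetic could look like this. The function name and the sample numbers are hypothetical.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def extrapolate_daily_total(samples):
    """Scale sampled counts up to a full day.

    samples is a list of (seconds_observed, tweets_counted) pairs, for
    example one 60-second read taken every 5 minutes.
    """
    total_seconds = sum(seconds for seconds, _ in samples)
    total_tweets = sum(count for _, count in samples)
    if total_seconds <= 0:
        raise ValueError("need at least one sample with a positive duration")
    rate_per_second = total_tweets / total_seconds
    return rate_per_second * SECONDS_PER_DAY


# Three hypothetical 60-second reads
samples = [(60, 120), (60, 180), (60, 150)]
print(extrapolate_daily_total(samples))
```

If the reads are not spread evenly across the day, the estimate will lean toward whatever hours were sampled most, which is where watching for usage patterns comes in.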

From what I understand, the streaming API is in fact a full stream, always on. Not only do you have to make sure that stream stays open, and reconnect it if it closes, you should also think of it as a YouTube video playing all day long. Even if you only want a chunk of the data, you are still moving all that data across your wire. If you are in any way bandwidth constrained, be careful: your bills and resource usage could go through the roof.
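The "keep it open, reconnect if it closes" part might be sketched like this. It is not tied to any real Twitter client: `open_stream` is a hypothetical callable you would supply that returns an iterator of tweets and raises `ConnectionError` when the connection drops, and the backoff numbers are arbitrary.

```python
import time


def consume_with_reconnect(open_stream, handle, max_retries=5,
                           base_delay=1.0, max_delay=60.0, sleep=time.sleep):
    """Read items from a stream, reconnecting with exponential backoff.

    open_stream is a hypothetical callable returning an iterator of items;
    handle is called once per item. Gives up after max_retries consecutive
    failed attempts.
    """
    retries = 0
    while True:
        try:
            for item in open_stream():
                handle(item)
                retries = 0  # a successful read resets the failure count
            return  # the stream ended without an error
        except ConnectionError:
            retries += 1
            if retries > max_retries:
                raise
            # Wait 1s, 2s, 4s, ... capped at max_delay before reconnecting
            sleep(min(max_delay, base_delay * 2 ** (retries - 1)))
```

Passing `sleep` in as a parameter is only there so the waiting can be swapped out when trying the function without a live connection.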

This of course is just a small technical hurdle. You could open and close the stream as needed, but you may sacrifice some accuracy by doing so.

If there is any chance you could contact me off list, address below, and keep me posted on your data when it goes public, I would be very appreciative.

--
Scott * If you contact me off list replace talklists@ with scott@ *
