On Oct 14, 2009, at 8:38 AM, Kyle B wrote:
I am creating a mathematical model based on some results from Twitter's API, but I am missing one critical number in the model. I need to estimate the total number of tweets in the USA each day. The better the estimate and the fewer assumptions I make, the more useful the model will be (it will be published for the public to use). I have been told that this type of information is important and usually kept secret by Internet startups. Understanding this, I have come up with a workaround that is not yet accurate enough, so I am looking for your advice.
I am very interested in this data, as it is a metric I will need with a service I am working on. I had no idea how to get this accurately, so I was going to look for services that are making educated guesses.
Idea: I gather data from Twitter's search API at least once an hour.
I do not think this is your best approach. The search API seems to have a lot of flux to it. The public timeline would be better, but still not perfect. I think you want the streaming API, which I will elaborate on below.
My idea is to store the first tweet ID I see each day, and subtract the previous day's first ID from it to estimate the number of tweets per day. I have three problems here: 1. How are tweet IDs incremented? Do they increase in steps of 1, 2, 5, 10...?
It does not matter. I would guess they are incremented by one. Given the 32-bit/64-bit counter issue, it would appear the step has to be 1, or they would have exceeded the 32-bit limit a long time ago.
This does not take into account any number of technical things Twitter may do on their end, such as distributed databases, completely denormalized data to deal with the massive volume they have, and, most importantly, users deleting tweets.
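For what it is worth, the subtraction idea itself is only a few lines. This is a rough sketch, assuming IDs really do go up by exactly one per tweet (my guess above, not something I can confirm) and ignoring the caveats I just listed. The function name and the input format are made up for illustration.

```python
def daily_id_differences(first_ids):
    """Estimate tweets per day from the first tweet ID seen on each day.

    first_ids: list of (day_label, first_tweet_id) tuples in chronological
    order. ASSUMPTION: IDs increase by exactly 1 per tweet; if they do not,
    these differences are not tweet counts.
    """
    estimates = {}
    # Pair each day with the following day; the gap between their first IDs
    # approximates how many IDs were handed out during the earlier day.
    for (day, start_id), (_next_day, next_id) in zip(first_ids, first_ids[1:]):
        estimates[day] = next_id - start_id
    return estimates
```

The last day in the list gets no estimate, since there is no following ID to subtract from. Keep in mind the result would count every ID issued worldwide, protected accounts and deleted tweets included, not just public tweets from the USA.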
2. I need an estimate for the number of private/protected users assuming each private user's tweet gets an ID number. This is required because I am sampling the public tweets.
I am not sure you need this in your calculation. If you only read public tweets and have a way to count them accurately, there should be no need to subtract out any private ones. I do not believe you can ever arrive at an accurate count by subtracting first and last tweet IDs. You have time zones to contend with as well.
3. I need to estimate the number of tweets coming from overseas. I am modeling the USA. This is less of a problem than the previous two.
This will be hard, as nothing requires you to set your location, let alone set it to something accurate. It has been suggested that changing your location to a false one is a good way to keep some politically charged countries from suppressing conversation.
If you look at the streaming API, you can define a set of parameters that will allow a stream of data to come in. It is a lot of data; in my opinion, a boatload of data. It also sounds like just what you need. Hit the streaming API, open a socket, and start reading in the data. Of course, you will not want to read it all, so decide on some batch you want to grab and some schedule you want to grab it on.
You can then extrapolate your numbers from there. Perhaps a 1 minute read of the data every 5 minutes would get you where you need to go. Then you could determine patterns in usage and adjust accordingly.
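To make the extrapolation concrete, here is a minimal sketch of the arithmetic in Python. It assumes you have already counted the tweets seen in each sampling window; the function name, the 60-second default, and the stream_fraction parameter (for the case where the stream you read carries only part of all tweets) are my own placeholders, not anything from Twitter.

```python
SECONDS_PER_DAY = 24 * 60 * 60

def estimate_daily_total(sample_counts, sample_seconds=60, stream_fraction=1.0):
    """Scale tweet counts from short sampling windows up to a full day.

    sample_counts: tweets counted in each window (e.g. one 60-second read
    every 5 minutes). ASSUMPTION: the windows are representative of the
    whole day, which is only true if they are spread evenly across it.
    stream_fraction: hypothetical share of all tweets the stream delivers.
    """
    if not sample_counts:
        raise ValueError("need at least one sample window")
    if not 0 < stream_fraction <= 1:
        raise ValueError("stream_fraction must be in (0, 1]")
    observed_seconds = len(sample_counts) * sample_seconds
    # Average rate over the observed time, projected onto 86400 seconds.
    daily = sum(sample_counts) * SECONDS_PER_DAY / observed_seconds
    return daily / stream_fraction
```

Usage tends to swing a lot by hour, so samples taken only during the afternoon would overstate the day. Spreading the windows around the clock, or estimating each hour separately and summing, should hold up better.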
From what I understand, the streaming API is in fact a full stream, always on. Not only do you have to make sure that stream stays open and reconnect it if it closes, but you should also think of it as a YouTube video playing all day long. Even if you only want a chunk of the data, you are still moving all that data across your wire. If you are in any way bandwidth constrained, be careful: your bills and resource usage could go through the roof.
This of course is just a small technical hurdle. You could open and close the stream as needed, but you may sacrifice some accuracy by doing so.
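As for keeping the stream open and reconnecting when it drops, the loop could look roughly like the sketch below. It is not tied to any real Twitter client: open_stream is a hypothetical callable that you would supply, returning an iterable of lines from the connection, and the retry limits are arbitrary numbers I picked.

```python
import time

def read_with_reconnect(open_stream, handle_line, max_retries=5,
                        base_delay=1.0, sleep=time.sleep):
    """Read lines from a stream, reconnecting with exponential backoff.

    open_stream: HYPOTHETICAL callable that opens the connection and
    returns an iterable of lines; it stands in for the real streaming call.
    handle_line: callable invoked once per line received.
    """
    failures = 0
    while True:
        try:
            for line in open_stream():
                failures = 0  # data arrived, so reset the backoff
                handle_line(line)
            return  # the stream ended cleanly
        except OSError:  # includes ConnectionError and socket errors
            failures += 1
            if failures > max_retries:
                raise  # give up after too many consecutive failures
            # Wait 1x, 2x, 4x, ... base_delay before trying again.
            sleep(base_delay * 2 ** (failures - 1))
```

If you only read for one minute out of five, handle_line is where you would count tweets and decide when to stop. Backing off between attempts matters, since hammering the service with immediate reconnects is a good way to get yourself blocked.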
If there is any chance you could contact me off list, address below, and keep me posted on your data when it goes public, I would be very appreciative.
-- Scott * If you contact me off list replace talklists@ with scott@ *