I am creating a mathematical model based on some results from
Twitter's API, but I am missing one critical number in the model. I
need to estimate the total number of tweets sent in the USA each day.
The better the estimate and the fewer assumptions I make, the more
useful the model will be (it will be published for the public to
use). I have been told that this kind of figure is valuable and
usually kept secret by internet startups. With that in mind, I have
come up with a workaround, but it is not yet accurate enough, so I am
looking for your advice.
I gather data from Twitter's search API at least once an hour. My
idea is to store the first tweet ID I see each day and subtract the
previous day's first ID from it to estimate the number of tweets per
day. I have three problems here:
1. How are tweet IDs incremented? Do they go up in steps of 1, 2,
5, 10...?
2. I need an estimate of the number of private/protected users,
assuming each private user's tweet also gets an ID number. This is
required because I am only sampling public tweets.
3. I need to estimate the number of tweets coming from outside the
USA, since I am modeling only the USA. This is less of a problem than
the previous two.
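To make the idea concrete, here is a minimal sketch of the ID-difference
estimate in Python. It rests on assumptions I cannot confirm: that tweet
IDs are handed out sequentially with a fixed step, and that protected
tweets consume IDs too. The function name and the parameters id_step,
public_fraction and usa_fraction are hypothetical placeholders for the
three unknowns listed above, not values I have.

```python
from datetime import date


def estimate_daily_tweets(first_ids, id_step=1, public_fraction=1.0,
                          usa_fraction=1.0):
    """Estimate public USA tweets per day from the first tweet ID seen each day.

    first_ids:       dict mapping a date to the first tweet ID observed that day.
    id_step:         ASSUMED increment between consecutive tweet IDs (problem 1).
    public_fraction: ASSUMED share of tweets that are public (problem 2).
    usa_fraction:    ASSUMED share of tweets that originate in the USA (problem 3).
    """
    if id_step <= 0:
        raise ValueError("id_step must be positive")

    days = sorted(first_ids)
    estimates = {}
    for prev_day, day in zip(days, days[1:]):
        # Only consecutive days give a one-day difference; skip gaps.
        if (day - prev_day).days != 1:
            continue
        # IDs consumed between the two first-seen tweets.
        id_delta = first_ids[day] - first_ids[prev_day]
        # Convert the ID range to a tweet count, then scale by the assumed shares.
        total_tweets = id_delta / id_step
        estimates[prev_day] = total_tweets * public_fraction * usa_fraction
    return estimates


# Made-up IDs and fractions, purely for illustration.
sample_ids = {
    date(2010, 1, 1): 1000,
    date(2010, 1, 2): 3000,
    date(2010, 1, 4): 9000,  # Jan 3 is missing, so Jan 2 gets no estimate
}
sample_estimates = estimate_daily_tweets(
    sample_ids, id_step=2, public_fraction=0.9, usa_fraction=0.5
)
for day, count in sample_estimates.items():
    print(day, count)
```

The estimate is only as good as the three scaling inputs, and because I
poll hourly, the "first ID of the day" can be off by up to an hour's
worth of tweets at each end.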
Thanks for your time. Any help/advice is appreciated!