[twitter-dev] Re: Help estimating tweets per day...
On Oct 14, 2009, at 8:38 AM, Kyle B wrote:

> I am creating a mathematical model based on some results from Twitter's API, but I am missing one critical number in the model. I need to estimate the number of total tweets in the USA each day. The better an estimate I get and the fewer assumptions I make, the more useful the model will be (it will be published for the public to use). I have been told that this type of information is important and usually kept secret by internet start-ups. Understanding this, I have come up with a workaround that is not yet accurate enough, so I am looking for your advice.

I am very interested in this data, as it is a metric I will need for a service I am working on. I had no idea how to get this accurately, so I was going to look for services that are making educated guesses.

> Idea: I gather data from Twitter's search API at least once an hour.

I do not think this is your best approach. The search API seems to have a lot of flux to it. The public timeline would be better, but still not perfect. I think you want the streaming API, which I will elaborate on below.

> My idea is to store the first tweet ID I see each day, and subtract the previous day's first ID from it to estimate the number of tweets per day. I have three problems here:
> 1. How are tweet IDs incremented? Do they increase by a factor of 1, 2, 5, 10...?

It does not matter. I would guess they are incremented by one. Just given the 32/64-bit counter issue, it would appear it has to be 1, or they would have breached the 32-bit limit a long time ago. This does not take into account any number of technical things Twitter may do on their end, such as distributed databases, completely denormalized data to deal with the massive volume they have, and, most importantly, users deleting tweets.

> 2. I need an estimate for the number of private/protected users assuming each private user's tweet gets an ID number. This is required because I am sampling the public tweets.

I am not sure you need this in your calculation. If you only read public tweets and have a way to count them accurately, there should be no need to subtract out any private ones. I do not believe you can ever arrive at an accurate count by subtracting the first and last tweet IDs. You have time zones to contend with as well.

> 3. I need to estimate the number of tweets coming from overseas. I am modeling the USA. This is less of a problem than the previous two.

This will be hard, as there is no mandate stating you must set your location, let alone set it to something accurate. It has been suggested that setting a false location is a good way to get around politically charged countries that try to prevent conversational discourse.

If you look at the streaming API, you can define a set of parameters that will allow a stream of data to come in. It is a lot of data; in my opinion, a boatload of data. It also sounds just like what you need. Hit the streaming API, open a socket, and start reading in the data. Of course, you will not want to read it all, but determine some batch you want to grab, and some schedule you want to grab it on. You can then extrapolate your numbers from there. Perhaps a 1-minute read of the data every 5 minutes would get you where you need to go. Then you could determine patterns in usage and adjust accordingly.

From what I understand, the streaming API is in fact a full stream, always on. Not only do you have to make sure the stream stays open and reconnect it if it closes, you should also think of it as a YouTube video playing all day long. Even if you only want a chunk of the data, you are still moving all that data across your wire. If you are in any way bandwidth constrained, be careful: your bills and resource usage could go through the roof. This of course is just a small technical hurdle; you could open and close the stream as needed, but you may sacrifice some accuracy by doing so.
If there is any chance you could contact me off list, address below, and keep me posted on your data when it goes public, I would be very appreciative. -- Scott * If you contact me off list replace talklists@ with scott@ *
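The read-a-batch-and-extrapolate idea in this message (for example, a 1-minute read every 5 minutes) comes down to scaling the counted tweets by the time actually spent reading. What follows is a minimal sketch under stated assumptions, not a definitive implementation: the function name `estimate_daily_tweets`, its parameters, and in particular `stream_fraction` (the share of all public tweets the stream really delivers, which this thread does not establish) are hypothetical and chosen only for illustration.

```python
def estimate_daily_tweets(window_counts, window_seconds=60.0,
                          stream_fraction=1.0):
    """Extrapolate a daily tweet total from short, periodic reads of a stream.

    window_counts   -- tweets counted in each read window (hypothetical data)
    window_seconds  -- length of each read window, in seconds
    stream_fraction -- ASSUMED share of all tweets the stream delivers, in (0, 1]
    """
    if not window_counts:
        raise ValueError("need at least one read window")
    if not 0.0 < stream_fraction <= 1.0:
        raise ValueError("stream_fraction must be in (0, 1]")

    # Average tweets per second over the time we were actually reading.
    observed_seconds = window_seconds * len(window_counts)
    rate_per_second = sum(window_counts) / observed_seconds

    # Scale the observed rate to a full day, then correct for the
    # (assumed) fraction of tweets the stream carries.
    seconds_per_day = 24 * 60 * 60
    return rate_per_second * seconds_per_day / stream_fraction
```

For instance, three 1-minute reads that counted 120, 90, and 150 tweets give a rate of 2 tweets per second, so `estimate_daily_tweets([120, 90, 150])` returns 172800.0. The estimate is only as good as the spread of the read windows across the day, since usage varies by hour.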
[twitter-dev] Re: Help estimating tweets per day...
On Wed, Oct 14, 2009 at 8:38 AM, Kyle B kylebarn...@gmail.com wrote:

> 1. How are tweet IDs incremented? Do they increase by a factor of 1, 2, 5, 10...?

I've asked that question previously and the answer was a definitive "We aren't telling." It seems to be considered a significant enough trade secret that I wouldn't be at all surprised if they are skipping IDs randomly to prevent people from doing exactly what you're seeking to do. Nor would I be surprised if they refuse to say a word about it now.

Short of figuring out an indirect approach, I don't think you'll be able to come up with an accurate number.

Nick
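If IDs only ever increase but may skip values, the difference between the first ID seen on consecutive days is at best an upper bound on that day's tweets, not a count. A minimal sketch of that bookkeeping follows; the function name `daily_id_spans` and the input mapping are hypothetical, and the code assumes (without confirmation from Twitter) that IDs increase over time.

```python
from datetime import date, timedelta


def daily_id_spans(first_id_by_day):
    """Return {day: ID span} for each day whose following day is also recorded.

    first_id_by_day -- mapping of datetime.date to the first tweet ID seen
                       that day (hypothetical data gathered by the caller)

    The span is an UPPER BOUND on the day's tweets: skipped IDs, protected
    tweets, and deleted tweets would all inflate it.
    """
    spans = {}
    for day, first_id in first_id_by_day.items():
        next_id = first_id_by_day.get(day + timedelta(days=1))
        if next_id is None:
            continue  # no record for the following day, so no span
        if next_id < first_id:
            raise ValueError("IDs are assumed to increase over time")
        spans[day] = next_id - first_id
    return spans
```

Days without a recorded successor are simply left out rather than interpolated, so a gap in collection does not silently produce a multi-day span.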
[twitter-dev] Re: Help estimating tweets per day...
Thanks for the info. It helps a lot. Figuring out an accurate number is essential to my model, so much so that I am determined to find some method of estimating it to acceptable margins of error!

- Kyle

On Oct 14, 5:19 pm, Nick Arnett nick.arn...@gmail.com wrote:

> On Wed, Oct 14, 2009 at 8:38 AM, Kyle B kylebarn...@gmail.com wrote:
>
> > 1. How are tweet IDs incremented? Do they increase by a factor of 1, 2, 5, 10...?
>
> I've asked that question previously and the answer was a definitive "We aren't telling." It seems to be considered a significant enough trade secret that I wouldn't be at all surprised if they are skipping IDs randomly to prevent people from doing exactly what you're seeking to do. Nor would I be surprised if they refuse to say a word about it now.
>
> Short of figuring out an indirect approach, I don't think you'll be able to come up with an accurate number.
>
> Nick
[twitter-dev] Re: Help estimating tweets per day...
And you don't think the streaming API will answer that for you?

-- Scott * If you contact me off list replace talklists@ with scott@ *

On Oct 14, 2009, at 3:27 PM, Kyle B wrote:

> Thanks for the info. It helps a lot. Figuring out an accurate number is essential to my model, so much so that I am determined to find some method of estimating it to acceptable margins of error!
[twitter-dev] Re: Help estimating tweets per day...
On Wed, Oct 14, 2009 at 3:27 PM, Kyle B kylebarn...@gmail.com wrote:

> Thanks for the info. It helps a lot. Figuring out an accurate number is essential to my model, so much so that I am determined to find some method of estimating it to acceptable margins of error!

It occurs to me that perhaps this might not be so hard... and please do share your results with us.

Just test a good-sized sample of IDs and see how many don't exist. That will give you an idea of how many there really are. I'll be curious to see if you get consistent results from one day to the next. I won't be too surprised if you don't, which would mean that Twitter is skipping a random (or at least somewhat random) number of IDs each day.

However, if you want to continue to know this number, you'll have to continue to sample. And your sample might have to span multiple days to get a reliable answer.

And I hate to say this, because if they're not already doing it, this might make them start... Twitter could be monitoring for any process that repeatedly asks for deliberately non-existent IDs, in order to block it and maintain the obfuscation. Then you're stuck again, unless you can find a way around that defense. Assuming there are millions of IDs a day, you'll need a pretty good sample size if you want to maintain a good number.

The good news in all this is that IIRC, Twitter has guaranteed that IDs will increase chronologically.

The bad news is that I'm writing this off the top of my head and there's probably an easy defense I haven't thought of, which somebody at Twitter will think of just because they see this conversation.

Put 'em on double-secret probation, I say.

Nick
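The sampling idea above can be sketched as follows: draw random IDs from a day's ID range, check how many exist, and scale the ID span by that fraction. This is only an illustration under stated assumptions. `id_exists` is a hypothetical callable supplied by the caller (no real Twitter API call is shown or implied), the function name and parameters are invented for the example, and the margin uses the ordinary normal approximation for a 95% interval, which is a choice made here rather than anything from the thread.

```python
import math
import random


def estimate_existing_fraction(id_exists, low_id, high_id, sample_size,
                               seed=None):
    """Estimate what fraction of IDs in [low_id, high_id) actually exist.

    id_exists   -- HYPOTHETICAL callable: takes an ID, returns True/False
    low_id      -- first ID in the range (inclusive)
    high_id     -- end of the range (exclusive)
    sample_size -- number of distinct IDs to test
    seed        -- optional seed, for repeatable sampling

    Returns (fraction, margin), where margin is the half-width of an
    approximate 95% confidence interval (normal approximation).
    """
    if high_id <= low_id:
        raise ValueError("high_id must be greater than low_id")
    if sample_size <= 0:
        raise ValueError("sample_size must be positive")

    rng = random.Random(seed)
    # Never ask for more distinct IDs than the range contains.
    sample_size = min(sample_size, high_id - low_id)
    sampled_ids = rng.sample(range(low_id, high_id), sample_size)

    hits = sum(1 for tweet_id in sampled_ids if id_exists(tweet_id))
    fraction = hits / sample_size
    margin = 1.96 * math.sqrt(fraction * (1.0 - fraction) / sample_size)
    return fraction, margin
```

Multiplying `fraction` by `high_id - low_id` then gives an estimated tweet count for the range, with `margin` scaled the same way. Since the margin shrinks only with the square root of the sample size, halving it takes roughly four times as many lookups, which is why a range of millions of IDs a day calls for a fairly large sample.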
[twitter-dev] Re: Help estimating tweets per day...
On Wed, Oct 14, 2009 at 3:56 PM, Scott Haneda talkli...@newgeo.com wrote:

> And you don't think the streaming API will answer that for you?

It can't, can it? It isn't the complete stream, only a sampled subset. There's no way to know which IDs were skipped in order to obfuscate the actual number of tweets. A missing ID could either just have not been sampled or not exist.

Nick
[twitter-dev] Re: Help estimating tweets per day...
On Wed, Oct 14, 2009 at 4:10 PM, Nick Arnett nick.arn...@gmail.com wrote:

> On Wed, Oct 14, 2009 at 3:27 PM, Kyle B kylebarn...@gmail.com wrote:
>
> > Thanks for the info. It helps a lot. Figuring out an accurate number is essential to my model, so much so that I am determined to find some method of estimating it to acceptable margins of error!

A couple more thoughts dawned on me.

First, if the approach I'm suggesting violates the TOS, please realize that it is not my intention to encourage anybody to violate the TOS.

Second, thinking more evil-like, one way around the kind of defense I imagined would be to distribute the problem -- find a bunch of people who would like the same data and coordinate the testing to see what percentage of IDs actually exist. Did I just describe a DDOS? Please, no.

Another possible evil defense -- there's a fake tweet generator at Twitter, really messing with the statistics: tweets that are ONLY visible to people who try to retrieve them via IDs that appear nowhere in public. A honey-trap, in other words.

I've spent too much time working with intelligence agencies.

Nick