[twitter-dev] Re: Help estimating tweets per day...

2009-10-14 Thread Scott Haneda


On Oct 14, 2009, at 8:38 AM, Kyle B wrote:


I am creating a mathematical model based on some results from
Twitter's API, but I am missing one critical number in the model.  I
need to estimate the total number of tweets in the USA each day. The
better the estimate and the fewer assumptions I make, the more
useful the model will be (it will be published for the public to
use).  I have been told that this type of information is important and
usually kept secret by Internet startups.  Understanding this, I have
come up with a workaround that is not yet accurate enough, so I am
looking for your advice.


I am very interested in this data, as it is a metric I will need with  
a service I am working on.  I had no idea how to get this accurately,  
so I was going to look for services that are making educated guesses.



Idea:

I gather data from Twitter's search API at least once an hour.


I do not think this is your best approach.  The search API seems to  
have a lot of flux to it.  The public timeline would be better, but  
still not perfect.  I think you want the streaming API, which I will  
elaborate on below.



My
idea is to store the first tweet ID I see each day, and subtract it
from the ID of the previous day to estimate the number of tweets per
day.  I have three problems here:

1. How are tweet IDs incremented?  Do they increase by a factor of 1,
2, 5, 10...?


It does not matter.  I would guess they are incremented by one. Just  
given the 32/64 bit counter issue, it would appear it has to be 1, or  
they would have breached the 32 bit limits a long time ago.


This does not take into account any number of technical things Twitter
may do on their end, such as distributed databases, completely
denormalized data to deal with the massive volume they have,
and, most importantly, users deleting tweets.
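
A minimal sketch of the subtraction idea, assuming IDs grow by a fixed step and that some known share of the IDs in a range are real tweets. The function name, the `step` and `fill_fraction` parameters, and the example IDs are hypothetical placeholders; Twitter has confirmed none of these values.

```python
def estimate_daily_tweets(first_id_today, first_id_yesterday,
                          step=1, fill_fraction=1.0):
    """Estimate tweets posted between two daily first-seen IDs.

    step: assumed ID increment per tweet (unknown; 1 is only a guess).
    fill_fraction: assumed share of IDs in the range that are real tweets
                   (below 1.0 if IDs are skipped or tweets were deleted).
    """
    if step <= 0:
        raise ValueError("step must be positive")
    if not 0.0 < fill_fraction <= 1.0:
        raise ValueError("fill_fraction must be in (0, 1]")
    id_span = first_id_today - first_id_yesterday
    if id_span < 0:
        raise ValueError("today's ID must not be smaller than yesterday's")
    return id_span / step * fill_fraction


# Made-up IDs, for illustration only.
print(estimate_daily_tweets(4_850_000_000, 4_840_000_000))
```

The result is only as good as the two assumed parameters, which is why the increment question matters.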



2. I need an estimate for the number of private/protected users
assuming each private user's tweet gets an ID number.  This is
required because I am sampling the public tweets.


I am not sure you need this in your calculation.  If you only read
public tweets and have a way to count them accurately, there should
be no need to subtract out any private ones.  I do not believe you
can ever arrive at an accurate count by subtracting the first and last
tweet IDs.  You have time zones to contend with as well.



3. I need to estimate the number of tweets coming from overseas.  I am
modeling the USA.  This is less of a problem than the previous two.


This will be hard, as there is no mandate stating you must set your
location, let alone set it to something accurate.  It has been
suggested that changing your location to a false one is a good way to
get around politically charged countries that try to block
conversational discourse.


If you look at the streaming API, you can define a set of parameters  
that will allow a stream of data to come in.  It is a lot of data, in  
my opinion, a boatload of data.  It also sounds just like what you  
need.  Hit the streaming API, open a socket, and start reading in the  
data.  Of course, you will not want to read it all, but determine some  
batch you want to grab, and some schedule you want to grab it on.


You can then extrapolate your numbers from there.  Perhaps a 1 minute  
read of the data every 5 minutes would get you where you need to go.   
Then you could determine patterns in usage and adjust accordingly.
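
A rough sketch of that extrapolation, assuming the read windows are equally long and spread evenly across the day. The function name and the counts are hypothetical, not taken from real data.

```python
SECONDS_PER_DAY = 86_400


def extrapolate_daily_total(window_counts, window_seconds=60):
    """Scale tweet counts from short read windows up to a 24-hour total.

    window_counts: number of tweets counted in each sampled window.
    window_seconds: length of every window (assumed identical).
    """
    if not window_counts:
        raise ValueError("need at least one sampled window")
    sampled_seconds = len(window_counts) * window_seconds
    rate_per_second = sum(window_counts) / sampled_seconds
    return rate_per_second * SECONDS_PER_DAY


# Three hypothetical one-minute reads.
print(extrapolate_daily_total([600, 900, 300]))
```

If the windows cluster around busy or quiet hours, the total is biased, so the schedule matters as much as the window length.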


From what I understand, the streaming API is in fact a full stream,
always on.  Not only do you have to make sure that stream stays open
and reconnect it if it closes, you should also think of it as a
YouTube video playing all day long.  Even if you only want a chunk of
the data, you are still moving all that data across your wire.  If you
are in any way bandwidth constrained, be careful: your bills and
resource usage could go through the roof.


This of course is just a small technical hurdle; you could open and
close the stream as needed, but you may sacrifice some accuracy by
doing so.
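
A minimal sketch of keeping a stream open, assuming the caller supplies a `connect` function that returns an iterable of messages and raises `ConnectionError` when the connection drops. `read_with_reconnect`, `connect`, and `handle` are hypothetical names; this is not the streaming API's own client.

```python
import time


def read_with_reconnect(connect, handle, max_retries=5,
                        base_delay=1.0, sleep=time.sleep):
    """Read messages from connect(), reconnecting with exponential backoff.

    connect: hypothetical callable returning an iterable of messages.
    handle: callable invoked once per message.
    Gives up and re-raises after max_retries consecutive failures.
    """
    failures = 0
    while True:
        try:
            for message in connect():
                handle(message)
                failures = 0  # a successful read resets the backoff
            return  # the stream ended cleanly
        except ConnectionError:
            failures += 1
            if failures > max_retries:
                raise
            # Wait 1x, 2x, 4x, ... base_delay before reconnecting.
            sleep(base_delay * 2 ** (failures - 1))
```

Any tweets sent while the loop is sleeping are missed, which is the accuracy cost mentioned above.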


If there is any chance you could contact me off list, address below,  
and keep me posted on your data when it goes public, I would be very  
appreciative.

--
Scott * If you contact me off list replace talklists@ with scott@ *



[twitter-dev] Re: Help estimating tweets per day...

2009-10-14 Thread Nick Arnett
On Wed, Oct 14, 2009 at 8:38 AM, Kyle B kylebarn...@gmail.com wrote:



 1. How are tweet IDs incremented?  Do they increase by a factor of 1,
 2, 5, 10...?


I've asked that question previously and the answer was a definitive "We
aren't telling."  It seems to be considered a significant enough trade
secret that I wouldn't be at all surprised if they are skipping IDs randomly
to prevent people from doing exactly what you're seeking to do.  Nor would I
be surprised if they refuse to say a word about it now.

Short of figuring out an indirect approach, I don't think you'll be able to
come up with an accurate number.

Nick


[twitter-dev] Re: Help estimating tweets per day...

2009-10-14 Thread Kyle B

Thanks for the info. It helps a lot.  Figuring out an accurate number
is essential to my model, so much so that I am determined to find some
method of estimating it to acceptable margins of error!

- Kyle




[twitter-dev] Re: Help estimating tweets per day...

2009-10-14 Thread Scott Haneda


And you don't think the streaming API will answer that for you?
--
Scott * If you contact me off list replace talklists@ with scott@ *





[twitter-dev] Re: Help estimating tweets per day...

2009-10-14 Thread Nick Arnett
On Wed, Oct 14, 2009 at 3:27 PM, Kyle B kylebarn...@gmail.com wrote:


 Thanks for the info. It helps a lot.  Figuring out an accurate number
 is essential to my model, so much so that I am determined to find some
 method of estimating it to acceptable margins of error!


It occurs to me that perhaps this might not be so hard... and please do
share your results with us.

Just test a good-sized sample of IDs and see how many don't exist.  That
will give you an idea of how many there really are.  I'll be curious to see
if you get consistent results from one day to the next.  I won't be too
surprised if you don't, which would mean that Twitter is skipping a
random (or at least somewhat random) number of IDs each day.

However, if you want to continue to know this number, you'll have to
continue to sample.  And your sample might have to span multiple days to get
a reliable answer.

And I hate to say this, because if they're not already doing it, this might
make them start... Twitter could be monitoring for any process that
repeatedly asks for deliberately non-existent IDs, in order to block them,
to maintain the obfuscation.  Then you're stuck again, unless you can find a
way around that defense.

Assuming there are millions of IDs a day, you'll need a fairly large sample
if you want to keep the estimate reliable.
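
A rough sketch of the sampling idea, assuming a caller-supplied `id_exists` check (hypothetical; no real lookup call is shown) and a normal-approximation 95% margin of error. All names and numbers are illustrative.

```python
import math
import random


def estimate_existing_fraction(id_low, id_high, sample_size,
                               id_exists, rng=random):
    """Probe random IDs in [id_low, id_high]; estimate the share that exist.

    id_exists: hypothetical callable, True if a tweet with that ID exists.
    Returns (fraction, margin), where margin is the approximate 95%
    half-width from the normal approximation to the binomial.
    """
    hits = sum(1 for _ in range(sample_size)
               if id_exists(rng.randint(id_low, id_high)))
    fraction = hits / sample_size
    margin = 1.96 * math.sqrt(fraction * (1 - fraction) / sample_size)
    return fraction, margin


def estimate_tweet_count(id_low, id_high, fraction):
    """Scale the size of the ID range by the estimated existing fraction."""
    return (id_high - id_low + 1) * fraction


# Stand-in check for demonstration: pretend every even ID exists.
fraction, margin = estimate_existing_fraction(
    1, 1_000_000, 10_000, lambda i: i % 2 == 0, rng=random.Random(0))
print(fraction, margin)
```

The margin shrinks with the square root of the sample size, so halving the error takes roughly four times as many probes.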

The good news in all this is that IIRC, Twitter has guaranteed that IDs will
increase chronologically.

The bad news is that I'm writing this off the top of my head and there's
probably an easy defense I haven't thought of, which somebody at Twitter
will think of just because they see this conversation.

Put 'em on double-secret probation, I say.

Nick


[twitter-dev] Re: Help estimating tweets per day...

2009-10-14 Thread Nick Arnett
On Wed, Oct 14, 2009 at 3:56 PM, Scott Haneda talkli...@newgeo.com wrote:


 And you don't think the streaming API will answer that for you?


It can't, can it?  It isn't the complete stream, only a sampled subset.
There's no way to know which IDs were skipped in order to obfuscate the
actual number of tweets.  A missing ID could either simply not have been
sampled or not exist at all.
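
To make the ambiguity concrete, here is a small sketch assuming the stream samples tweets uniformly at some rate. The function name and both scenarios are hypothetical.

```python
import math


def observed_id_density(fill_fraction, sample_rate):
    """Share of an ID range that shows up in a uniformly sampled stream.

    fill_fraction: assumed share of IDs that are real tweets.
    sample_rate: assumed share of real tweets the stream delivers.
    """
    return fill_fraction * sample_rate


# No skipped IDs with a 5% sample looks the same as
# half the IDs skipped with a 10% sample.
dense = observed_id_density(1.0, 0.05)
sparse = observed_id_density(0.5, 0.10)
print(math.isclose(dense, sparse))
```

Only the product is observable from the stream, so neither factor can be recovered without outside information.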

Nick


[twitter-dev] Re: Help estimating tweets per day...

2009-10-14 Thread Nick Arnett
On Wed, Oct 14, 2009 at 4:10 PM, Nick Arnett nick.arn...@gmail.com wrote:



 On Wed, Oct 14, 2009 at 3:27 PM, Kyle B kylebarn...@gmail.com wrote:


 Thanks for the info. It helps a lot.  Figuring out an accurate number
 is essential to my model, so much so that I am determined to find some
 method of estimating it to acceptable margins of error!



A couple more thoughts dawned on me.

First, if the approach I'm suggesting violates the TOS, please realize that
it is not my intention to encourage anybody to violate the TOS.

Second, thinking more evil-like, one way around the kind of defense I
imagined would be to distribute the problem -- find a bunch of people who
would like the same data and coordinate the testing to see what percentage
of IDs actually exist.

Did I just describe a DDOS?  Please, no.

Another possible evil defense -- there's a fake tweet generator at Twitter,
really messing with the statistics; tweets that are ONLY visible to people
who try to retrieve them via IDs that appear nowhere in public.  A
honey-trap, in other words.

I've spent too much time working with intelligence agencies.

Nick