[twitter-dev] Re: How to insure that all tweets are retrieved in a search?

Matt Sanford Mon, 13 Jul 2009 09:12:19 -0700

Hi there,

    Some comments in-line:


On Jul 13, 2009, at 8:51 AM, owkaye wrote:

First, I wouldn't expect that thousands are going to post
your promo code per minute. That doesn't seem realistic.


Hi John,

It's more than just a promo code.  There are other aspects
of this promotion that might create an issue with thousands
of tweets per minute.  If it happens and I haven't planned
ahead to deal with it, then I'm screwed because some data
will be missing that I really should be retrieving, and
apparently I won't have any way to retrieve it later.

Second, you can use the /track method on the Streaming
API, which will return all keyword matches up to a certain
limit with no other rate limiting.


I guess this is what I need ... unless you or someone can
reduce or eliminate the Search API limits.  It really seems
inappropriate to tie up a connection for streaming data 24
hours a day when I do not need streaming data.

Streaming server connections are quite cheap for Twitter, so tying one up is much less work on the server side than repeated queries.


All I really need is a search that doesn't restrict me so
much.  If I had this capability I could easily minimize my
promotion's impact on Twitter by 2-3 orders of magnitude.
From my perspective this seems like something Twitter might
want to support, but then again I do not work at Twitter so
I'm not as familiar with their priorities as you are.

Contact us if the default limits are an issue.


I'm only guessing that they will become a problem, but it is
very clear to me how easily they might become a problem.

The unfortunate situation here is that *IF* these limits
become a problem it's already too late to do anything about
it -- because by then I've permanently lost access to some
of the data I need -- and even though the data is still in
your database there's no way for me to get it out because
the search restrictions get in the way again.

It's just that the API is so limited that the techniques I
might use with any other service are simply not available at
Twitter.  For example, imagine this which is a far better
scenario for my needs:

I run ONE search every day for my search terms, and Twitter
responds with ALL the matching records no matter how many
there are -- not just 100 per page or 1500 results per
search but ALL matches, even if there are hundreds of
thousands of them.

We tried allowing access to follower information in a one-query method like this and it failed. The main reason is that when there are tens of thousands of matches things start timing out. While all matches sounds like a perfect solution, in practice staying connected for minutes at a time and pulling down an unbounded size result set has not proved to be a scalable solution.


If this were possible I could easily do only one search per
day and store the results in a local database.  Then the
next day I could run the same search again -- and limit this
new search to the last 24 hours so I don't have to retrieve
any of the same records I retrieved the previous day.

Can you imagine how much LESS this would impact Twitter's
servers when I do not have to keep a connection open 24
hours a day as with Streaming API ... and I do not have to
run repetitive searches every few seconds all day long as
with Search API?  The load savings on your servers would be
huge, not to mention the bandwidth savings!!!

---------------------------------------------------------

The bottom line here is that I hope you have people who
understand this situation and are working to improve it, but
in the meantime my only options appear to be:

1- Use the Streaming API which is clearly an inferior method
for me because a broken connection will cause me to lose
important data without warning.

2- Hope that someone at Twitter can "raise the limits" for
me on their Search API so I can achieve my goals without
running thousands of searches every day.

There is no way for anyone at Twitter to change the pagination limits without changing them across the board.

As a side note: The pagination limits exist as a technical limit and not something meant to stifle creativity/usefulness. When you go back in time we have to read data from disk and replace recent data in memory with that older data. The pagination limit is there to prevent too much of our memory space being taken up by old data that a very small percentage of requests need.
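
To make the limits mentioned in this thread concrete (100 results per page, 1500 results per search), here is a rough sketch of polling within them while keeping a local copy. It is an illustration only: `fetch_page` is a hypothetical callable standing in for one Search API request, the idea of passing a "newer than this id" marker (`since_id`) is an assumption of the sketch, and `poll_once` is not anything Twitter ships.

```python
from typing import Callable, Dict, List

PAGE_SIZE = 100   # results per page, as quoted in this thread
MAX_PAGES = 15    # 15 pages x 100 results = the 1500-result cap

# Hypothetical signature: fetch_page(since_id, page) returns a list of
# tweets newer than since_id, newest first, each a dict with an integer "id".
FetchPage = Callable[[int, int], List[dict]]


def poll_once(fetch_page: FetchPage, since_id: int, store: Dict[int, dict]) -> int:
    """Collect every tweet newer than since_id that the page cap allows.

    Returns the new high-water mark to pass as since_id on the next poll.
    """
    newest = since_id
    for page in range(1, MAX_PAGES + 1):
        tweets = fetch_page(since_id, page)
        for tweet in tweets:
            store[tweet["id"]] = tweet        # keyed by id, so repeats collapse
            newest = max(newest, tweet["id"])
        if len(tweets) < PAGE_SIZE:
            break                             # short page: nothing older remains
    return newest
```

If all 15 pages come back full, the polling interval was too long for the tweet volume and anything older than the cap is out of reach, which is the failure case described earlier in this thread.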

---------------------------------------------------------

As you can see I'm trying to find the best way to get the
data I need while minimizing the impact on Twitter, that's
why I'm making comments / suggestions like the ones in this
email.

So who should I contact at Twitter to see if they can raise
the search limits for me?  Are you the man?  If not, please
let me know who I should contact and how.

You can email api AT twitter.com for things like this, but as stated above the pagination limit is not something that has a "white list". The streaming API really is the most scalable solution.
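
On the worry about a broken streaming connection losing data: one possible approach is to remember the highest tweet id seen, reconnect with a backoff, and fill the gap with a search for anything newer than that id. The sketch below is only an outline under stated assumptions. `open_stream`, `backfill` and `handle` are hypothetical callables supplied by the caller (they are not Twitter API names), ids are assumed to increase over time, and a dropped connection is assumed to surface as Python's `ConnectionError`.

```python
import time


def consume(open_stream, backfill, handle,
            max_failures=5, base_delay=1.0, sleep=time.sleep):
    """Read tweets from a stream, reconnecting and backfilling after drops.

    open_stream() -> iterable of tweet dicts (hypothetical streaming connection)
    backfill(since_id) -> list of tweet dicts newer than since_id (hypothetical search)
    handle(tweet) -> called once per new tweet
    Returns the highest tweet id handled.
    """
    last_id = 0
    failures = 0   # consecutive failed connections with no new tweet in between

    def deliver(tweet):
        nonlocal last_id, failures
        if tweet["id"] > last_id:     # skip anything already handled
            handle(tweet)
            last_id = tweet["id"]
            failures = 0              # progress was made, reset the backoff

    while failures < max_failures:
        try:
            stream = open_stream()    # reconnect first, then fill the gap
            if last_id:
                for tweet in sorted(backfill(last_id), key=lambda t: t["id"]):
                    deliver(tweet)
            for tweet in stream:
                deliver(tweet)
            return last_id            # stream ended cleanly
        except ConnectionError:
            failures += 1
            sleep(base_delay * 2 ** (failures - 1))   # exponential backoff
    return last_id
```

Because the sketch drops any tweet whose id is not above the last one handled, a stream that delivers tweets slightly out of order would need a small window of recently seen ids instead of a single high-water mark.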


Thanks!

Owkaye


Thanks;
 – Matt Sanford / @mzsanford
     Twitter Dev
