[twitter-dev] Re: How to insure that all tweets are retrieved in a search?

2009-07-13 Thread owkaye

 First, I wouldn't expect that thousands are going to post
 your promo code per minute. That doesn't seem realistic.

Hi John,

It's more than just a promo code.  There are other aspects 
of this promotion that might create an issue with thousands 
of tweets per minute.  If it happens and I haven't planned 
ahead to deal with it, then I'm screwed because some data 
will be missing that I really should be retrieving, and 
apparently I won't have any way to retrieve it later.


 Second, you can use the /track method on the Streaming
 API, which will return all keyword matches up to a certain
 limit with no other rate limiting. 

I guess this is what I need ... unless you or someone can 
reduce or eliminate the Search API limits.  It really seems 
inappropriate to tie up a connection for streaming data 24 
hours a day when I do not need streaming data.  

All I really need is a search that doesn't restrict me so 
much.  If I had this capability I could easily minimize my 
promotion's impact on Twitter by 2-3 orders of magnitude.  
From my perspective this seems like something Twitter might 
want to support, but then again I do not work at Twitter so 
I'm not as familiar with their priorities as you are.


 Contact us if the default limits are an issue.

I'm only guessing that they will become a problem, but it is 
very clear to me how easily they might become a problem.  

The unfortunate situation here is that *IF* these limits 
become a problem it's already too late to do anything about 
it -- because by then I've permanently lost access to some 
of the data I need -- and even though the data is still in 
your database there's no way for me to get it out because 
the search restrictions get in the way again.

It's just that the API is so limited that the techniques I 
might use with any other service are simply not available at 
Twitter.  For example, imagine this, which is a far better 
scenario for my needs:

I run ONE search every day for my search terms, and Twitter 
responds with ALL the matching records no matter how many 
there are -- not just 100 per page or 1500 results per 
search but ALL matches, even if there are hundreds of 
thousands of them.  

If this were possible I could easily do only one search per 
day and store the results in a local database.  Then the 
next day I could run the same search again -- and limit this 
new search to the last 24 hours so I don't have to retrieve 
any of the same records I retrieved the previous day.
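
The local-storage half of this scenario is straightforward. As a minimal 
sketch in Python (SQLite; the table layout is illustrative, and the daily 
fetch itself is assumed to have already produced the result list):

```python
import sqlite3

def store_results(conn, statuses):
    """Insert search results, skipping ids already stored."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS tweets (id INTEGER PRIMARY KEY, text TEXT)"
    )
    conn.executemany(
        "INSERT OR IGNORE INTO tweets (id, text) VALUES (:id, :text)",
        statuses,
    )
    conn.commit()

conn = sqlite3.connect(":memory:")
store_results(conn, [{"id": 1, "text": "day one"}, {"id": 2, "text": "day one"}])
# Next day's search overlaps the previous day; duplicates are ignored.
store_results(conn, [{"id": 2, "text": "day one"}, {"id": 3, "text": "day two"}])
count = conn.execute("SELECT COUNT(*) FROM tweets").fetchone()[0]
print(count)  # 3
```

Because the status id is the primary key, an overlapping window on the 
next day's search costs nothing but a few ignored inserts.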

Can you imagine how much LESS this would impact Twitter's 
servers when I do not have to keep a connection open 24 
hours a day as with Streaming API ... and I do not have to 
run repetitive searches every few seconds all day long as 
with Search API?  The load savings on your servers would be 
huge, not to mention the bandwidth savings!!!

-

The bottom line here is that I hope you have people who 
understand this situation and are working to improve it, but 
in the meantime my only options appear to be:

1- Use the Streaming API which is clearly an inferior method 
for me because a broken connection will cause me to lose 
important data without warning.

2- Hope that someone at Twitter can raise the limits for 
me on their Search API so I can achieve my goals without 
running thousands of searches every day.

-

As you can see I'm trying to find the best way to get the 
data I need while minimizing the impact on Twitter, that's 
why I'm making comments / suggestions like the ones in this 
email.

So who should I contact at Twitter to see if they can raise 
the search limits for me?  Are you the man?  If not, please 
let me know who I should contact and how.

Thanks!

Owkaye




[twitter-dev] Re: How to insure that all tweets are retrieved in a search?

2009-07-13 Thread Matt Sanford

Hi there,

Some comments in-line:

On Jul 13, 2009, at 8:51 AM, owkaye wrote:




First, I wouldn't expect that thousands are going to post
your promo code per minute. That doesn't seem realistic.


Hi John,

It's more than just a promo code.  There are other aspects
of this promotion that might create an issue with thousands
of tweets per minute.  If it happens and I haven't planned
ahead to deal with it, then I'm screwed because some data
will be missing that I really should be retrieving, and
apparently I won't have any way to retrieve it later.



Second, you can use the /track method on the Streaming
API, which will return all keyword matches up to a certain
limit with no other rate limiting.


I guess this is what I need ... unless you or someone can
reduce or eliminate the Search API limits.  It really seems
inappropriate to tie up a connection for streaming data 24
hours a day when I do not need streaming data.


Streaming server connections are quite cheap for Twitter so tying one  
up is much less work on the server side than repeated queries.




All I really need is a search that doesn't restrict me so
much.  If I had this capability I could easily minimize my
promotion's impact on Twitter by 2-3 orders of magnitude.
From my perspective this seems like something Twitter might
want to support, but then again I do not work at Twitter so
I'm not as familiar with their priorities as you are.



Contact us if the default limits are an issue.


I'm only guessing that they will become a problem, but it is
very clear to me how easily they might become a problem.

The unfortunate situation here is that *IF* these limits
become a problem it's already too late to do anything about
it -- because by then I've permanently lost access to some
of the data I need -- and even though the data is still in
your database there's no way for me to get it out because
the search restrictions get in the way again.

It's just that the API is so limited that the techniques I
might use with any other service are simply not available at
Twitter.  For example, imagine this which is a far better
scenario for my needs:

I run ONE search every day for my search terms, and Twitter
responds with ALL the matching records no matter how many
there are -- not just 100 per page or 1500 results per
search but ALL matches, even if there are hundreds of
thousands of them.


We tried allowing access to follower information in a one-query method  
like this and it failed. The main reason is that when there are tens  
of thousands of matches things start timing out. While all matches  
sounds like a perfect solution, in practice staying connected for  
minutes at a time and pulling down an unbounded size result set has  
not proved to be a scalable solution.




If this were possible I could easily do only one search per
day and store the results in a local database.  Then the
next day I could run the same search again -- and limit this
new search to the last 24 hours so I don't have to retrieve
any of the same records I retrieved the previous day.

Can you imagine how much LESS this would impact Twitter's
servers when I do not have to keep a connection open 24
hours a day as with Streaming API ... and I do not have to
run repetitive searches every few seconds all day long as
with Search API?  The load savings on your servers would be
huge, not to mention the bandwidth savings!!!

-

The bottom line here is that I hope you have people who
understand this situation and are working to improve it, but
in the meantime my only options appear to be:

1- Use the Streaming API which is clearly an inferior method
for me because a broken connection will cause me to lose
important data without warning.

2- Hope that someone at Twitter can raise the limits for
me on their Search API so I can achieve my goals without
running thousands of searches every day.


There is no way for anyone at Twitter to change the pagination limits  
without changing them across the board.


As a side note: The pagination limits exist as a technical limit and  
not something meant to stifle creativity/usefulness. When you go back  
in time we have to read data from disk and replace recent data in  
memory with that older data. The pagination limit is there to prevent  
too much of our memory space being taken up by old data that a very  
small percentage of requests need.



-

As you can see I'm trying to find the best way to get the
data I need while minimizing the impact on Twitter, that's
why I'm making comments / suggestions like the ones in this
email.

So who should I contact at Twitter to see if they can raise
the search limits for me?  Are you the man?  If not, please
let me know who I should contact and how.


You can email api AT twitter.com for things like this, but as stated  
above the pagination limit is not something that has a white list.  

[twitter-dev] Re: How to insure that all tweets are retrieved in a search?

2009-07-13 Thread John Kalucki

I concur with Matt.

Track in the Streaming API is, in part, intended for applications just
like yours. Hit the Search API and use track together to get the
highest proportion of statuses possible. The default track limit is
intended for human readable scale applications. Email me about
elevated track access for services.

It's possible that you are worrying about an unlikely event. Sustained
single topic statuses in the thousands per minute are usually limited
to things like massive social upheaval, big political events,
celebrity death, etc.

-John Kalucki
twitter.com/jkalucki
Services, Twitter Inc.




On Jul 13, 9:12 am, Matt Sanford m...@twitter.com wrote:
 Hi there,

      Some comments in-line:

 On Jul 13, 2009, at 8:51 AM, owkaye wrote:





  First, I wouldn't expect that thousands are going to post
  your promo code per minute. That doesn't seem realistic.

  Hi John,

  It's more than just a promo code.  There are other aspects
  of this promotion that might create an issue with thousands
  of tweets per minute.  If it happens and I haven't planned
  ahead to deal with it, then I'm screwed because some data
  will be missing that I really should be retrieving, and
  apparently I won't have any way to retrieve it later.

  Second, you can use the /track method on the Streaming
  API, which will return all keyword matches up to a certain
  limit with no other rate limiting.

  I guess this is what I need ... unless you or someone can
  reduce or eliminate the Search API limits.  It really seems
  inappropriate to tie up a connection for streaming data 24
  hours a day when I do not need streaming data.

 Streaming server connections are quite cheap for Twitter so tying one  
 up is much less work on the server side than repeated queries.





  All I really need is a search that doesn't restrict me so
  much.  If I had this capability I could easily minimize my
  promotion's impact on Twitter by 2-3 orders of magnitude.
  From my perspective this seems like something Twitter might
  want to support, but then again I do not work at Twitter so
  I'm not as familiar with their priorities as you are.

  Contact us if the default limits are an issue.

  I'm only guessing that they will become a problem, but it is
  very clear to me how easily they might become a problem.

  The unfortunate situation here is that *IF* these limits
  become a problem it's already too late to do anything about
  it -- because by then I've permanently lost access to some
  of the data I need -- and even though the data is still in
  your database there's no way for me to get it out because
  the search restrictions get in the way again.

  It's just that the API is so limited that the techniques I
  might use with any other service are simply not available at
  Twitter.  For example, imagine this which is a far better
  scenario for my needs:

  I run ONE search every day for my search terms, and Twitter
  responds with ALL the matching records no matter how many
  there are -- not just 100 per page or 1500 results per
  search but ALL matches, even if there are hundreds of
  thousands of them.

 We tried allowing access to follower information in a one-query method  
 like this and it failed. The main reason is that when there are tens  
 of thousands of matches things start timing out. While all matches  
 sounds like a perfect solution, in practice staying connected for  
 minutes at a time and pulling down an unbounded size result set has  
 not proved to be a scalable solution.





  If this were possible I could easily do only one search per
  day and store the results in a local database.  Then the
  next day I could run the same search again -- and limit this
  new search to the last 24 hours so I don't have to retrieve
  any of the same records I retrieved the previous day.

  Can you imagine how much LESS this would impact Twitter's
  servers when I do not have to keep a connection open 24
  hours a day as with Streaming API ... and I do not have to
  run repetitive searches every few seconds all day long as
  with Search API?  The load savings on your servers would be
  huge, not to mention the bandwidth savings!!!

  -

  The bottom line here is that I hope you have people who
  understand this situation and are working to improve it, but
  in the meantime my only options appear to be:

  1- Use the Streaming API which is clearly an inferior method
  for me because a broken connection will cause me to lose
  important data without warning.

  2- Hope that someone at Twitter can raise the limits for
  me on their Search API so I can achieve my goals without
  running thousands of searches every day.

 There is no way for anyone at Twitter to change the pagination limits  
 without changing them across the board.

 As a side note: The pagination limits exist as a technical limit and  
 not something meant to stifle creativity/usefulness. When you go back  
 in time we have to read data from disk and replace recent data in  
 memory with that older data. The pagination limit is there to prevent  
 too much of our memory space being taken up by old data that a very  
 small percentage of requests need.

[twitter-dev] Re: How to insure that all tweets are retrieved in a search?

2009-07-13 Thread owkaye

 We tried allowing access to follower information in a
 one-query method like this and it failed. The main reason
 is that when there are tens of thousands of matches
 things start timing out. While all matches sounds like a
 perfect solution, in practice staying connected for
 minutes at a time and pulling down an unbounded size
 result set has not proved to be a scalable solution.

Maybe a different data system would allow this capability.  
But you have the system you have so I understand why you've 
done what you've done.


 There is no way for anyone at Twitter to change the
 pagination limits without changing them across the board.

This is too bad.  Are you working on changing this in the 
future or is this going to be a limitation that persists for 
years to come?


 As a side note: The pagination limits exist as a
 technical limit and not something meant to stifle
 creativity/usefulness. When you go back in time we have
 to read data from disk and replace recent data in memory
 with that older data. The pagination limit is there to
 prevent too much of our memory space being taken up by
 old data that a very small percentage of requests need.

Okay, this makes sense.  It sounds like the original system 
designers never gave much consideration to the value of 
historical data search and retrieval.  Too bad there's 
nothing that can be done about this right now, but maybe in 
the future ... ?


 The streaming API really is the most scalable solution.

No doubt.  It's disappointing that my software probably 
cannot handle streaming data either, but that's my problem, 
not yours.

Does anyone have sample PHP code that successfully uses the 
twitter Streaming API to retrieve the stream and write it to 
a file or database?  I hate PHP but if it works then that's 
what I'll use, especially if some helpful soul can post some 
code to help me get started.  Thanks.
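
No PHP sample here, but as a hedged sketch of the general pattern in 
Python (the real HTTP connection and credentials are omitted; the 
function just consumes an iterable of raw lines, which is exactly what 
an open streaming response would provide, so the parsing logic can be 
exercised without a network connection):

```python
import json

def consume_stream(lines, sink):
    """Parse newline-delimited JSON statuses and hand each one to `sink`.

    `lines` would normally be the open HTTP response from the streaming
    endpoint; here it is any iterable of raw lines.  `sink` is whatever
    persists a status, e.g. an INSERT into the local database.
    """
    count = 0
    for raw in lines:
        raw = raw.strip()
        if not raw:          # streams send blank keep-alive lines
            continue
        status = json.loads(raw)
        sink(status)
        count += 1
    return count

stored = []
sample = ['{"id": 1, "text": "first"}', "", '{"id": 2, "text": "second"}']
n = consume_stream(sample, stored.append)
print(n, stored[1]["text"])  # 2 second
```

The same read-a-line, decode, persist loop translates directly to any 
language that can keep an HTTP connection open and read it line by line.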


Owkaye




[twitter-dev] Re: How to insure that all tweets are retrieved in a search?

2009-07-13 Thread owkaye

 I concur with Matt.

 Track in the Streaming API is, in part, intended for
 applications just like yours. Hit the Search API and use
 track together to get the highest proportion of statuses
 possible. The default track limit is intended for human
 readable scale applications. Email me about elevated
 track access for services.

I would use the Streaming API if I could, but now the 
problem is that my server side scripting language probably 
won't be able to use the Streaming API successfully ...

My software hasn't been upgraded in years, and when it was 
first coded streaming data via http didn't even exist.  The 
software has been upgraded once in a while over the past 
decade or so, but the last significant upgrade was more than 
5 years ago and it didn't have anything added to allow 
streaming data access at that time, so I doubt it can handle 
this task now.  

I have an email request in to the current owners but I doubt 
they know how it works either.  They never coded the 
original software or any of the upgrades.  They just bought 
the software without possessing the expertise to understand 
the code, so they really don't know how it works internally 
either.

My best guess is that it cannot write streaming data to a 
database as that data is transmitted, and that's what it 
needs to do if I have any chance of using the Streaming 
API instead of a search.  So I'll probably have to use some 
other software to accomplish this task.  

Any suggestions which software I should use to make this as 
fast and easy to code as possible?


 It's possible that you are worrying about an unlikely
 event. Sustained single topic statuses in the thousands
 per minute are usually limited to things like massive
 social upheaval, big political events, celebrity death,
 etc.

You may be correct, but to plan for the possibility that 
this may be bigger than expected is simply the way I do 
business.  It doesn't make sense for me to launch a promo 
like this until I'm prepared for the possibilities, right?


Owkaye






[twitter-dev] Re: How to insure that all tweets are retrieved in a search?

2009-07-09 Thread Chad Etzel

Yep, you gotta do 15 requests at 100 rpp each.
-Chad

On Thu, Jul 9, 2009 at 5:45 PM, owkaye owk...@gmail.com wrote:

 I'm building an app that uses the atom search API to retrieve recent
 posts which contain a specific keyword.  The API docs say:

 "Clients may request up to 1,500 statuses via the page and rpp
 parameters for the search method."

 But this 1500 hits per search cannot be done in a single request
 because of the rpp limit.  Instead I have to perform 15 sequential
 requests in order to get only 100 items returned on each page ... for
 a total of 1500 items.

 This is certainly a good way to increase the server load, since 15
 connections at 100 results each takes far more server resources than 1
 connection returning all 1500 results.  Therefore I'm wondering if I'm
 misunderstanding something here, or if this is really the only way I
 can get the maximum of 1500 items via atom search?
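
For reference, the 15-request paging Chad describes can be sketched like 
this (the parameter names match the Search API's q/rpp/page; the HTTP 
call itself is omitted):

```python
def search_page_params(query, rpp=100, max_results=1500):
    """Return the page/rpp parameter sets needed to cover max_results."""
    pages = max_results // rpp  # 1500 / 100 = 15 requests
    return [{"q": query, "rpp": rpp, "page": page} for page in range(1, pages + 1)]

params = search_page_params("mypromocode")
print(len(params))                             # 15
print(params[0]["page"], params[-1]["page"])   # 1 15
```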



[twitter-dev] Re: How to insure that all tweets are retrieved in a search?

2009-07-09 Thread owkaye

Thanks Chad, that's what I was afraid of.  I wonder if you
know about this next question:

Twitter API docs say search is rate limited to something
more than REST which is 150 requests per hour, but for the
sake of argument let's say the search rate limit is actually
150 hits per hour ...

Since I have to do 15 consecutive searches to make sure I've
retrieved the last 1500 matching items, does this mean I can
only do 10 sets of 15 searches per hour = 150 requests per
hour?

If so, this is only one set of searches every 6 minutes, and
it seems to me that on a trending topic there might be lots
more than 1500 new tweets every 6 minutes.
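
As a quick sanity check on the arithmetic above (assuming, for the sake 
of argument, a 150-requests-per-hour limit):

```python
rate_limit_per_hour = 150   # assumed limit, for the sake of argument
requests_per_sweep = 15     # 15 pages x 100 rpp = 1500 statuses
sweeps_per_hour = rate_limit_per_hour // requests_per_sweep
minutes_between_sweeps = 60 / sweeps_per_hour
print(sweeps_per_hour, minutes_between_sweeps)  # 10 6.0
```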

How can I get around this limit?

I'm not trying to hurt Twitter, but business applications that
require ALL tweets to be recorded cannot deal with these
types of limitations on a practical basis.  If Twitter doesn't
come up with a better way, I can see this hindering its future
revenue streams from businesses like mine that want to build
on a solid and easy-to-use foundation.

So getting back to my question of "What do I do now?" ...

Do I have to put my automated search code on a bunch of
separate servers so the IP's are spread around -- such that
none of them hit the limit of 150 searches per hour?

Seems to me that this is the only realistic way to ensure
that I can always retrieve all the matching results I need
without hitting the API limits ... but if you or others have
a better suggestion please let me know, thanks.


On Jul 9, 5:52 pm, Chad Etzel jazzyc...@gmail.com wrote:
 Yep, you gotta do 15 requests at 100 rpp each.
 -Chad

 On Thu, Jul 9, 2009 at 5:45 PM, owkaye owk...@gmail.com wrote:

  I'm building an app that uses the atom search API to retrieve recent
  posts which contain a specific keyword.  The API docs say:

  "Clients may request up to 1,500 statuses via the page and rpp
  parameters for the search method."

  But this 1500 hits per search cannot be done in a single request
  because of the rpp limit.  Instead I have to perform 15 sequential
  requests in order to get only 100 items returned on each page ... for
  a total of 1500 items.

  This is certainly a good way to increase the server load, since 15
  connections at 100 results each takes far more server resources than 1
  connection returning all 1500 results.  Therefore I'm wondering if I'm
  misunderstanding something here, or if this is really the only way I
  can get the maximum of 1500 items via atom search?


[twitter-dev] Re: How to insure that all tweets are retrieved in a search?

2009-07-09 Thread Scott Haneda


You are correct, you have to do 15 requests.  However, you can cache  
the results on your end, so when you come back, you are only getting  
the new stuff.


Twitter has pretty good date handling, so you specify your last date,  
and pull forward from there.  You may even be able to get the last id  
of the last tweet you pulled, and just tell it to get you all the new  
ones.


On Jul 9, 2009, at 2:45 PM, owkaye wrote:


I'm building an app that uses the atom search API to retrieve recent
posts which contain a specific keyword.  The API docs say:

"Clients may request up to 1,500 statuses via the page and rpp
parameters for the search method."

But this 1500 hits per search cannot be done in a single request
because of the rpp limit.  Instead I have to perform 15 sequential
requests in order to get only 100 items returned on each page ... for
a total of 1500 items.

This is certainly a good way to increase the server load, since 15
connections at 100 results each takes far more server resources than 1
connection returning all 1500 results.  Therefore I'm wondering if I'm
misunderstanding something here, or if this is really the only way I
can get the maximum of 1500 items via atom search?


--
Scott * If you contact me off list replace talklists@ with scott@ *



[twitter-dev] Re: How to insure that all tweets are retrieved in a search?

2009-07-09 Thread owkaye

 You are correct, you have to do 15 requests.  However,
 you can cache the results on your end, so when you come
 back, you are only getting the new stuff.

Thanks Scott.  I'm storing the results in a database on my server but
that doesn't stop the search from retrieving the same results
repetitively, because the search string/terms are still the same.

My problem is going to occur when thousands of people start tweeting
my promo codes every minute and I'm not able to retrieve all those
tweets because of the search API limitations.

If I'm limited to retrieving 1500 tweets every 6 minutes and people
post 1000 tweets every minute I need some way of retrieving the
missing 4500 tweets -- but apparently Twitter doesn't offer anything
even remotely close to this capability -- so I can see where it has a
long way to go before it's ready to support the kind of search
capabilities I need.


 Twitter has pretty good date handling, so you specify
 your last date, and pull forward from there.  You may
 even be able to get the last id of the last tweet you
 pulled, and just tell it to get you all the new ones.

Yep, that's what I'm doing ... pulling from the records I haven't
already retrieved based on the since_id value.

But when the new tweets total more than 1500 in a short time, the
excess tweets will get lost and there's no way to retrieve them --
unless I run my searches from multiple servers to avoid Twitter's ip
address limits -- and doing this would be a real kludge that I'm not
tempted to bother with.
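
The since_id bookkeeping described above can be sketched like this (the 
fetch function is a stand-in for the real search call, so the 
high-water-mark logic can be shown without touching the API):

```python
def poll_new_statuses(fetch, since_id):
    """Fetch statuses newer than since_id; return them plus the new mark.

    `fetch` stands in for the real Search API call: it takes since_id
    and returns a list of status dicts with numeric "id" fields.
    """
    statuses = fetch(since_id)
    if statuses:
        since_id = max(s["id"] for s in statuses)
    return statuses, since_id

# Simulated backend: statuses 1..5 exist; only those after since_id come back.
backend = [{"id": i} for i in range(1, 6)]
fake_fetch = lambda since_id: [s for s in backend if s["id"] > since_id]

first, mark = poll_new_statuses(fake_fetch, 0)
second, mark = poll_new_statuses(fake_fetch, mark)
print(len(first), mark, len(second))  # 5 5 0
```

The catch the paragraph above points out remains: if more statuses 
arrive between polls than one sweep can page through, the gap between 
the old and new marks is simply lost.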


  I'm building an app that uses the atom search API to retrieve recent
  posts which contain a specific keyword.  The API docs say:

  "Clients may request up to 1,500 statuses via the page and rpp
  parameters for the search method."

  But this 1500 hits per search cannot be done in a single request
  because of the rpp limit.  Instead I have to perform 15 sequential
  requests in order to get only 100 items returned on each page ... for
  a total of 1500 items.

  This is certainly a good way to increase the server load, since 15
  connections at 100 results each takes far more server resources than 1
  connection returning all 1500 results.  Therefore I'm wondering if I'm
  misunderstanding something here, or if this is really the only way I
  can get the maximum of 1500 items via atom search?

 --
 Scott * If you contact me off list replace talklists@ with scott@ *


[twitter-dev] Re: How to insure that all tweets are retrieved in a search?

2009-07-09 Thread Scott Haneda



You are correct, you have to do 15 requests.  However,
you can cache the results on your end, so when you come
back, you are only getting the new stuff.


Thanks Scott.  I'm storing the results in a database on my server but
that doesn't stop the search from retrieving the same results
repetitively, because the search string/terms are still the same.

My problem is going to occur when thousands of people start tweeting
my promo codes every minute and I'm not able to retrieve all those
tweets because of the search API limitations.

If I'm limited to retrieving 1500 tweets every 6 minutes and people
post 1000 tweets every minute I need some way of retrieving the
missing 4500 tweets -- but apparently Twitter doesn't offer anything
even remotely close to this capability -- so I can see where it has a
long way to go before it's ready to support the kind of search
capabilities I need.



Have you read this:
http://apiwiki.twitter.com/Rate-limiting

Section on search rate limiting.  I do not believe there is a strict  
rate limit on search -- or if there is, it is a very high one -- so you  
should be OK.  I would also suggest you get whitelisted; that bumps you  
up to 20,000 requests per hour, I believe.


Further, you could have your users authenticate via OAuth, in which  
case the hits count against them, not you.


You will still get stuck with the fact that search is a pretty broad  
thing, and not exacting, but you should be able to complete your  
searches at the volume you want.  I have seen apps that I know are  
using search on their back end; they do not require auth by the  
user, and they are hit pretty heavily.

--
Scott * If you contact me off list replace talklists@ with scott@ *



[twitter-dev] Re: How to insure that all tweets are retrieved in a search?

2009-07-09 Thread David Fisher

Err, the Search API isn't limited to 150 requests per hour. It's much
higher than that. Much, but not unlimited.

As John said, read into the Search API more, and check into the
Streaming API as well.

It is certainly possible to get more than 1,500 results for a term,
but not by using simple paging. I've been able to pull 2M+ results for
a query before. It took 8 hours or so, but it worked. Read up more on
the Search API and you should be able to figure it out.

-David Fisher
http://WebecologyProject.org

On Jul 9, 8:16 pm, John Kalucki jkalu...@gmail.com wrote:
 First, I wouldn't expect that thousands are going to post your promo
 code
 per minute. That doesn't seem realistic.

 Second, in addition to the Search API, which is quite liberal, you can
 use
 the /track method on the Streaming API, which will return all keyword
 matches up to a certain limit with no other rate limiting. Contact us
 if the
 default limits are an issue.

 -John Kalucki
 Services, Twitter Inc.

 On Jul 9, 3:51 pm, owkaye owk...@gmail.com wrote:



   You are correct, you have to do 15 requests.  However,
   you can cache the results in your end, so when you come
   back, you are only getting the new stuff.

  Thanks Scott.  I'm storing the results in a database on my server but
  that doesn't stop the search from retrieving the same results
  repetitively, because the search string/terms are still the same.

  My problem is going to occur when thousands of people start tweeting
  my promo codes every minute and I'm not able to retrieve all those
  tweets because of the search API limitations.

  If I'm limited to retrieving 1500 tweets every 6 minutes and people
  post 1000 tweets every minute I need some way of retrieving the
  missing 4500 tweets -- but apparently Twitter doesn't offer anything
  even remotely close to this capability -- so I can see where it has a
  long way to go before it's ready to support the kind of search
  capabilities I need.

   Twitter has pretty good date handling, so you specify
   your last date, and pull forward from there.  You may
   even be able to get the last id of the last tweet you
   pulled, and just tell it to get you all the new ones.

  Yep, that's what I'm doing ... pulling from the records I haven't
  already retrieved based on the since_id value.

  But when the new tweets total more than 1500 in a short time, the
  excess tweets will get lost and there's no way to retrieve them --
  unless I run my searches from multiple servers to avoid Twitter's ip
  address limits -- and doing this would be a real kludge that I'm not
  tempted to bother with.

I'm building an app that uses the atom search API to retrieve recent
posts which contain a specific keyword.  The API docs say:

"Clients may request up to 1,500 statuses via the page and rpp
parameters for the search method."

But this 1500 hits per search cannot be done in a single request
because of the rpp limit.  Instead I have to perform 15 sequential
requests in order to get only 100 items returned on each page ... for
a total of 1500 items.

This is certainly a good way to increase the server load, since 15
connections at 100 results each takes far more server resources than 1
connection returning all 1500 results.  Therefore I'm wondering if I'm
misunderstanding something here, or if this is really the only way I
can get the maximum of 1500 items via atom search?

   --
   Scott * If you contact me off list replace talklists@ with scott@ *