[twitter-dev] Re: Search result pagination bugs

2009-04-16 Thread stevenic

Thanks for the reply Matt...

Just as an FYI...

I updated my code to track duplicates and then did a sample run over a
5-minute period that, once a minute, paged in new results for the query
"http filter:links".  This resulted in about 11 pages of results each
minute, and over those 11 pages I saw anywhere from 60 - 150 duplicates,
so it's not just 3 or 4.  My concern isn't really the extra updates;
it's the fact that sometimes updates are missing.
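
For the curious, the tracking harness boils down to something like this
(a simplified sketch, not my actual code; it assumes the public
search.atom endpoint and the default page size):

    import time
    import urllib.parse
    import urllib.request
    import xml.etree.ElementTree as ET

    ATOM_NS = "{http://www.w3.org/2005/Atom}"
    SEARCH_URL = "http://search.twitter.com/search.atom"

    def fetch_ids(query, page):
        # Fetch one page of Atom results and return the status IDs.
        qs = urllib.parse.urlencode({"q": query, "page": page})
        with urllib.request.urlopen(SEARCH_URL + "?" + qs) as resp:
            root = ET.parse(resp).getroot()
        # Each entry <id> looks like "tag:search.twitter.com,2005:1530963910".
        return [e.find(ATOM_NS + "id").text.rsplit(":", 1)[1]
                for e in root.findall(ATOM_NS + "entry")]

    seen = set()
    for minute in range(5):            # five one-minute samples
        dupes = 0
        for page in range(1, 12):      # roughly 11 pages per run
            for status_id in fetch_ids("http filter:links", page):
                if status_id in seen:
                    dupes += 1
                seen.add(status_id)
        print("minute %d: %d duplicates" % (minute + 1, dupes))
        time.sleep(60)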

Anyway... It sounds like you guys are working on it and I just thought
I'd share that data point with you.

-steve


[twitter-dev] Re: Search result pagination bugs

2009-04-16 Thread Chad Etzel

The query "http filter:links" (which is a bit redundant) is such a
high-volume query that I doubt the search servers would ever be able
to stay in sync even when things were running up to speed.

Try a less-trafficked query like "twitter".

-Chad

On Thu, Apr 16, 2009 at 6:55 PM, stevenic ick...@gmail.com wrote:

 [quoted text snipped]



[twitter-dev] Re: Search result pagination bugs

2009-04-16 Thread stevenic

So my project is a sort of tweetmeme- or twitturly-type thing where I'm
looking to collect a sample of the links being shared through
Twitter.  Unlike those projects I don't have a firehose, so I have to
rely on search.  Fortunately, I don't really need to see every link for
my project, just a representative sample.

The actual query I'm using is "http OR www filter:links", where the
filter:links constraint helps make sure I exclude tweets like "can't
get http GET to work".  I don't really care about those.

Agreed that this is a high-volume query, so maybe it'll never
be in sync, but that's ok... Now I'm just ignoring the dupes.  And to
be clear, I have no intention of trying to keep up and use search as a
poor man's firehose.  Whatever rate you guys are comfortable with me
hitting you at is what I'll do.  If that's one request/minute, so be
it.  I just wanted to get the pagination working so that I could better
control things, and that's when I noticed the dupes.

-steve
(Microsoft Research)


[twitter-dev] Re: Search result pagination bugs

2009-04-16 Thread Chad Etzel

I can't speak for Twitter on the "permission to do that" side, but
that technique will work just fine, so you should be good to go
technically.
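
Something like this is roughly what I'd picture (untested sketch; the
trailing ID below is hypothetical -- in practice you'd save one from a
poll 30 - 60 minutes back):

    import urllib.parse
    import urllib.request
    import xml.etree.ElementTree as ET

    ATOM_NS = "{http://www.w3.org/2005/Atom}"
    SEARCH_URL = "http://search.twitter.com/search.atom"

    def fetch_page(query, max_id, page):
        # max_id pins the newest status ID for the whole run, so later
        # pages can't shift underneath you while the index grows.
        qs = urllib.parse.urlencode(
            {"q": query, "max_id": max_id, "page": page})
        with urllib.request.urlopen(SEARCH_URL + "?" + qs) as resp:
            return ET.parse(resp).getroot().findall(ATOM_NS + "entry")

    trailing_max_id = 1530963910       # hypothetical ID saved 30-60 min ago
    page = 1
    while True:
        entries = fetch_page("http OR www filter:links",
                             trailing_max_id, page)
        if not entries:
            break                      # ran off the end of the snapshot
        # ... process entries ...
        page += 1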
-chad

On Thu, Apr 16, 2009 at 9:34 PM, stevenic ick...@gmail.com wrote:

 Matt...  Another thought I just had...

 As Chad points out, with my particular query being high volume, it's
 realistic to think that I'm always going to risk seeing duplicates if
 I try to query for results in real time, due to replication lag between
 your servers.  But I see how you're using max_id in the paging stuff, and
 I don't really need real-time results, so it seems like I should be
 able to use an ID that's 30 - 60 minutes old and do all of my queries
 using max_id instead of since_id.  In theory this would have me
 trailing the edge of new results coming into the index by 30 - 60
 minutes, but it would give the servers more time to replicate, so it
 seems like there'd be less of a chance I'd encounter dupes or missing
 entries.

 If that approach would work (and you would know) I'd just want to make
 sure you'd be ok with me using max_id instead of since_id, given that
 max_id isn't documented.

 -steve

 On Apr 16, 7:58 am, Matt Sanford m...@twitter.com wrote:
 Hi all,

     There was a problem yesterday with several of the search back-ends
 falling behind. This meant that if your page=1 and page=2 queries hit
 different hosts they could return results that don't line up. If your
 page=2 query hit a host with more lag you would miss results, and if
 it hit a host that was more up-to-date you would see duplicates. We're
 working on fixing these issues and trying to find a way to prevent
 incorrect pagination in the future. Sorry for the delay in replying,
 but I was focusing all of my attention on fixing the issue and had to
 let email wait.

 Thanks;
    — Matt Sanford / @mzsanford

 On Apr 15, 2009, at 09:29 PM, stevenic wrote:





  [quoted text snipped]



[twitter-dev] Re: Search result pagination bugs

2009-04-15 Thread Chad Etzel

It would be helpful if you could give some example output/results
where you are seeing duplicates across pages.  I have spent a long
long time with the Search API and haven't ever had this problem (or
maybe I have and never noticed it).

-Chad

On Wed, Apr 15, 2009 at 9:07 PM, steve ick...@gmail.com wrote:

 I've been using the Search API in a project and it's been working very
 reliably.  So today I decided to add support for pagination so I could
 pull in more results, and I think I've identified a couple of bugs in
 the pagination code.

 Bug 1)

 The first few results of Page 2 for a query are sometimes duplicates.
 To verify this, do the following (the sketch after these steps shows
 the same check in code):

   1. Execute the query:
 http://search.twitter.com/search.atom?lang=en&q=http&rpp=100
   2. Grab the next link from the results and execute that.
   3. Compare the IDs at the end of set one with the IDs at the
 beginning of set 2.  They sometimes overlap.
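
 In script form, the check boils down to this (a rough sketch of what my
 test code does; same endpoint and params as the steps above):

     import urllib.request
     import xml.etree.ElementTree as ET

     ATOM_NS = "{http://www.w3.org/2005/Atom}"

     def ids_and_next(url):
         # Return (set of entry IDs, href of the rel="next" link).
         with urllib.request.urlopen(url) as resp:
             root = ET.parse(resp).getroot()
         ids = {e.find(ATOM_NS + "id").text
                for e in root.findall(ATOM_NS + "entry")}
         next_href = None
         for link in root.findall(ATOM_NS + "link"):
             if link.get("rel") == "next":
                 next_href = link.get("href")
         return ids, next_href

     page1, next_url = ids_and_next(
         "http://search.twitter.com/search.atom?lang=en&q=http&rpp=100")
     page2, _ = ids_and_next(next_url)
     print("overlap:", sorted(page1 & page2))  # should be empty; sometimes isn't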


 Bug 2)

 The second bug may be the cause of the first bug.  The link you get for
 next in a result set is missing the lang=en query param, so you
 end up getting non-English items in your result set.  You can manually
 add the lang=en param back to your query, and while you still get dupes,
 you get fewer.  If you do this, though, you then start getting a warning
 in the result set about an adjusted since_id.

 What's scarier, though, is that the result set seemed to get weird on me
 if I added the lang param and requested pages too fast.  By that I
 mean I would sometimes get results for Page 2 that were (time-wise)
 hours before my original since_id, so my code would just stop
 requesting pages since it assumed it had reached the end of the set.
 The scary part... Adding around a 2-second sleep between queries
 seemed to make this issue go away... (see the sketch below)
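
 The workaround, roughly (sketch only; it just splices the param back
 into the next href and sleeps between pages):

     import time
     import urllib.parse
     import urllib.request

     def follow_next(next_href):
         # The rel="next" href drops lang=en, so splice it back in
         # before requesting, and pause so pages aren't fetched too fast.
         parts = urllib.parse.urlparse(next_href)
         params = urllib.parse.parse_qs(parts.query)
         params.setdefault("lang", ["en"])   # restore the dropped param
         fixed = parts._replace(
             query=urllib.parse.urlencode(params, doseq=True))
         time.sleep(2)                       # the 2-second sleep workaround
         return urllib.request.urlopen(urllib.parse.urlunparse(fixed))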


 In general, the pagination stuff with the next link doesn't seem very
 reliable to me.  You do seem to get fewer dupes than just calling
 search and incrementing the page number.  But I'm still seeing dupes,
 results for the wrong language, and sometimes totally weird results.

 -steve



[twitter-dev] Re: Search result pagination bugs

2009-04-15 Thread stevenic

Sure...  It repros for me every time in IE using the steps I outlined
above.  Do a query for lang=en&q=http.  Open the next link in a
new tab of your browser and compare the IDs.

So I just did this from my home PC, and here's the condensed output.
Notice that on Page 2 not only do I get 3 dupes, but I even get a
result that should have been on Page 1... I hadn't seen that one
before, but I'll assume that maybe a different server serviced each
request and they're not synced.


*
for: http://search.twitter.com/search.atom?lang=en&q=http

<feed xmlns:google="http://base.google.com/ns/1.0" xml:lang="en-US"
  xmlns:openSearch="http://a9.com/-/spec/opensearch/1.1/"
  xmlns="http://www.w3.org/2005/Atom" xmlns:twitter="http://api.twitter.com/">
  <link type="application/atom+xml" rel="self"
    href="http://search.twitter.com/search.atom?lang=en&amp;q=http" />
  <twitter:warning>adjusted since_id, it was older than allowed</twitter:warning>
  <updated>2009-04-16T03:25:30Z</updated>
  <openSearch:itemsPerPage>15</openSearch:itemsPerPage>
  <openSearch:language>en</openSearch:language>
  <link type="application/atom+xml" rel="next"
    href="http://search.twitter.com/search.atom?max_id=1530963910&amp;page=2&amp;q=http" />

  ...removed...

<entry>
  <id>tag:search.twitter.com,2005:1530963910</id>
  <published>2009-04-16T03:25:30Z</published>

  ...removed...

</entry>
<entry>
  <id>tag:search.twitter.com,2005:1530963896</id>
</entry>
  <id>tag:search.twitter.com,2005:1530963895</id>
  <id>tag:search.twitter.com,2005:1530963894</id>
  <id>tag:search.twitter.com,2005:1530963881</id>
  <id>tag:search.twitter.com,2005:1530963865</id>
  <id>tag:search.twitter.com,2005:1530963860</id>
  <id>tag:search.twitter.com,2005:1530963834</id>
  <id>tag:search.twitter.com,2005:1530963833</id>
  <id>tag:search.twitter.com,2005:1530963829</id>
  <id>tag:search.twitter.com,2005:1530963827</id>
  <id>tag:search.twitter.com,2005:1530963812</id>
  <id>tag:search.twitter.com,2005:1530963811</id>
  <id>tag:search.twitter.com,2005:1530963796</id>
  <id>tag:search.twitter.com,2005:1530963786</id>
</feed>


*
for:  http://search.twitter.com/search.atom?max_id=1530963910&page=2&q=http

<feed xmlns:google="http://base.google.com/ns/1.0" xml:lang="en-US"
  xmlns:openSearch="http://a9.com/-/spec/opensearch/1.1/"
  xmlns="http://www.w3.org/2005/Atom" xmlns:twitter="http://api.twitter.com/">
  <link type="application/atom+xml" rel="self"
    href="http://search.twitter.com/search.atom?max_id=1530963910&amp;page=2&amp;q=http" />
  <updated>2009-04-16T03:25:31Z</updated>
  <openSearch:itemsPerPage>15</openSearch:itemsPerPage>
  <openSearch:language>en</openSearch:language>
  <link type="application/atom+xml" rel="previous"
    href="http://search.twitter.com/search.atom?max_id=1530963910&amp;page=1&amp;q=http" />
  <link type="application/atom+xml" rel="next"
    href="http://search.twitter.com/search.atom?max_id=1530963910&amp;page=3&amp;q=http" />

   ...Removed...

<entry>
  <id>tag:search.twitter.com,2005:1530963811</id>
  <published>2009-04-16T03:25:31Z</published>

   ...Duplicate 1...

</entry>
<entry>
  <id>tag:search.twitter.com,2005:1530963803</id>
  <published>2009-04-16T03:25:29Z</published>
  <twitter:lang>en</twitter:lang>

   ...Not Even In Previous Page...

</entry>
<entry>
  <id>tag:search.twitter.com,2005:1530963796</id>
  <published>2009-04-16T03:25:29Z</published>

   ...Duplicate 2...

</entry>
<entry>
  <id>tag:search.twitter.com,2005:1530963786</id>
  <published>2009-04-16T03:25:31Z</published>

   ...Duplicate 3...

</entry>
<entry>
  <id>tag:search.twitter.com,2005:1530963777</id>

   ...First New Result (save the one above)...

</entry>
  <id>tag:search.twitter.com,2005:1530963755</id>
  <id>tag:search.twitter.com,2005:1530963732</id>
  <id>tag:search.twitter.com,2005:1530963725</id>
  <id>tag:search.twitter.com,2005:1530963718</id>
  <id>tag:search.twitter.com,2005:1530963710</id>
  <id>tag:search.twitter.com,2005:1530963709</id>
  <id>tag:search.twitter.com,2005:1530963706</id>
  <id>tag:search.twitter.com,2005:1530963699</id>
  <id>tag:search.twitter.com,2005:1530963698</id>
  <id>tag:search.twitter.com,2005:1530963690</id>
</feed>


[twitter-dev] Re: Search result pagination bugs

2009-04-15 Thread stevenic

Ok... So I think I know what's going on.  Well, I don't know what's
causing the bug, obviously, but I think I've narrowed down where it
is...

I just issued the Page 1 or previous query for the above example, and
the IDs don't match the IDs from the original query.  There are
extra rows that come back... 3, to be exact.  So the pagination queries
are working fine; it's the initial query that's busted.  It looks
like when you do a pagination query you get back all rows
matching the filter, but a query without max_id sometimes drops rows.
Well, in my case it seems to drop rows every time... This should get
fixed...
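
The comparison itself is easy to script (rough sketch; the max_id value
comes from the next link in the original response):

    import urllib.request
    import xml.etree.ElementTree as ET

    ATOM_NS = "{http://www.w3.org/2005/Atom}"

    def entry_ids(url):
        # Collect the numeric status IDs from one page of results.
        with urllib.request.urlopen(url) as resp:
            root = ET.parse(resp).getroot()
        return {e.find(ATOM_NS + "id").text.rsplit(":", 1)[1]
                for e in root.findall(ATOM_NS + "entry")}

    live = entry_ids(
        "http://search.twitter.com/search.atom?lang=en&q=http")
    pinned = entry_ids(
        "http://search.twitter.com/search.atom?max_id=1530963910&page=1&q=http")
    print("rows the initial query dropped:",
          sorted(pinned - live, reverse=True))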


*
for:  http://search.twitter.com/search.atom?max_id=1530963910&page=1&q=http

<feed xmlns:google="http://base.google.com/ns/1.0" xml:lang="en-US"
  xmlns:openSearch="http://a9.com/-/spec/opensearch/1.1/"
  xmlns="http://www.w3.org/2005/Atom" xmlns:twitter="http://api.twitter.com/">
  <link type="application/atom+xml" rel="self"
    href="http://search.twitter.com/search.atom?max_id=1530963910&amp;page=1&amp;q=http" />
  <twitter:warning>adjusted since_id, it was older than allowed</twitter:warning>
  <updated>2009-04-16T03:25:30Z</updated>
  <openSearch:itemsPerPage>15</openSearch:itemsPerPage>
  <openSearch:language>en</openSearch:language>
  <link type="application/atom+xml" rel="next"
    href="http://search.twitter.com/search.atom?max_id=1530963910&amp;page=2&amp;q=http" />

   ...Removed...

<entry>
  <id>tag:search.twitter.com,2005:1530963910</id>
  <published>2009-04-16T03:25:30Z</published>
</entry>
<entry>
  <id>tag:search.twitter.com,2005:1530963908</id>
  <published>2009-04-16T03:25:32Z</published>

  ...Where Did This Come From?...

</entry>
<entry>
  <id>tag:search.twitter.com,2005:1530963898</id>
  <published>2009-04-16T03:25:30Z</published>

  ...And This?...

</entry>
  <id>tag:search.twitter.com,2005:1530963896</id>
  <id>tag:search.twitter.com,2005:1530963895</id>
  <id>tag:search.twitter.com,2005:1530963894</id>
<entry>
  <id>tag:search.twitter.com,2005:1530963892</id>
  <published>2009-04-16T03:25:32Z</published>

  ...And This?...

</entry>
  <id>tag:search.twitter.com,2005:1530963881</id>
  <id>tag:search.twitter.com,2005:1530963865</id>
  <id>tag:search.twitter.com,2005:1530963860</id>
  <id>tag:search.twitter.com,2005:1530963834</id>
  <id>tag:search.twitter.com,2005:1530963833</id>
  <id>tag:search.twitter.com,2005:1530963829</id>
  <id>tag:search.twitter.com,2005:1530963827</id>
  <id>tag:search.twitter.com,2005:1530963812</id>
</feed>