[twitter-dev] Re: Search result pagination bugs
Thanks for the reply, Matt... Just as an FYI: I updated my code to track duplicates and then did a sample run over a 5-minute period that, once a minute, paged in new results for the query "http filter:links". This resulted in about 11 pages of results each minute, and over those 11 pages I saw anywhere from 60 to 150 duplicates, so it's not just 3 or 4. My concern isn't really the extra updates; it's the fact that sometimes updates are missing. Anyway... It sounds like you guys are working on it, and I just thought I'd share that data point with you. -steve
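For anyone wanting to reproduce that kind of count, here is a minimal sketch of duplicate tracking across pages. It is not steve's actual code; the class and method names are made up for illustration, and it assumes result IDs have already been parsed out of each page as integers.

```python
from typing import Iterable, List, Set


class DuplicateTracker:
    """Remembers every result ID seen so far and counts repeats (hypothetical helper)."""

    def __init__(self) -> None:
        self.seen: Set[int] = set()
        self.duplicates = 0

    def filter_new(self, ids: Iterable[int]) -> List[int]:
        """Return only the IDs not seen before; count the rest as duplicates."""
        fresh: List[int] = []
        for result_id in ids:
            if result_id in self.seen:
                self.duplicates += 1
            else:
                self.seen.add(result_id)
                fresh.append(result_id)
        return fresh
```

Feeding every page of every run through one tracker gives both the de-duplicated stream and the duplicate count. It cannot detect missing updates, which is the harder problem raised above.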
[twitter-dev] Re: Search result pagination bugs
The query "http filter:links" (which is a bit redundant) is such a high-volume query that I doubt the search servers would ever be able to keep in sync, even when things were running up to speed. Try a less trafficked query like "twitter". -Chad

On Thu, Apr 16, 2009 at 6:55 PM, stevenic ick...@gmail.com wrote:

Thanks for the reply Matt... Just as an FYI... I updated my code to track duplicates and then did a sample run over a 5-minute period that, once a minute, paged in new results for the query "http filter:links". This resulted in about 11 pages of results each minute, and over those 11 pages I saw anywhere from 60 to 150 duplicates, so it's not just 3 or 4. My concern isn't really the extra updates; it's the fact that sometimes updates are missing. Anyway... It sounds like you guys are working on it, and I just thought I'd share that data point with you. -steve
[twitter-dev] Re: Search result pagination bugs
So my project is a sort of tweetmeme or twitturly type thing where I'm looking to collect a sample of the links being shared through Twitter. Unlike those projects, I don't have a firehose, so I have to rely on search. Fortunately, I don't really need to see every link for my project, just a representative sample. The actual query I'm using is "http OR www filter:links", where the filter:links constraint helps make sure I exclude tweets like "can't get http GET to work". I don't really care about those.

Agreed that this is a high-volume query, so maybe it'll never be in sync, but that's ok... For now I'm just ignoring the dupes. And to be clear, I have no intention of trying to keep up and use search as a poor man's firehose. Whatever rate you guys are comfortable with me hitting you at is what I'll do. If that's one request/minute, so be it. I just wanted to get the pagination working so that I could better control things, and that's when I noticed the dupes. -steve (Microsoft Research)
[twitter-dev] Re: Search result pagination bugs
I can't speak for Twitter on whether you have permission to do that, but the technique will work just fine, so you should be good to go technically. -chad

On Thu, Apr 16, 2009 at 9:34 PM, stevenic ick...@gmail.com wrote:

Matt... Another thought I just had... As Chad points out, with my particular query being high volume, it's realistic to think that I'm always going to risk seeing duplicates if I query for results in real time, due to replication lag between your servers. But I see how you're using max_id in the paging stuff, and I don't really need real-time results, so it seems like I should be able to use an ID that's 30 - 60 minutes old and do all of my queries using max_id instead of since_id. In theory this would have me trailing the edge of new results coming into the index by 30 - 60 minutes, but it would give the servers more time to replicate, so it seems like there'd be less chance of my encountering dupes or missing entries. If that approach would work (and you would know), I'd just want to make sure you'd be ok with me using max_id instead of since_id, given that max_id isn't documented. -steve

On Apr 16, 7:58 am, Matt Sanford m...@twitter.com wrote:

Hi all, There was a problem yesterday with several of the search back-ends falling behind. This meant that if your page=1 and page=2 queries hit different hosts they could return results that don't line up. If your page=2 query hit a host with more lag you would miss results, and if it hit a host that was more up-to-date you would see duplicates. We're working on fixing this issue and trying to find a way to prevent incorrect pagination in the future. Sorry for the delay in replying, but I was focusing all of my attention on fixing the issue and had to let email wait. Thanks; -- Matt Sanford / @mzsanford

On Apr 15, 2009, at 09:29 PM, stevenic wrote:

Ok... So I think I know what's going on. Well, I don't know what's causing the bug, obviously, but I think I've narrowed down where it is... I just issued the Page 1 or "previous" query for the above example, and the IDs don't match the IDs from the original query. There are extra rows that come back... 3 to be exact. So the pagination queries are working fine. It's the initial query that's busted. It looks like when you do a pagination query you get back all rows matching the filter, but a query without max_id sometimes drops rows. Well, in my case it seems to drop rows every time... This should get fixed...
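The trailing-edge idea quoted above (pin every request to a max_id that is 30 - 60 minutes old and walk the pages from there) could be sketched roughly as follows. This is only an illustration: the function name is hypothetical, the parameter names (max_id, page, q, rpp) are the ones that appear in URLs elsewhere in this thread, max_id was undocumented at the time, and how the caller obtains a suitably old ID is left open.

```python
from urllib.parse import urlencode

# Endpoint as it appears in the thread (the 2009 Search API).
SEARCH_URL = "http://search.twitter.com/search.atom"


def trailing_page_url(query: str, max_id: int, page: int = 1, rpp: int = 100) -> str:
    """Build a search URL pinned to max_id, so all pages share one upper bound.

    max_id is assumed to be the ID of a tweet the caller saw 30 - 60 minutes
    ago; choosing that ID is outside the scope of this sketch.
    """
    params = [("max_id", max_id), ("page", page), ("q", query), ("rpp", rpp)]
    return SEARCH_URL + "?" + urlencode(params)
```

Because every page is anchored to the same max_id, new tweets arriving between requests cannot shift the page boundaries, which is the property the approach relies on.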
[twitter-dev] Re: Search result pagination bugs
It would be helpful if you could give some example output/results where you are seeing duplicates across pages. I have spent a long, long time with the Search API and haven't ever had this problem (or maybe I have and never noticed it). -Chad

On Wed, Apr 15, 2009 at 9:07 PM, steve ick...@gmail.com wrote:

I've been using the Search API in a project and it's been working very reliably. So today I decided to add support for pagination so I could pull in more results, and I think I've identified a couple of bugs in the pagination code.

Bug 1) The first few results of Page 2 for a query are sometimes duplicates. To verify this, do the following:
1. Execute the query: http://search.twitter.com/search.atom?lang=en&q=http&rpp=100
2. Grab the "next" link from the results and execute that.
3. Compare the IDs at the end of set 1 with the IDs at the beginning of set 2. They sometimes overlap.

Bug 2) The second bug may be the cause of the first. The link you get for "next" in a result set is missing the lang=en query param, so you end up getting non-English items in your result set. You can manually add the lang=en param to your query, and while you still get dupes, you get fewer. If you do this, though, you start getting a warning in the result set about an adjusted since_id.

What's scarier, though, is that the result set seemed to get weird on me if I added the lang param and requested pages too fast. By that I mean I would sometimes get results for Page 2 that were (time-wise) hours before my original since_id, so my code would just stop requesting pages, since it assumed it had reached the end of the set. The scary part... Adding a sleep of around 2 seconds between queries seemed to make this issue go away...

In general, the pagination stuff with the "next" link doesn't seem very reliable to me. You do seem to get fewer dupes than by just calling search and incrementing the page number. But I'm still seeing dupes, results for the wrong language, and sometimes totally weird results. -steve
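The workaround described under Bug 2 (manually putting lang=en back on the "next" link) might look something like this. It is a sketch, not code from the thread: with_param is a made-up helper built on the standard urllib.parse module, and per the report above, re-adding lang can trigger the adjusted since_id warning.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_param(url: str, name: str, value: str) -> str:
    """Return url with name=value appended, unless name is already present."""
    parts = urlsplit(url)
    params = parse_qsl(parts.query, keep_blank_values=True)
    if all(key != name for key, _ in params):
        params.append((name, value))
    return urlunsplit(parts._replace(query=urlencode(params)))
```

Applied to the "next" link before each request, this restores the language filter; a pause between requests (about 2 seconds seemed to help, per the report above) is a separate concern.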
[twitter-dev] Re: Search result pagination bugs
Sure... It repros for me every time in IE using the steps I outlined above. Do a query for lang=en&q=http. Open the "next" link in a new tab of your browser and compare the IDs. So I just did this from my home PC and here's the condensed output. Notice that on Page 2 not only do I get 3 dupes but I even get a result that should have been on Page 1... I hadn't seen that one before, but I'll assume that maybe a different server serviced each request and they're not synced.

* for: http://search.twitter.com/search.atom?lang=en&q=http

<feed xmlns:google="http://base.google.com/ns/1.0" xml:lang="en-US" xmlns:openSearch="http://a9.com/-/spec/opensearch/1.1/" xmlns="http://www.w3.org/2005/Atom" xmlns:twitter="http://api.twitter.com/">
  <link type="application/atom+xml" rel="self" href="http://search.twitter.com/search.atom?lang=en&amp;q=http"/>
  <twitter:warning>adjusted since_id, it was older than allowed</twitter:warning>
  <updated>2009-04-16T03:25:30Z</updated>
  <openSearch:itemsPerPage>15</openSearch:itemsPerPage>
  <openSearch:language>en</openSearch:language>
  <link type="application/atom+xml" rel="next" href="http://search.twitter.com/search.atom?max_id=1530963910&amp;page=2&amp;q=http"/>
  ...removed...
  <entry>
    <id>tag:search.twitter.com,2005:1530963910</id>
    <published>2009-04-16T03:25:30Z</published>
    ...removed...
  </entry>
  <entry>
    <id>tag:search.twitter.com,2005:1530963896</id>
  </entry>
  <id>tag:search.twitter.com,2005:1530963895</id>
  <id>tag:search.twitter.com,2005:1530963894</id>
  <id>tag:search.twitter.com,2005:1530963881</id>
  <id>tag:search.twitter.com,2005:1530963865</id>
  <id>tag:search.twitter.com,2005:1530963860</id>
  <id>tag:search.twitter.com,2005:1530963834</id>
  <id>tag:search.twitter.com,2005:1530963833</id>
  <id>tag:search.twitter.com,2005:1530963829</id>
  <id>tag:search.twitter.com,2005:1530963827</id>
  <id>tag:search.twitter.com,2005:1530963812</id>
  <id>tag:search.twitter.com,2005:1530963811</id>
  <id>tag:search.twitter.com,2005:1530963796</id>
  <id>tag:search.twitter.com,2005:1530963786</id>
</feed>

* for: http://search.twitter.com/search.atom?max_id=1530963910&page=2&q=http

<feed xmlns:google="http://base.google.com/ns/1.0" xml:lang="en-US" xmlns:openSearch="http://a9.com/-/spec/opensearch/1.1/" xmlns="http://www.w3.org/2005/Atom" xmlns:twitter="http://api.twitter.com/">
  <link type="application/atom+xml" rel="self" href="http://search.twitter.com/search.atom?max_id=1530963910&amp;page=2&amp;q=http"/>
  <updated>2009-04-16T03:25:31Z</updated>
  <openSearch:itemsPerPage>15</openSearch:itemsPerPage>
  <openSearch:language>en</openSearch:language>
  <link type="application/atom+xml" rel="previous" href="http://search.twitter.com/search.atom?max_id=1530963910&amp;page=1&amp;q=http"/>
  <link type="application/atom+xml" rel="next" href="http://search.twitter.com/search.atom?max_id=1530963910&amp;page=3&amp;q=http"/>
  ...Removed...
  <entry>
    <id>tag:search.twitter.com,2005:1530963811</id>
    <published>2009-04-16T03:25:31Z</published>
    ...Duplicate 1...
  </entry>
  <entry>
    <id>tag:search.twitter.com,2005:1530963803</id>
    <published>2009-04-16T03:25:29Z</published>
    <twitter:lang>en</twitter:lang>
    ...Not Even In Previous Page...
  </entry>
  <entry>
    <id>tag:search.twitter.com,2005:1530963796</id>
    <published>2009-04-16T03:25:29Z</published>
    ...Duplicate 2...
  </entry>
  <entry>
    <id>tag:search.twitter.com,2005:1530963786</id>
    <published>2009-04-16T03:25:31Z</published>
    ...Duplicate 3...
  </entry>
  <entry>
    <id>tag:search.twitter.com,2005:1530963777</id>
    ...First New Result (save the one above)...
  </entry>
  <id>tag:search.twitter.com,2005:1530963755</id>
  <id>tag:search.twitter.com,2005:1530963732</id>
  <id>tag:search.twitter.com,2005:1530963725</id>
  <id>tag:search.twitter.com,2005:1530963718</id>
  <id>tag:search.twitter.com,2005:1530963710</id>
  <id>tag:search.twitter.com,2005:1530963709</id>
  <id>tag:search.twitter.com,2005:1530963706</id>
  <id>tag:search.twitter.com,2005:1530963699</id>
  <id>tag:search.twitter.com,2005:1530963698</id>
  <id>tag:search.twitter.com,2005:1530963690</id>
</feed>
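The comparison described in this message can be checked mechanically. The sketch below uses the IDs exactly as listed in the two condensed feeds above; compare_pages is a hypothetical helper, and it assumes (as the output suggests) that results are ordered newest-first, so an unseen ID on page 2 that is higher than the lowest ID on page 1 counts as out of place.

```python
from typing import List, Tuple

# IDs copied from the condensed page 1 and page 2 output in this message.
PAGE_1 = [1530963910, 1530963896, 1530963895, 1530963894, 1530963881,
          1530963865, 1530963860, 1530963834, 1530963833, 1530963829,
          1530963827, 1530963812, 1530963811, 1530963796, 1530963786]
PAGE_2 = [1530963811, 1530963803, 1530963796, 1530963786, 1530963777,
          1530963755, 1530963732, 1530963725, 1530963718, 1530963710,
          1530963709, 1530963706, 1530963699, 1530963698, 1530963690]


def compare_pages(first: List[int], second: List[int]) -> Tuple[List[int], List[int]]:
    """Return (duplicates, misplaced) for two consecutive newest-first pages."""
    first_ids = set(first)
    duplicates = [i for i in second if i in first_ids]
    oldest_on_first = min(first)
    # An ID newer than the tail of page 1 that page 1 never showed.
    misplaced = [i for i in second if i > oldest_on_first and i not in first_ids]
    return duplicates, misplaced


duplicates, misplaced = compare_pages(PAGE_1, PAGE_2)
print(duplicates)  # the three IDs marked "Duplicate 1..3" above
print(misplaced)   # the ID marked "Not Even In Previous Page"
```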
[twitter-dev] Re: Search result pagination bugs
Ok... So I think I know what's going on. Well, I don't know what's causing the bug, obviously, but I think I've narrowed down where it is... I just issued the Page 1 or "previous" query for the above example, and the IDs don't match the IDs from the original query. There are extra rows that come back... 3 to be exact. So the pagination queries are working fine. It's the initial query that's busted. It looks like when you do a pagination query you get back all rows matching the filter, but a query without max_id sometimes drops rows. Well, in my case it seems to drop rows every time... This should get fixed...

* for: http://search.twitter.com/search.atom?max_id=1530963910&page=1&q=http

<feed xmlns:google="http://base.google.com/ns/1.0" xml:lang="en-US" xmlns:openSearch="http://a9.com/-/spec/opensearch/1.1/" xmlns="http://www.w3.org/2005/Atom" xmlns:twitter="http://api.twitter.com/">
  <link type="application/atom+xml" rel="self" href="http://search.twitter.com/search.atom?max_id=1530963910&amp;page=1&amp;q=http"/>
  <twitter:warning>adjusted since_id, it was older than allowed</twitter:warning>
  <updated>2009-04-16T03:25:30Z</updated>
  <openSearch:itemsPerPage>15</openSearch:itemsPerPage>
  <openSearch:language>en</openSearch:language>
  <link type="application/atom+xml" rel="next" href="http://search.twitter.com/search.atom?max_id=1530963910&amp;page=2&amp;q=http"/>
  ...Removed...
  <entry>
    <id>tag:search.twitter.com,2005:1530963910</id>
    <published>2009-04-16T03:25:30Z</published>
  </entry>
  <entry>
    <id>tag:search.twitter.com,2005:1530963908</id>
    <published>2009-04-16T03:25:32Z</published>
    ...Where Did This Come From?...
  </entry>
  <entry>
    <id>tag:search.twitter.com,2005:1530963898</id>
    <published>2009-04-16T03:25:30Z</published>
    ...And This?...
  </entry>
  <id>tag:search.twitter.com,2005:1530963896</id>
  <id>tag:search.twitter.com,2005:1530963895</id>
  <id>tag:search.twitter.com,2005:1530963894</id>
  <entry>
    <id>tag:search.twitter.com,2005:1530963892</id>
    <published>2009-04-16T03:25:32Z</published>
    ...And This?...
  </entry>
  <id>tag:search.twitter.com,2005:1530963881</id>
  <id>tag:search.twitter.com,2005:1530963865</id>
  <id>tag:search.twitter.com,2005:1530963860</id>
  <id>tag:search.twitter.com,2005:1530963834</id>
  <id>tag:search.twitter.com,2005:1530963833</id>
  <id>tag:search.twitter.com,2005:1530963829</id>
  <id>tag:search.twitter.com,2005:1530963827</id>
  <id>tag:search.twitter.com,2005:1530963812</id>
</feed>
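The "3 extra rows" observation can be verified with a set difference between the two page-1 listings posted in this thread: the initial query (no max_id) and the max_id page=1 query above. The variable names are illustrative only; the IDs are copied from the posted output.

```python
# Page 1 as returned by the initial query (lang=en&q=http, no max_id).
INITIAL_PAGE_1 = [1530963910, 1530963896, 1530963895, 1530963894, 1530963881,
                  1530963865, 1530963860, 1530963834, 1530963833, 1530963829,
                  1530963827, 1530963812, 1530963811, 1530963796, 1530963786]
# Page 1 as returned by the max_id=1530963910&page=1 query.
ANCHORED_PAGE_1 = [1530963910, 1530963908, 1530963898, 1530963896, 1530963895,
                   1530963894, 1530963892, 1530963881, 1530963865, 1530963860,
                   1530963834, 1530963833, 1530963829, 1530963827, 1530963812]

# Rows the max_id query returned that the initial query did not.
dropped_by_initial = sorted(set(ANCHORED_PAGE_1) - set(INITIAL_PAGE_1), reverse=True)
# Rows the initial query showed that fall off the end of the max_id page 1.
pushed_to_page_2 = sorted(set(INITIAL_PAGE_1) - set(ANCHORED_PAGE_1), reverse=True)

print(dropped_by_initial)
print(pushed_to_page_2)
```

With 15 results per page, three rows missing from the initial query means three rows spill over to page 2 of the max_id sequence, and those three are the same IDs reported earlier in the thread as the duplicates on page 2.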