First, your test set is a bit small. Did you take into account the extra data you get on your first Search API poll? Typically your first poll will return 100 items, and subsequent polls will return only "new" data if you are using since_id and/or deduplicating.
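To illustrate how much the first poll can skew a small sample, here is a minimal simulation. It is only a sketch: `search`, `run_poller` and the ID ranges are made up for the example, and nothing here calls the real Search API. The `seed_first` branch asks for a single item on the first poll and uses it only as the since_id seed.

```python
from typing import List, Optional, Set


def search(timeline: List[int], since_id: Optional[int], count: int = 100) -> List[int]:
    """Hypothetical stand-in for one search poll: newest IDs first, at most `count`."""
    newer = [i for i in timeline if since_id is None or i > since_id]
    return sorted(newer, reverse=True)[:count]


def run_poller(snapshots: List[List[int]], seed_first: bool) -> Set[int]:
    """Poll each timeline snapshot in turn, tracking since_id and deduplicating."""
    collected: Set[int] = set()
    since_id: Optional[int] = None
    for n, timeline in enumerate(snapshots):
        if n == 0 and seed_first:
            # Ask for a single item and keep it only as the since_id seed.
            newest = search(timeline, None, count=1)
            since_id = newest[0] if newest else None
            continue
        ids = search(timeline, since_id)
        collected.update(ids)  # a set silently drops duplicates
        if ids:
            since_id = max(ids)
    return collected


# Assumed data: IDs 1..150 exist before collection starts, 151..160 arrive later.
backlog = list(range(1, 151))
snapshots = [backlog, backlog + list(range(151, 156)), backlog + list(range(151, 161))]

unseeded = run_poller(snapshots, seed_first=False)
seeded = run_poller(snapshots, seed_first=True)
stream = set(range(151, 161))  # a stream reader only sees items posted after it connects

print(len(unseeded - stream))  # backlog items the stream never saw
print(seeded == stream)
```

In this toy setup the unseeded poller carries 100 backlog items that the stream could never have delivered, while the seeded poller matches the stream exactly. With only a few hundred tweets collected, a backlog of that size is a large share of the difference.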
Make sure both your poller and stream reader start at the same item. A trick, if you want to grab as many similar results as possible from the start, is to request only a single item on the first poll (using rpp=1), or to use only the most recent item of your result, and then use this item to seed your since_id on the following polls.

Another idea might be to start your stream reader first and use the first item returned by your reader to seed the since_id in your poller.

You can also simply ignore this during collection, but clean up your data once you're done collecting and make sure both data sets start and end with the same item ID.

In any case, if your difference is in fact related to the handling of the first poll, it will become marginal as your data grows.

I also ran some tests to compare results between both methods using a single keyword. With result sets of about 15000 IDs, the two sets are 98.3% identical. For testing purposes both my poller and stream reader only output IDs, so I can use cat, sort, uniq, wc and diff to compare results.

Colin

On Feb 15, 2:33 pm, Karussell <tableyourt...@googlemail.com> wrote:
> Hi John, hi Adam,
>
> thanks for your responses.
>
> > To increase recall, search sometimes includes keywords in followed links
> > and other techniques
>
> ah, ok. this would explain the differences between C and B (but not
> between C and A). I'll investigate ...
>
> > Also, are you getting rate limit messages on the Streaming API?
>
> no.
> I saw track limits (or something) when my keyword was 'java' or a
> similar high frequent term.
>
> Regards,
> Peter.
>
> On 15 Feb., 18:30, John Kalucki <j...@twitter.com> wrote:
>
> > If you examine set C, do they contain matches on fields other than the Tweet
> > text? To increase recall, search sometimes includes keywords in followed
> > links and other techniques.
>
> > Also, are you getting rate limit messages on the Streaming API?
>
> > -John Kalucki
> > http://twitter.com/jkalucki
> > Twitter, Inc.
>
> > On Tue, Feb 15, 2011 at 3:36 AM, Karussell
> > <tableyourt...@googlemail.com> wrote:
>
> > > Hi,
>
> > > this problem was already posted to the twitter4j mailing list. Not
> > > sure if it is an issue with my code, twitter4j or an API issue... user
> > > reported similar problems in the past.
>
> > > First:
>
> > > I'm doing a 100 tweet search (without paging) every 5 minutes e.g.
> > > against 'twitter search'. I get a set of tweets A - excluding the
> > > duplicates, of course. I get approx 5 new tweets for every 5 minutes,
> > > so 100 tweets as pageSize should be perfectly sufficient to get all
> > > tweets.
>
> > > Second:
> > > When I'm doing a streaming filter request for the same terms 'twitter
> > > search' then I'm getting a set of tweets B.
>
> > > The problem is: combining A and B ('C=A v B') gives me a set C where
> > > the count of C is more than 10% larger than A or B, which means that
> > > neither with search nor streaming API I can catch a nearly complete
> > > set of tweets.
>
> > > E.g. doing this for 3 hours I'm getting 254 tweets (A) for the search
> > > and 257 tweets (B) for the streaming but the combined set C has 337
> > > tweets!
>
> > > Is this a bug in my code or could this be an API issue?
>
> > > BTW: I don't assume 100% correctness, I only want something above
> > > 90% :) especially for such relatively infrequent terms, where users
> > > can, should and have noticed it.
>
> > > Regards,
> > > Peter.
>
> > > http://groups.google.com/group/twitter4j/msg/d959e6257ceb452f
>
> > > http://groups.google.com/group/twitter-development-talk/browse_thread...
> > > http://blog.tweetsmarter.com/twitter-downtime/twitters-dirty-secret-t...
>
> > > --
> > > http://jetwick.com TwitterSearch without Noise
>
> > > --
> > > Twitter developer documentation and resources: http://dev.twitter.com/doc
> > > API updates via Twitter: http://twitter.com/twitterapi
> > > Issues/Enhancements Tracker:
> > > http://code.google.com/p/twitter-api/issues/list
> > > Change your membership to this group:
> > > http://groups.google.com/group/twitter-development-talk
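
For reference, the cleanup-and-compare step described in the reply at the top (make both data sets start and end at the same item ID, then diff the IDs) could be sketched as follows. This is not the cat/sort/uniq/diff pipeline itself, just one possible equivalent. It assumes tweet IDs increase over time, the sample IDs are made up, and `overlap_ratio` is only one possible way to measure agreement, not necessarily how the 98.3% figure was computed.

```python
from typing import Iterable, Set, Tuple


def trim_to_common_range(a: Iterable[int], b: Iterable[int]) -> Tuple[Set[int], Set[int]]:
    """Keep only IDs inside the ID range covered by both collections."""
    set_a, set_b = set(a), set(b)
    if not set_a or not set_b:
        return set(), set()
    low = max(min(set_a), min(set_b))   # later of the two starting IDs
    high = min(max(set_a), max(set_b))  # earlier of the two ending IDs
    return ({i for i in set_a if low <= i <= high},
            {i for i in set_b if low <= i <= high})


def overlap_ratio(a: Set[int], b: Set[int]) -> float:
    """Share of the combined set that appears in both collections."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


search_ids = [90, 95, 100, 101, 103, 104, 105]  # made-up poller output
stream_ids = [100, 101, 102, 104, 105, 110]     # made-up stream output

a, b = trim_to_common_range(search_ids, stream_ids)
print(sorted(a - b), sorted(b - a))  # IDs seen by only one of the two methods
print(round(overlap_ratio(a, b), 3))
```

Trimming first matters because items outside the shared range (the poller's backlog, or stream items after the last poll) would otherwise count as misses even though the other method never had a chance to see them.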