First, your test set is a bit small. Did you take into account the
extra data you will get in your first search API poll? Typically your
first poll will return 100 items, then subsequent polls will return
only "new" data if you are using since_id and/or deduplicating.

Make sure both your poller and stream reader start at the same item. A
trick, if you want to grab as many similar results as possible from
the start, is to request only a single item on the first poll (using
rpp=1), or to use only the most recent item of your result, then use
this item to seed your since_id on the following polls. Another idea
might be to start your stream reader first and use the first item
returned by your reader to seed your since_id in your poller. You can
also simply ignore this during collection but clean up your data once
you're done collecting, making sure both data sets start and end with
the same item ID.
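
To make the seeding idea concrete, here is a minimal sketch in Python. It is only an illustration under assumptions: `fetch` is a hypothetical stand-in for whatever client call you use to query the search API (it is not a twitter4j or Twitter function), and `make_fake_search` is an in-memory fake so the example runs without network access.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical stand-in for a search call: fetch(since_id, count) returns
# up to `count` item IDs newer than since_id, newest first.
Fetch = Callable[[Optional[int], int], List[int]]


def seed_since_id(fetch: Fetch) -> Optional[int]:
    """First poll: ask for a single item (the rpp=1 idea) and keep its ID."""
    newest = fetch(None, 1)
    return newest[0] if newest else None


def poll_once(fetch: Fetch, since_id: Optional[int],
              page_size: int = 100) -> Tuple[List[int], Optional[int]]:
    """Later polls: return the new IDs (oldest first) and the updated since_id."""
    ids = [i for i in fetch(since_id, page_size)
           if since_id is None or i > since_id]  # defensive dedup
    if not ids:
        return [], since_id
    return sorted(ids), max(ids)


def make_fake_search(timeline: List[int]) -> Fetch:
    """In-memory fake used only for this demo; it is not a real API client."""
    def fetch(since_id: Optional[int], count: int) -> List[int]:
        newer = [i for i in timeline if since_id is None or i > since_id]
        return sorted(newer, reverse=True)[:count]
    return fetch


if __name__ == "__main__":
    timeline = [101, 102, 103]        # items that existed before we started
    fetch = make_fake_search(timeline)
    since_id = seed_since_id(fetch)   # the older backlog is skipped
    timeline.extend([104, 105])       # new items arrive afterwards
    new_ids, since_id = poll_once(fetch, since_id)
    print(new_ids, since_id)
```

The same `poll_once` loop works if you seed `since_id` from the first ID your stream reader returns instead of from a one-item poll.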

In any case, if your difference is in fact related to the handling of
the first poll, it will become marginal as your data grow.
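
As a rough illustration of why it becomes marginal, the sketch below computes the worst-case share of a search result set that could come from the first poll's backlog. It assumes the first poll contributes at most 100 items that the stream never saw, and uses the two set sizes mentioned in this thread (254 and about 15000); `backlog_share` is just a name made up for this example.

```python
def backlog_share(first_poll_items: int, total_items: int) -> float:
    """Worst-case fraction of the collected set that came from the first poll."""
    return first_poll_items / total_items


if __name__ == "__main__":
    for total in (254, 15000):
        # Assumption: up to 100 pre-existing items arrive with the first poll.
        print(total, f"{backlog_share(100, total):.1%}")
```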

I also ran some tests to compare results between both methods using a
single keyword. With result sets of about 15,000 IDs, the two sets are
98.3% identical. For testing purposes, both my poller and stream
reader output only IDs, so I can use cat, sort, uniq, wc and diff to
compare results.
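
If you would rather do the comparison in code than with shell tools, a minimal Python sketch could look like the following. It is not the exact procedure I used: the function names are made up for this example, the trimming step assumes IDs increase over time, and "overlap" is defined here as shared IDs divided by all distinct IDs, which is only one possible way to measure how identical two sets are.

```python
from typing import Dict, Iterable, Set, Tuple


def trim_to_common_range(a: Iterable[int],
                         b: Iterable[int]) -> Tuple[Set[int], Set[int]]:
    """Keep only the IDs that fall inside the ID range covered by both sets."""
    sa, sb = set(a), set(b)
    if not sa or not sb:
        return set(), set()
    lo = max(min(sa), min(sb))   # later of the two starting points
    hi = min(max(sa), max(sb))   # earlier of the two end points
    return ({i for i in sa if lo <= i <= hi},
            {i for i in sb if lo <= i <= hi})


def compare(search_ids: Iterable[int],
            stream_ids: Iterable[int]) -> Dict[str, float]:
    """Count IDs unique to each set and the share of IDs found in both."""
    a, b = set(search_ids), set(stream_ids)
    union = a | b
    return {
        "only_search": len(a - b),
        "only_stream": len(b - a),
        "both": len(a & b),
        "overlap": len(a & b) / len(union) if union else 1.0,
    }


if __name__ == "__main__":
    search = [1, 2, 3, 4, 5, 6]   # toy data, not real tweet IDs
    stream = [4, 5, 7, 8]
    a, b = trim_to_common_range(search, stream)
    print(compare(a, b))
```

By that definition, the figures quoted below (254 and 257 tweets, 337 combined) imply 254 + 257 - 337 = 174 shared tweets, an overlap of roughly 52%.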

Colin

On Feb 15, 2:33 pm, Karussell <tableyourt...@googlemail.com> wrote:
> Hi John, hi Adam,
>
> thanks for your responses.
>
> > To increase recall, search sometimes includes keywords in followed links 
> > and other techniques
>
> ah, ok. this would explain the differences between C and B (but not
> between C and A). I'll investigate ...
>
> > Also, are you getting rate limit messages on the Streaming API?
>
> no.
> I saw track limits (or something) when my keyword was 'java' or a
> similar high-frequency term.
>
> Regards,
> Peter.
>
> On 15 Feb., 18:30, John Kalucki <j...@twitter.com> wrote:
>
> > If you examine set C, do they contain matches on fields other than the Tweet
> > text? To increase recall, search sometimes includes keywords in followed
> > links and other techniques.
>
> > Also, are you getting rate limit messages on the Streaming API?
>
> > -John Kalucki
> > http://twitter.com/jkalucki
> > Twitter, Inc.
>
> > On Tue, Feb 15, 2011 at 3:36 AM, Karussell
> > <tableyourt...@googlemail.com> wrote:
>
> > > Hi,
>
> > > this problem was already posted to the twitter4j mailing list [1]. Not
> > > sure if it is an issue with my code, twitter4j or an API issue... users
> > > reported similar problems in the past [2].
>
> > > First:
>
> > > I'm doing a 100 tweet search (without paging) every 5 minutes e.g.
> > > against 'twitter search'. I get a set of tweets A - excluding the
> > > duplicates, of course. I get approx 5 new tweets for every 5 minutes,
> > > so 100 tweets as pageSize should be perfectly sufficient to get all
> > > tweets.
>
> > > Second:
> > > When I'm doing a streaming filter request for the same terms 'twitter
> > > search' then I'm getting a set of tweets B.
>
> > > The problem is: combining A and B ('C=A v B') gives me a set C where
> > > the count of C is more than 10% larger than A or B, which means that
> > > neither with search nor streaming API I can catch a nearly complete
> > > set of tweets.
>
> > > E.g. doing this for 3 hours I'm getting 254 tweets (A) for the search
> > > and 257 tweets (B) for the streaming but the combined set C has 337
> > > tweets!
>
> > > Is this a bug in my code or could this be an API issue?
>
> > > BTW: I don't assume 100% correctness, I only want something above
> > > 90% :) especially for such relatively infrequent terms, where users
> > > can, should and have noticed it.
>
> > > Regards,
> > > Peter.
>
> > > [1]
> > > http://groups.google.com/group/twitter4j/msg/d959e6257ceb452f
>
> > > [2]
> > > http://groups.google.com/group/twitter-development-talk/browse_thread...
> > > http://blog.tweetsmarter.com/twitter-downtime/twitters-dirty-secret-t...
>
> > > --
> > > http://jetwick.com TwitterSearch without Noise
>
> > > --
> > > Twitter developer documentation and resources: http://dev.twitter.com/doc
> > > API updates via Twitter: http://twitter.com/twitterapi
> > > Issues/Enhancements Tracker:
> > > http://code.google.com/p/twitter-api/issues/list
> > > Change your membership to this group:
> > > http://groups.google.com/group/twitter-development-talk

