[twitter-dev] Re: Streaming API vs. Search API: no API returns 95% of intended tweets
Thanks for the update, John!

On 18 Feb., 19:08, John Kalucki j...@twitter.com wrote:
> http://dev.twitter.com/pages/streaming_api_concepts#result-quality
>
> Search filters for relevance and is not intended as a source of all tweets. Streaming provides the complete record to allow you to perform whatever post-processing you'd like.
>
> -John Kalucki
> http://twitter.com/jkalucki
> Twitter, Inc.

--
Twitter developer documentation and resources: http://dev.twitter.com/doc
API updates via Twitter: http://twitter.com/twitterapi
Issues/Enhancements Tracker: http://code.google.com/p/twitter-api/issues/list
Change your membership to this group: http://groups.google.com/group/twitter-development-talk
Re: [twitter-dev] Re: Streaming API vs. Search API: no API returns 95% of intended tweets
http://dev.twitter.com/pages/streaming_api_concepts#result-quality

Search filters for relevance and is not intended as a source of all tweets. Streaming provides the complete record to allow you to perform whatever post-processing you'd like.

-John Kalucki
http://twitter.com/jkalucki
Twitter, Inc.

On Thu, Feb 17, 2011 at 12:15 AM, Karussell tableyourt...@googlemail.com wrote:
> [...]
> So, when I use search API I'll miss tweets and when using streaming API I'll miss text? Do I need to use both?
[twitter-dev] Re: Streaming API vs. Search API: no API returns 95% of intended tweets
Sorry, once more: by 'only in async' I meant tweets which were retrieved only via the streaming API but not via the search API.
[twitter-dev] Re: Streaming API vs. Search API: no API returns 95% of intended tweets
Hi Matt,

sorry for being unspecific. By 'only in async' I meant tweets which were found only by the streaming API ('asynchronous retrieval') but were not in the search results (**). Why are they missing when using the search API?

> Also can you give an example of what you mean by a long Tweet.

I investigated this a bit more and it seems to be intended (?): these tweets are 'only' retweets. As an example, here is one tweet that came back too short from the streaming API:

RT @bcoders: Episode 33 onsite from @JFokus with @neal4d @nicksieger @brjavaman Kirk Pepperdine is out! http://bit.ly/eikmux is #Java ...

and the same tweet (id == 37959896615886848) was more complete when returned from the search API:

RT @bcoders: Episode 33 onsite from @JFokus with @neal4d @nicksieger @brjavaman Kirk Pepperdine is out! http://bit.ly/eikmux is #Java a dead-end?

So, when I use the search API I'll miss tweets, and when I use the streaming API I'll miss text? Do I need to use both?

Regards,
Peter.

**
37952879822110720 Architecte Java J2EE: Priorité sera donnée à un candidat de la région nantaise. Merci de tran... http://bit.ly/dQhIoK #freelance #offres
37954149668622336 به روز رسانی: Nimbuzz اکنون با پشتیبانی از اتصال رسمی API فیس بوک http://t.co/ICgTAXX
37954912847400960 『Java Hangs When Converting 2.2250738585072012e-308』 http://zennin.blog55.fc2.com/blog-entry-2773.html
37956641609621504 Mastering Grails: Grails in the enterprise https://www.ibm.com/developerworks/java/library/j-grails12168/ #grails
37956994061176832 NEW! FileNet - Java/J2EE Developer - Vigilant Technologies: ( #Columbus , OH) http://bit.ly/e6ULEw #OpenSource #Jobs #Job #TweetMyJOBS
37957325557989376 After a day of Java programming in Eclipse, C++ programming in Visual Studio just feels slow and crappy :(

More examples in the given file: https://github.com/karussell/TestTwitterAPI/blob/master/discrepancy.txt
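The shortened retweet described in this message fits the way native retweets are usually delivered in stream payloads: the top-level text carries the "RT @user: " prefix and is cut to fit the length limit, while the untruncated original travels in a nested retweeted_status object. Here is a minimal Python sketch of recovering the longer text, assuming the status has already been decoded into a dict. The helper name full_text is hypothetical, and the nested payload below is a reconstruction from the two texts quoted in this message, not a captured response.

```python
def full_text(status):
    """Return the longest available text for a decoded status dict.

    Assumption: a native retweet carries the original tweet in a nested
    'retweeted_status' object with its own 'text' and 'user' fields.
    """
    original = status.get("retweeted_status")
    if not original:
        # Not a native retweet: the top-level text is all there is.
        return status["text"]
    author = original.get("user", {}).get("screen_name", "")
    prefix = f"RT @{author}: " if author else ""
    return prefix + original["text"]


# Illustrative payload, rebuilt from the tweet texts quoted in the thread.
truncated = {
    "id": 37959896615886848,
    "text": ("RT @bcoders: Episode 33 onsite from @JFokus with @neal4d "
             "@nicksieger @brjavaman Kirk Pepperdine is out! "
             "http://bit.ly/eikmux is #Java ..."),
    "retweeted_status": {
        "user": {"screen_name": "bcoders"},
        "text": ("Episode 33 onsite from @JFokus with @neal4d @nicksieger "
                 "@brjavaman Kirk Pepperdine is out! http://bit.ly/eikmux "
                 "is #Java a dead-end?"),
    },
}

print(full_text(truncated))
```

If the payload has no retweeted_status (for example an old-style manual "RT"), the sketch falls back to the top-level text, so it cannot restore anything the sender typed and truncated by hand.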
[twitter-dev] Re: Streaming API vs. Search API: no API returns 95% of intended tweets
Hi John,

Well, for a search term 'java' the async API is 'ok', and the differences 'only in search' can be easily explained: the keywords are in the URL. But the differences 'only in async' (tweets grabbed only via the streaming API) are strange to me:

https://github.com/karussell/TestTwitterAPI/blob/master/discrepancy.txt

Why are they lost? You can build the Java mini program via (or via your favourite IDE):

    mvn clean install

and call it via:

    ./myjava -Dtwitter4j.oauth.consumerKey=key -Dtwitter4j.oauth.consumerSecret=value de.jetwick.tw.NewClass java token tokenSecret

to see what I mean ...

Another strange fact is that a lot of long tweets retrieved via the streaming API have a text which is ~15 characters shorter than the identical tweet from the search API!

Regards,
Peter.

--
http://jetwick.com Twitter Search without Noise
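The 'only in search' and 'only in async' buckets used in this thread are plain set differences over tweet IDs. A small Python sketch follows, assuming each collector has already written out a list of numeric IDs. The function names classify and overlap_from_counts are made up for illustration and do not come from the TestTwitterAPI project.

```python
def classify(search_ids, stream_ids):
    """Split tweet IDs into the three buckets discussed in the thread."""
    a, b = set(search_ids), set(stream_ids)
    return {
        "only in search": a - b,   # found by polling search, missed by streaming
        "only in async": b - a,    # found by streaming, missed by search
        "both": a & b,
    }


def overlap_from_counts(n_search, n_stream, n_union):
    """Inclusion-exclusion: |A and B| = |A| + |B| - |A or B|."""
    return n_search + n_stream - n_union


def coverage(n_one_api, n_union):
    """Fraction of the combined set that a single API delivered."""
    return n_one_api / n_union


buckets = classify([1, 2, 3, 5], [2, 3, 4, 5])
print({name: sorted(ids) for name, ids in buckets.items()})
```

Applied to the counts reported in the first message of the thread (254 from search, 257 from streaming, 337 combined), inclusion-exclusion gives 254 + 257 - 337 = 174 tweets seen by both, so each API on its own covered roughly 75% of the combined set.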
[twitter-dev] Re: Streaming API vs. Search API: no API returns 95% of intended tweets
First, your test set is a bit small. Did you take into account the extra data you will get in your first search API poll? Typically your first poll will return 100 items, then subsequent polls will return only new data if you are using since_id and/or deduplicating. Make sure both your poller and stream reader start at the same item.

A trick, if you want to grab results that are as similar as possible from the start, is to request only a single item on the first poll (using rpp=1), or to use only the most recent item of your result, and then use this item to seed your since_id on the following polls.

Another idea might be to start your stream reader first and use the first item returned by your reader to seed your since_id in your poller.

You can also simply ignore this during collection, but clean up your data once you're done collecting and make sure both data sets start and end with the same item ID.

In any case, if your difference is in fact related to the handling of the first poll, it will become marginal as your data grows.

I also ran some tests to compare results between both methods using a single keyword. With result sets of about 15000 IDs, both sets are 98.3% identical. For testing purposes both my poller and stream reader only output IDs so I can use cat, sort, uniq, wc and diff to compare results.

Colin

On Feb 15, 2:33 pm, Karussell tableyourt...@googlemail.com wrote:
> Hi John, hi Adam,
>
> thanks for your responses.
>
> > To increase recall, search sometimes includes keywords in followed links and other techniques
>
> Ah, ok. This would explain the differences between C and B (but not between C and A). I'll investigate ...
>
> > Also, are you getting rate limit messages on the Streaming API?
>
> No. I saw track limits (or something) when my keyword was 'java' or a similarly high-frequency term.
>
> Regards,
> Peter.
>
> On 15 Feb., 18:30, John Kalucki j...@twitter.com wrote:
> > If you examine set C, do they contain matches on fields other than the Tweet text? To increase recall, search sometimes includes keywords in followed links and other techniques.
> >
> > Also, are you getting rate limit messages on the Streaming API?
> >
> > -John Kalucki
> > http://twitter.com/jkalucki
> > Twitter, Inc.
> >
> > On Tue, Feb 15, 2011 at 3:36 AM, Karussell tableyourt...@googlemail.com wrote:
> > > Hi,
> > >
> > > this problem was already posted to the twitter4j mailing list [1]. Not sure if it is an issue with my code, twitter4j or an API issue... users reported similar problems in the past [2].
> > >
> > > First: I'm doing a 100 tweet search (without paging) every 5 minutes, e.g. against 'twitter search'. I get a set of tweets A - excluding the duplicates, of course. I get approx. 5 new tweets every 5 minutes, so 100 tweets as pageSize should be perfectly sufficient to get all tweets.
> > >
> > > Second: When I'm doing a streaming filter request for the same terms 'twitter search', I'm getting a set of tweets B.
> > >
> > > The problem is: combining A and B ('C=A v B') gives me a set C where the count of C is more than 10% larger than A or B, which means that neither with the search nor with the streaming API can I catch a nearly complete set of tweets.
> > >
> > > E.g. doing this for 3 hours I'm getting 254 tweets (A) for the search and 257 tweets (B) for the streaming, but the combined set C has 337 tweets!
> > >
> > > Is this a bug in my code or could this be an API issue?
> > >
> > > BTW: I don't assume 100% correctness, I only want something above 90% :) especially for such relatively infrequent terms, where users can, should and have noticed it.
> > >
> > > Regards,
> > > Peter.
> > >
> > > [1] http://groups.google.com/group/twitter4j/msg/d959e6257ceb452f
> > > [2] http://groups.google.com/group/twitter-development-talk/browse_thread... http://blog.tweetsmarter.com/twitter-downtime/twitters-dirty-secret-t...
> > >
> > > --
> > > http://jetwick.com Twitter Search without Noise
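The seeding trick Colin describes (one item on the first poll, then since_id on every later poll) can be sketched as follows. This is only an illustration under stated assumptions: fetch is a hypothetical callable standing in for whatever client issues the search request, taking since_id and rpp and returning a list of dicts with an "id" key, newest first. The fake timeline exists purely so the sketch runs without network access.

```python
def seed_since_id(fetch):
    """First poll: ask for a single item and start from its ID."""
    first = fetch(since_id=None, rpp=1)
    return first[0]["id"] if first else 0


def poll_once(fetch, since_id, seen):
    """Fetch items newer than since_id, drop duplicates, advance since_id."""
    batch = fetch(since_id=since_id, rpp=100)
    new_items = [t for t in batch
                 if t["id"] > since_id and t["id"] not in seen]
    seen.update(t["id"] for t in new_items)
    new_since_id = max([since_id] + [t["id"] for t in new_items])
    return new_items, new_since_id


def make_fake_fetch(timeline):
    """Stand-in for a real search call (hypothetical, for demonstration)."""
    def fetch(since_id, rpp):
        newer = [t for t in timeline
                 if since_id is None or t["id"] > since_id]
        newer.sort(key=lambda t: t["id"], reverse=True)
        return newer[:rpp]
    return fetch


timeline = [{"id": 10}, {"id": 20}, {"id": 30}]
fetch = make_fake_fetch(timeline)
since_id = seed_since_id(fetch)          # skips the 100-item backlog
timeline.append({"id": 40})              # a tweet arriving after the seed
items, since_id = poll_once(fetch, since_id, set())
print([t["id"] for t in items], since_id)
```

Seeding this way keeps the backlog returned by a normal first poll out of the search set, so the poller and the stream reader begin at roughly the same item.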
[twitter-dev] Re: Streaming API vs. Search API: no API returns 95% of intended tweets
Hi Colin, hi John,

> To increase recall, search sometimes includes keywords in followed links and other techniques.

This is indeed the case, and 'twitter search' appears in a lot of URLs, e.g. http://search.twitter.com/search?q=jetwick. That is where the big differences came from. Can I turn off this 'feature'? Search shouldn't take that into account, although the title of the web site should be taken into account ... like it is done in jetwick ;)

I'll investigate other keywords now.

> Typically your first poll will return 100 items then subsequent polls will return only new data if using since_id and/or dedupping.

I already removed these early tweets, of course ...

> I also ran some tests

With which keywords did you run the tests?

> For testing purposes both my poller and stream reader only output IDs so I can use cat, sort, uniq, wc and diff to compare results.

Yes, I went the same way :)

Regards,
Peter.
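Nothing in this thread indicates that the URL matching can be switched off on the server side, so one workaround is to filter on the client. Below is a minimal Python sketch that keeps a search result only if every query term appears in the tweet text outside of URLs. The function name matches_outside_urls and the simple tokenisation are assumptions made for illustration, not how Twitter or jetwick matches terms.

```python
import re

# Anything that looks like an http(s) link, up to the next whitespace.
URL_RE = re.compile(r"https?://\S+")


def matches_outside_urls(text, query):
    """True if every query term occurs as a word outside of URLs."""
    without_urls = URL_RE.sub(" ", text).lower()
    words = set(re.findall(r"\w+", without_urls))
    return all(term in words for term in query.lower().split())


url_only = "nice tool http://search.twitter.com/search?q=jetwick"
in_text = "The new Twitter search is fast http://t.co/ICgTAXX"

print(matches_outside_urls(url_only, "twitter search"))
print(matches_outside_urls(in_text, "twitter search"))
```

This only catches terms that are literally visible in the URL string. If search also matches on the target of a shortened link, that cannot be detected from the tweet text alone.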
Re: [twitter-dev] Re: Streaming API vs. Search API: no API returns 95% of intended tweets
On Tue, 15 Feb 2011 21:01:07 -0800, John Kalucki j...@twitter.com wrote:
> On every occasion where I've tested the Firehose and track terms from the Streaming API against the Tweet database and against each other, there is no loss -- all the sources match exactly. Unless there's some unusual operational instability, the Streaming API returns 100% of the tweets requested, or sends a limit message to let you know what has been dropped.

What has been dropped, or how many have been dropped? ;-)

--
http://twitter.com/znmeb http://borasky-research.net
A mathematician is a device for turning coffee into theorems. -- Paul Erdős
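On the "what or how many" question: as far as the documented stream format goes, a limit notice is a small JSON object of the form {"limit": {"track": N}}, where N is a running count of undelivered tweets since the connection opened. It therefore says how many were dropped, not which ones. Here is a minimal Python sketch of separating statuses from limit notices, assuming newline-delimited JSON has already been read from the connection. The function name read_stream and the sample lines are illustrative only.

```python
import json


def read_stream(lines):
    """Split newline-delimited stream messages into statuses and a drop count."""
    statuses, dropped = [], 0
    for raw in lines:
        raw = raw.strip()
        if not raw:
            # Blank keep-alive line: nothing to parse.
            continue
        message = json.loads(raw)
        if "limit" in message:
            # Running total, so keep the largest value seen so far.
            dropped = max(dropped, message["limit"].get("track", 0))
        elif "text" in message:
            statuses.append(message)
        # Other notices (e.g. deletions) are ignored in this sketch.
    return statuses, dropped


sample = [
    '{"id": 1, "text": "java one"}',
    '',
    '{"limit": {"track": 5}}',
    '{"id": 2, "text": "java two"}',
    '{"limit": {"track": 12}}',
]
statuses, dropped = read_stream(sample)
print(len(statuses), dropped)
```

Logging the drop count next to the collected IDs makes it possible to tell whether a gap between the search and streaming sets could come from rate limiting at all.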