[twitter-dev] Re: Streaming API vs. Search API: no API returns 95% of intended tweets

2011-02-20 Thread Karussell
Thanks for the update John!

On 18 Feb., 19:08, John Kalucki j...@twitter.com wrote:
 http://dev.twitter.com/pages/streaming_api_concepts#result-quality

 Search filters for relevance and is not intended as a source of all tweets.
 Streaming provides the complete record to allow you to perform whatever
 post-processing you'd like.

 -John Kalucki
 http://twitter.com/jkalucki
 Twitter, Inc.


-- 
Twitter developer documentation and resources: http://dev.twitter.com/doc
API updates via Twitter: http://twitter.com/twitterapi
Issues/Enhancements Tracker: http://code.google.com/p/twitter-api/issues/list
Change your membership to this group: 
http://groups.google.com/group/twitter-development-talk


Re: [twitter-dev] Re: Streaming API vs. Search API: no API returns 95% of intended tweets

2011-02-18 Thread John Kalucki
http://dev.twitter.com/pages/streaming_api_concepts#result-quality

Search filters for relevance and is not intended as a source of all tweets.
Streaming provides the complete record to allow you to perform whatever
post-processing you'd like.

-John Kalucki
http://twitter.com/jkalucki
Twitter, Inc.





[twitter-dev] Re: Streaming API vs. Search API: no API returns 95% of intended tweets

2011-02-17 Thread Karussell
Sorry, once more:

By 'only in async' I meant tweets which were retrieved only via the
streaming API and not via the search API.



[twitter-dev] Re: Streaming API vs. Search API: no API returns 95% of intended tweets

2011-02-17 Thread Karussell
Hi Matt,

Sorry for being unspecific. By 'only in async' I meant tweets which
were only found by the streaming API ('asynchronous retrieval') but
were not in the search results **

Why are they missing when using the search API?

 Also can you give an example of what you mean by a long Tweet.

I investigated this a bit more and it seems to be intended (?):
these tweets are 'only' retweets. As an example, here is one tweet
that came back too short from the streaming API:

RT @bcoders: Episode 33 onsite from @JFokus with @neal4d @nicksieger
@brjavaman  Kirk Pepperdine is out! http://bit.ly/eikmux is
#Java ...

and the same tweet (id == 37959896615886848) was more complete when
returned from the search API:

RT @bcoders: Episode 33 onsite from @JFokus with @neal4d @nicksieger
@brjavaman  Kirk Pepperdine is out! http://bit.ly/eikmux is #Java a
dead-end?

So when I use the search API I'll miss tweets, and when I use the
streaming API I'll miss text? Do I need to use both?
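
A possible explanation for the missing text, stated as an assumption rather than anything confirmed in this thread: for a native retweet the top-level text is the original tweet with an "RT @user: " prefix, cut off at the length limit, while the untruncated original is carried in a nested retweeted_status object. The prefix "RT @bcoders: " is 13 characters, which is in the range of the ~15 missing characters. A minimal Python sketch under that assumption (the field names retweeted_status, user.screen_name and text are assumed, and the sample payload is made up from the example above):

```python
def full_text(status: dict) -> str:
    """Rebuild the full retweet text from the embedded original, if there is one."""
    original = status.get("retweeted_status")
    if not original:
        return status["text"]  # not a native retweet: nothing to recover
    author = original["user"]["screen_name"]
    return f"RT @{author}: {original['text']}"


# Made-up payload modelled on the truncated example in this message.
sample = {
    "id": 37959896615886848,
    "text": ("RT @bcoders: Episode 33 onsite from @JFokus with @neal4d "
             "@nicksieger @brjavaman  Kirk Pepperdine is out! "
             "http://bit.ly/eikmux is #Java ..."),
    "retweeted_status": {
        "user": {"screen_name": "bcoders"},
        "text": ("Episode 33 onsite from @JFokus with @neal4d "
                 "@nicksieger @brjavaman  Kirk Pepperdine is out! "
                 "http://bit.ly/eikmux is #Java a dead-end?"),
    },
}

print(full_text(sample))
```

If this assumption holds, the streaming API alone would be enough, because the missing tail can be recovered from the embedded original.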

Regards,
Peter.

**
37952879822110720 Architecte Java J2EE: Priorité sera donnée à un
candidat de la région nantaise. Merci de tran... http://bit.ly/dQhIoK
#freelance #offres
37954149668622336 به روز رسانی: Nimbuzz اکنون با پشتیبانی از اتصال
رسمی API فیس بوک http://t.co/ICgTAXX
37954912847400960 『Java Hangs When Converting 2.2250738585072012e-308』
http://zennin.blog55.fc2.com/blog-entry-2773.html
37956641609621504 Mastering Grails: Grails in the enterprise
https://www.ibm.com/developerworks/java/library/j-grails12168/ #grails
37956994061176832 NEW! FileNet - Java/J2EE Developer - Vigilant
Technologies:  ( #Columbus , OH) http://bit.ly/e6ULEw #OpenSource
#Jobs #Job #TweetMyJOBS
37957325557989376 After a day of Java programming in Eclipse, C++
programming in Visual Studio just feels slow and crappy :(

More examples are in this file:
https://github.com/karussell/TestTwitterAPI/blob/master/discrepancy.txt



[twitter-dev] Re: Streaming API vs. Search API: no API returns 95% of intended tweets

2011-02-16 Thread Karussell
Hi John,

Well, for the search term 'java' the async API is 'ok', and the
differences 'only in search' can be easily explained: the keywords are
in the URL.
But the differences 'only in async' (tweets grabbed only via the
streaming API) are strange to me:

https://github.com/karussell/TestTwitterAPI/blob/master/discrepancy.txt

Why are they lost?

You can build the small Java program with Maven (or with your favourite IDE):
mvn clean install

and call it via:
./myjava -Dtwitter4j.oauth.consumerKey=key -Dtwitter4j.oauth.consumerSecret=value de.jetwick.tw.NewClass java token tokenSecret

to see what I mean ...

Another strange fact is that many long tweets retrieved via the
streaming API have text that is ~15 characters shorter than the
identical tweet from the search API!

Regards,
Peter.

--

http://jetwick.com Twitter Search without Noise



[twitter-dev] Re: Streaming API vs. Search API: no API returns 95% of intended tweets

2011-02-15 Thread Colin Surprenant
First, your test set is a bit small. Did you take into account the
extra data you will get in your first search API poll? Typically your
first poll will return 100 items, then subsequent polls will return
only new data if you are using since_id and/or deduplicating.

Make sure both your poller and stream reader start at the same item.
One trick, if you want results that are as similar as possible from
the start, is to request only a single item on the first poll (using
rpp=1), or to use only the most recent item of your result, and then
use this item to seed your since_id on the following polls. Another
idea is to start your stream reader first and use the first item it
returns to seed the since_id in your poller. You can also simply
ignore this during collection, but clean up your data once you're
done collecting and make sure both data sets start and end with the
same item ID.
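
The seeding trick can be sketched roughly as follows in Python. The `search(since_id, rpp)` callable is a hypothetical stand-in for whatever client call the poller really makes, and `FakeSearch` exists only to make the sketch runnable; neither is part of twitter4j or the Twitter API.

```python
from typing import Callable, List, Optional, Tuple

# search(since_id, rpp) -> tweet IDs, newest first. Hypothetical signature
# standing in for whatever client call the poller really uses.
Search = Callable[[Optional[int], int], List[int]]


def seed_since_id(search: Search) -> Optional[int]:
    """First poll: ask for a single item and use it only as the starting point."""
    newest = search(None, 1)
    return newest[0] if newest else None


def poll_once(search: Search, since_id: Optional[int]) -> Tuple[List[int], Optional[int]]:
    """Later polls: fetch up to 100 IDs newer than since_id and advance it."""
    batch = search(since_id, 100)
    if batch:
        since_id = max(batch)
    return batch, since_id


class FakeSearch:
    """In-memory stand-in for the search API, for demonstration only."""

    def __init__(self, ids: List[int]) -> None:
        self.ids = list(ids)

    def __call__(self, since_id: Optional[int], rpp: int) -> List[int]:
        newer = [i for i in self.ids if since_id is None or i > since_id]
        return sorted(newer, reverse=True)[:rpp]


fake = FakeSearch([10, 20, 30])
since_id = seed_since_id(fake)          # the backlog (10, 20) is never collected
fake.ids += [40, 50]                    # tweets arriving after the seed
new_ids, since_id = poll_once(fake, since_id)
print(new_ids, since_id)
```

Seeding this way keeps the 100-item backlog of the first poll out of the search set, so it does not show up as a spurious 'only in search' difference.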

In any case, if your difference is in fact related to the handling of
the first poll, it will become marginal as your data grow.

I also ran some tests to compare results between both methods using a
single keyword. With result sets of about 15000 IDs, the two sets are
98.3% identical. For testing purposes both my poller and stream
reader only output IDs so I can use cat, sort, uniq, wc and diff to
compare results.
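
The same comparison can be sketched in Python instead of shell tools. This assumes tweet IDs grow over time, so trimming both sets to their shared ID range approximates "start and end with the same item ID", and it assumes "identical" means shared IDs divided by all IDs seen; the thread does not say how the 98.3% was computed. The function names and sample IDs are illustrative.

```python
from typing import Dict, Set, Tuple


def trim_to_common_range(a: Set[int], b: Set[int]) -> Tuple[Set[int], Set[int]]:
    """Keep only IDs inside the ID range that both collections cover."""
    if not a or not b:
        return set(), set()
    low, high = max(min(a), min(b)), min(max(a), max(b))
    return ({i for i in a if low <= i <= high},
            {i for i in b if low <= i <= high})


def compare(a: Set[int], b: Set[int]) -> Dict[str, object]:
    """Report what each side is missing and the overlap as a percentage."""
    union = a | b
    overlap = 100.0 * len(a & b) / len(union) if union else 100.0
    return {"only_a": a - b, "only_b": b - a, "overlap_pct": overlap}


search_ids = {1, 2, 3, 5, 6}   # made-up IDs from the poller
stream_ids = {3, 4, 5, 6, 9}   # made-up IDs from the stream reader
report = compare(*trim_to_common_range(search_ids, stream_ids))
print(report)
```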

Colin

On Feb 15, 2:33 pm, Karussell tableyourt...@googlemail.com wrote:
 Hi John, hi Adam,

 thanks for your responses.

  To increase recall, search sometimes includes keywords in followed links 
  and other techniques

 Ah, ok. This would explain the differences between C and B (but not
 between C and A). I'll investigate ...

  Also, are you getting rate limit messages on the Streaming API?

 no.
 I saw track limits (or something) when my keyword was 'java' or a
 similarly high-frequency term.

 Regards,
 Peter.

 On 15 Feb., 18:30, John Kalucki j...@twitter.com wrote:

  If you examine set C, do they contain matches on fields other than the Tweet
  text? To increase recall, search sometimes includes keywords in followed
  links and other techniques.

  Also, are you getting rate limit messages on the Streaming API?

  -John Kalucki
  http://twitter.com/jkalucki
  Twitter, Inc.

  On Tue, Feb 15, 2011 at 3:36 AM, Karussell tableyourt...@googlemail.com wrote:

   Hi,

   this problem was already posted to the twitter4j mailing list [1]. Not
   sure if it is an issue with my code, twitter4j or an API issue... users
   reported similar problems in the past [2].

   First:

   I'm doing a 100 tweet search (without paging) every 5 minutes e.g.
   against 'twitter search'. I get a set of tweets A - excluding the
   duplicates, of course. I get approx 5 new tweets for every 5 minutes,
   so 100 tweets as pageSize should be perfectly sufficient to get all
   tweets.

   Second:
   When I'm doing a streaming filter request for the same terms 'twitter
   search' then I'm getting a set of tweets B.

   The problem is: combining A and B ('C=A v B') gives me a set C where
   the count of C is more than 10% larger than A or B, which means that
   with neither the search nor the streaming API can I catch a nearly
   complete set of tweets.

   E.g. doing this for 3 hours I'm getting 254 tweets (A) for the search
   and 257 tweets (B) for the streaming but the combined set C has 337
   tweets!

   Is this a bug in my code or could this be an API issue?

   BTW: I don't assume 100% correctness, I only want something above
   90% :) especially for such relatively infrequent terms, where users
   can, should and have noticed it.

   Regards,
   Peter.

   [1]
  http://groups.google.com/group/twitter4j/msg/d959e6257ceb452f

   [2]

  http://groups.google.com/group/twitter-development-talk/browse_thread...

  http://blog.tweetsmarter.com/twitter-downtime/twitters-dirty-secret-t...

   --

  http://jetwick.com Twitter Search without Noise
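
The counts in the quoted original post (254 tweets from search, 257 from streaming, 337 combined) can be unpacked with inclusion-exclusion. A short Python sketch of that arithmetic; the variable names are illustrative and only the three counts come from the post:

```python
size_search = 254   # set A
size_stream = 257   # set B
size_union = 337    # set C, A and B combined

# |A and B| = |A| + |B| - |A or B|
both = size_search + size_stream - size_union
only_search = size_search - both
only_stream = size_stream - both

# Share of the combined set that each API delivered on its own.
coverage_search = 100.0 * size_search / size_union
coverage_stream = 100.0 * size_stream / size_union

print(both, only_search, only_stream)
print(round(coverage_search, 1), round(coverage_stream, 1))
```

Under these numbers each API on its own delivered roughly three quarters of the combined set, which is well below the 90% the original post was hoping for.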




[twitter-dev] Re: Streaming API vs. Search API: no API returns 95% of intended tweets

2011-02-15 Thread Karussell
Hi Colin, hi John,

 To increase recall, search sometimes includes keywords in followed links and 
 other techniques.

This is indeed the case, and 'twitter search' appears in a lot of URLs like:

http://search.twitter.com/search?q=jetwick

That is where the big differences came from. Can I turn off this
'feature'? It shouldn't take that into account, although the title of
the web site should be taken into account ... like it is done in
jetwick ;)

I'll investigate for other keywords now.
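
To separate such URL-only matches from genuine ones when examining the 'only in search' set, one could check whether the query terms still occur once links are stripped from the tweet text. A minimal Python sketch, assuming plain substring matching is good enough for a first look (the function name and the URL pattern are illustrative, not part of any Twitter API):

```python
import re

# Rough pattern for links inside tweet text; an assumption, not a full URL grammar.
URL_RE = re.compile(r"https?://\S+")


def matches_outside_urls(text: str, query: str) -> bool:
    """True if every query term occurs in the text once links are removed."""
    without_links = URL_RE.sub(" ", text).lower()
    return all(term in without_links for term in query.lower().split())


tweets = [
    "twitter search is slow today",
    "try http://search.twitter.com/search?q=jetwick now",
]
for tweet in tweets:
    print(matches_outside_urls(tweet, "twitter search"), tweet)
```

A tweet that fails this check and shows no keyword in its visible text at all may have matched through the target of a shortened link, which this sketch cannot see.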

 Typically your first poll will return 100 items then subsequent polls
 will return only new data if using since_id and/or deduplicating.

I already removed these early tweets, of course ...

 I also ran some tests

Which keywords did you run the tests with?

 For testing purposes both my poller and stream reader only output IDs
  so I can use cat, sort, uniq, wc and diff to compare results.

Yes, I went the same way :)

Regards,
Peter.



Re: [twitter-dev] Re: Streaming API vs. Search API: no API returns 95% of intended tweets

2011-02-15 Thread M. Edward (Ed) Borasky
On Tue, 15 Feb 2011 21:01:07 -0800, John Kalucki j...@twitter.com wrote:

 On every occasion where I've tested the Firehose and track terms from
 the Streaming API against the Tweet database and against each other,
 there is no loss -- all the sources match exactly. Unless there's some
 unusual operational instability, the Streaming API returns 100% of the
 tweets requested, or sends a limit message to let you know what has
 been dropped.


What has been dropped, or how many have been dropped? ;-)
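
For context, the Streaming API documentation describes the limit notice as a count rather than a list: a message of the form {"limit":{"track":N}}, where N is the number of matching statuses withheld since the connection was opened. A minimal Python sketch that separates such notices from statuses, assuming newline-delimited JSON in that format (the function name and the sample lines are made up):

```python
import json
from typing import Iterable, List, Tuple


def split_stream(lines: Iterable[str]) -> Tuple[List[dict], int]:
    """Return the statuses seen and the latest cumulative drop count."""
    statuses: List[dict] = []
    dropped = 0
    for line in lines:
        line = line.strip()
        if not line:                 # blank keep-alive lines carry no data
            continue
        message = json.loads(line)
        if "limit" in message:
            # The count is cumulative per connection, so keep the largest seen.
            dropped = max(dropped, message["limit"].get("track", 0))
        elif "text" in message:
            statuses.append(message)
    return statuses, dropped


sample_lines = [
    '{"id": 1, "text": "java is fun"}',
    "",
    '{"limit": {"track": 12}}',
    '{"id": 2, "text": "more java"}',
    '{"limit": {"track": 30}}',
]
statuses, dropped = split_stream(sample_lines)
print(len(statuses), dropped)
```

Under that format a client learns how many matching tweets were withheld, but not which ones.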

--
http://twitter.com/znmeb http://borasky-research.net

A mathematician is a device for turning coffee into theorems. -- Paul Erdős

