Thought I'd chime in here and add my support for Phirehose - Fenn, nice work!
We just did our first test run with Phirehose on Sunday to track all of the traffic related to Super Bowl ads. At peak it was pulling a consistent 120 tweets/sec with ease; we were only limited by our account's rate limiting. The results of our experiment are here, if you're interested: http://squawq.com/superbowl/

I'd recommend dumping the raw tweets to disk and rotating the files at 1-10 minute intervals, depending on volume. You'll want the files to be small enough that they can easily be loaded into memory, parsed, and filtered. You can then write another PHP script to consume those files and do all of the phrase matching and "AND" joining, reproducing most of the functionality that the search API gives you out of the box.

We originally built our tweet import pipeline last year around the search API, so for us it ended up being pretty convenient to parse the streaming API logs, perform any join operations, and then export the result to a filtered tweet log in a format that simulates what you'd see coming out of the search API. Those filtered logs can then be consumed by the existing system, allowing the same back end to talk to either the search API or the streaming API, as needed.

Good luck! You'll be able to make it work.

Jason (@jmstriegel)

On Feb 7, 6:11 pm, Fenn <[email protected]> wrote:
> Hey Guys,
>
> I'm the author of Phirehose, so thought I'd jump in with some quick
> thoughts -
>
> Russ, you're spot on - what you describe there is exactly how
> Phirehose is designed to be used.
>
> For both of your use cases, the number of tweets seems to be very
> low (ie: ~200/week), in which case, to be completely honest, you could
> _probably_ just decode/process the tweets inline in the
> enqueueStatus() method.
> The reason this is highly discouraged (as you mention, Russ) is that
> if you ever did get a sudden peak (ie: the term you're tracking became
> a trending topic) you could end up with a backlog in your stream,
> which can cause you to become disconnected. To quote the twitter wiki
> on an API client:
>
> "...should be isolated from any subsequent downstream processing
> backlog or maintenance, otherwise queuing will occur in the Streaming
> API. Eventually your client will be disconnected, resulting in data
> loss..."
>
> maestrojed: Although I disclaim my examples as not being "production
> ready", this is simply because I have not tested them extensively; I
> have no reason to believe they would not work. There are quite a few
> people using it in production, ie: Thomas here:
> http://groups.google.com/group/phirehose-users/msg/af2c6eb424d7e117
> who has it processing 1000 tweets per minute quite happily.
>
> If you prefer not to work with flat files you could definitely insert
> straight into a database. The reason flat files are nice (for
> single-threaded applications) is that you're relying only on the disk.
> A modern hard drive can easily write 40MB/s, which should be far higher
> than any twitter stream you're likely to encounter. Conversely, if you
> write to a DB, there are other things to consider (row/table locks,
> concurrency, etc).
>
> To be completely honest, the clauses above are overly paranoid -
> unless you're processing large numbers of tweets per second you
> probably don't need to worry about this sort of thing. However, you
> should always remember that twitter is growing rapidly, and what may
> be a "quiet stream" today could be a raging torrent tomorrow.
>
> Cheers!
>
> Fenn.
>
> On Jan 27, 10:03 am, phptek <[email protected]> wrote:
> > I too am using Phirehose for a similar small no. of tweets.
> > The general idea with streaming is not to process stuff live "on the
> > wire" but to parse it into files or a DB and then further process that
> > data from some additional code.
> >
> > As an example: I am looking to get geocoded tweets for specific areas
> > in New Zealand, which I can do with Phirehose. The idea with libs like
> > Phirehose and others is to use them as a base for your own further
> > work. So what I will do (as soon as I figure out why I'm getting blank
> > lines out of curl) is perhaps to modify the enqueueStatus() function in
> > filter-track-geo.php and, instead of using print_r() and printing
> > everything to the screen, I will create my own routine to insert the
> > tweet data into a DB.
> >
> > I will then likely create a new Phirehose method (or procedural code
> > first!) to query the DB and post-process the tweets for specific
> > keywords.
> >
> > I do it in this 2-stage process 1) because inserting into a DB first
> > gives you some 'buffer' from issues that the API states can arise in a
> > stream, and 2) because AFAIK you can't concatenate parameters like
> > 'location' and 'track' together, as they are logically "OR'd" (probably
> > due to the potential for the API to easily overload twitter's servers).
> >
> > Does any of that make sense? I haven't written any code as yet so
> > can't give any concrete examples, but I'm assuming you are at least
> > familiar with PHP?
> >
> > Cheers
> > Russ
> >
> > On Jan 27, 7:21 am, maestrojed <[email protected]> wrote:
> > > For a project I want to collect all tweets containing a keyword and
> > > store them in a database. I have built this functionality using the
> > > search api but was missing tweets. I was told to switch to the
> > > streaming API. You can see that post here:
> > > http://groups.google.com/group/twitter-development-talk/browse_thread...
> > >
> > > Although using the search API was straightforward, even easy, I am
> > > quite out of my league with this streaming API.
> > > In fact, I have never worked with streaming data at all. I have read
> > > the documentation and the only 2 examples I could find anywhere on
> > > the internet. There is a PHP library called Phirehose
> > > (http://code.google.com/p/phirehose/) but it only helps with the
> > > streaming connection. I am still at a loss as to how to process the
> > > tweets. The example included with the Phirehose library is the most
> > > complete example I could find, but it states that it is not ready for
> > > production. This example writes all the tweets to a flat file which I
> > > guess can then be parsed and stored in a DB. Is this a necessary
> > > step? Could one go straight to the DB? Can anyone help me out? Of
> > > course any example of production-ready code would be amazingly cool,
> > > but realistically, if someone can point out what issues need to be
> > > addressed in this code, that could help a lot too. Here is that
> > > example: http://pastebin.com/fe677e00
> > >
> > > Maybe I am asking for too much. I know I probably am. I just would
> > > love to have these tweets stored in my DB and know that, as it is
> > > now, I don't have the knowledge to confidently release code into
> > > production. BTW, the keyword I am targeting is not very popular, ~200
> > > tweets a week, and this project is just to indefinitely store these
> > > tweets and display them on a web page. I am not building a client or
> > > anything like that. It's a fairly small project.
> > >
> > > This was the other example I worked on, and although it worked, the
> > > author states it is lacking some necessities:
> > > http://blog.corunet.com/twitter-alerts-using-twitter-streaming-api/
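For reference, Jason's "dump the raw tweets to disk and rotate the files" scheme at the top of the thread could be sketched roughly like this. The helper names, directory layout, and 5-minute interval are assumptions, not Phirehose API; bucketing the filename by a fixed interval gives the 1-10 minute rotation without any explicit rotation bookkeeping.

```php
<?php
// Hypothetical sketch of the flat-file rotation approach (untested in production).

// Map a timestamp to the log file for its time window, e.g. a 300-second
// interval puts 10:00:00-10:04:59 into one file and 10:05:00+ into the next.
function rotatedLogName($dir, $intervalSecs, $now = null)
{
    $now = ($now === null) ? time() : $now;
    $bucket = $now - ($now % $intervalSecs); // start of the current window
    return $dir . '/tweets.' . date('Ymd-His', $bucket) . '.log';
}

// Append one raw JSON tweet per line; line-per-tweet keeps the files
// trivially parseable by a separate consumer script later.
function appendTweet($dir, $intervalSecs, $rawJson)
{
    file_put_contents(rotatedLogName($dir, $intervalSecs), $rawJson . "\n", FILE_APPEND);
}

// In a Phirehose subclass this would be called from the enqueueStatus()
// hook, keeping that method as cheap as possible, e.g.:
//   public function enqueueStatus($status) {
//       appendTweet('/var/log/tweets', 300, $status);
//   }
```

Because the filename is derived from the bucketed timestamp, no state needs to be kept between calls: the rotation happens "for free" the moment the clock crosses an interval boundary.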
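The consumer script Jason describes - the one that does "all of the phrase matching and 'AND' joining" over the rotated files - might look something like the sketch below. The `text` field follows the streaming API's JSON; the function names and everything else are hypothetical. The point of the AND join is that the streaming filter only ORs its track terms together, so multi-term matching has to happen downstream.

```php
<?php
// Hypothetical consumer for the rotated log files (one JSON tweet per line).

// True only if the tweet text contains every term, case-insensitively -
// the "AND" semantics the search API gives you but the streaming filter doesn't.
function matchesAll($text, array $terms)
{
    foreach ($terms as $term) {
        if (stripos($text, $term) === false) {
            return false;
        }
    }
    return true;
}

// Load one rotated log file, decode each line, and keep matching tweets.
function filterLogFile($path, array $terms)
{
    $matches = array();
    foreach (file($path, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES) as $line) {
        $tweet = json_decode($line, true);
        // Skip non-status lines (delete notices, truncated writes, etc.)
        if ($tweet !== null && isset($tweet['text']) && matchesAll($tweet['text'], $terms)) {
            $matches[] = $tweet;
        }
    }
    return $matches;
}
```

Because each rotated file is small (per Jason's 1-10 minute windows), loading it whole with `file()` is fine; a cron job could run this over each closed file and feed the matches to the existing back end.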
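Russ's stage 1 - decoding each tweet inside enqueueStatus() and inserting it into a DB instead of print_r()-ing it - could be sketched like this. The table name and columns are invented for illustration, and SQLite is used only to keep the sketch self-contained; swap the DSN for MySQL in a real setup. As Fenn notes, DB writes bring locking and concurrency questions that flat files avoid, so keep the insert path lean.

```php
<?php
// Hypothetical DB-backed store for raw streaming-API tweets.

// Open a PDO connection and create an assumed tweets table if needed.
function makeTweetStore($dsn)
{
    $db = new PDO($dsn);
    $db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
    $db->exec('CREATE TABLE IF NOT EXISTS tweets (
                 id TEXT PRIMARY KEY, screen_name TEXT, text TEXT, raw TEXT)');
    return $db;
}

// Decode one raw JSON line and insert it; returns false for non-status
// messages (e.g. delete notices), which have no id_str.
function storeTweet(PDO $db, $rawJson)
{
    $t = json_decode($rawJson, true);
    if ($t === null || !isset($t['id_str'])) {
        return false;
    }
    // INSERT OR IGNORE is SQLite syntax; it makes replays of a log idempotent.
    $stmt = $db->prepare('INSERT OR IGNORE INTO tweets (id, screen_name, text, raw)
                          VALUES (?, ?, ?, ?)');
    $stmt->execute(array(
        $t['id_str'],
        isset($t['user']['screen_name']) ? $t['user']['screen_name'] : null,
        isset($t['text']) ? $t['text'] : null,
        $rawJson,
    ));
    return true;
}
```

Keeping the raw JSON alongside the extracted columns means stage 2 (Russ's keyword post-processing, or maestrojed's web page) can always go back to the full tweet if the schema turns out to be missing a field.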
