[twitter-dev] Re: Track streaming : how to match tweets?

Julien Tue, 08 Dec 2009 14:36:18 -0800

Thanks Mark, but as I said, we need to fetch more complex feeds too. So
we'll use the OR with the simple query, and then query the search API
with the complex query to see if a given tweet matches what we need!


Julien

On Dec 8, 12:55 am, Mark McBride <mmcbr...@twitter.com> wrote:
> Note that search API whitelisting is different from regular API
> whitelisting, and getting a 20k/hour limit there is much more
> restrictive.
>
> I still haven't seen a case where you couldn't do the matching on your
> side.  As John says, with the streaming API right now you can only
> match simple terms, so the complex terms aren't a factor.  In fact the
> track you posted won't actually function as you intend with the
> streaming API.  You could track for tweets containing starbucks or
> free.  But currently that's it.  "starbucks AND free" is something
> you'd have to implement on your side.  Same with near.
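
A minimal sketch of doing the "starbucks AND free" check on the client side, in Python. It assumes the tweet text is already available as a string and that splitting on non-word characters is a close enough approximation of how track terms are tokenized. The function names are made up for illustration.

```python
import re

def tokens(text):
    """Lowercase the text and split it into a set of word tokens."""
    return set(re.findall(r"\w+", text.lower()))

def matches_all(text, terms):
    """True if every term appears as a whole token in the text (an AND match)."""
    return {t.lower() for t in terms} <= tokens(text)

def matches_any(text, terms):
    """True if at least one term appears as a whole token (an OR match)."""
    return bool({t.lower() for t in terms} & tokens(text))
```

With this, a tweet delivered for the track "starbucks" can be kept only when `matches_all(text, ["starbucks", "free"])` is true. A location filter such as near: would need separate handling that this sketch does not attempt.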
>
>
>
>
>
> On Mon, Dec 7, 2009 at 3:45 PM, Julien <julien.genest...@gmail.com> wrote:
> > Hum... ok... sad, but I have an idea. Please tell me if this is
> > stupid.
>
> > So, for each tweet I receive, I know what searches it _may_ match.
> > Right?
> > So, with all these "candidate" queries, what I can do is run them
> > against the regular search API (as long as they're complex). If the
> > polling results include the tweet, then I know that the search
> > matches and I don't have to build anything on top of what you built.
>
> > Let's take an example :
> > -  If I have a search for "starbuck AND free near:94123"
> > - I track "starbuck" with the streaming API
> > - Whenever you guys send me a tweet for this track
> > -  I check internally all the queries that may match Starbucks
> > - I perform them on your API
> > - if the tweet you sent me is in the results, then I know this tweet
> > is valid,
> > - if not, I discard it.
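
The steps above could be sketched roughly as follows, in Python. Everything here is an assumption for illustration: `CandidateIndex`, `confirmed_queries`, and especially `search_fn`, a hypothetical callable that takes a query string and returns tweet ids. It stands in for whatever search API client is actually used and is not a real Twitter call.

```python
import re
from collections import defaultdict

class CandidateIndex:
    """Maps each tracked keyword to the complex queries that depend on it."""

    def __init__(self):
        self._by_keyword = defaultdict(set)

    def add(self, keyword, query):
        self._by_keyword[keyword.lower()].add(query)

    def candidates(self, text):
        """Return every stored query whose tracked keyword occurs in the text."""
        words = set(re.findall(r"\w+", text.lower()))
        found = set()
        for word in words:
            found |= self._by_keyword.get(word, set())
        return found

def confirmed_queries(tweet_id, text, index, search_fn):
    """Keep only the candidate queries whose search results contain the tweet.

    search_fn is hypothetical: query string in, iterable of tweet ids out.
    """
    return {q for q in index.candidates(text) if tweet_id in set(search_fn(q))}
```

One design consequence: if a tweet has not yet appeared in the search results when the check runs, it is discarded, so some retry or delay may be needed in practice.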
>
> > My only concern here is the 20k/hour limit. I think this is still
> > doable, because
> > 1) we will only make queries to the search API when we receive
> > notifications
> > 2) we will only make queries to the search API for complex queries
> > (i.e. AND, +, "" or near:)
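
To stay under the 20k/hour figure mentioned above, the search calls could be gated by a simple sliding-window counter. This is a sketch in Python. The class name, the `clock` parameter, and the choice of a sliding window (rather than whatever accounting the API itself uses) are assumptions.

```python
import time
from collections import deque

class HourlyBudget:
    """Allows at most `limit` calls within any `window` seconds."""

    def __init__(self, limit=20000, window=3600.0, clock=time.monotonic):
        self.limit = limit
        self.window = window
        self.clock = clock
        self._stamps = deque()

    def try_acquire(self):
        """Record a call and return True if the budget allows it, else False."""
        now = self.clock()
        # Drop timestamps that have aged out of the window.
        while self._stamps and now - self._stamps[0] >= self.window:
            self._stamps.popleft()
        if len(self._stamps) >= self.limit:
            return False
        self._stamps.append(now)
        return True
```

A caller would check `try_acquire()` before each search request and queue or skip the verification when it returns False.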
>
> > The pros :
> > - whenever you guys change/add stuff to your search DSL, I don't have to
> > change anything on my side.
>
> > How does that sound?
>
> > Thanks John anyway for your great help!
>
> > Julien
>
> > On Dec 5, 3:32 pm, John Kalucki <j...@twitter.com> wrote:
> >> This could only make sense if the Streaming API supported "search engine
> >> logic". Currently Streaming only supports keyword matching -- you have to
> >> post-process to add additional predicate operators beyond OR. You can
> >> reproduce the keyword match in a few lines of code, and the rest is
> >> (currently) all up to you anyway. Just remember that a given tweet could
> >> have triggered multiple predicates.
>
> >> Beyond being a low priority feature, rendering and delivering custom
> >> responses per user would be a performance risk. We currently can support a
> >> very large number of filter clients per server, and we want to preserve
> >> this performance.
>
> >> -John Kalucki
> >> http://twitter.com/jkalucki
> >> Services, Twitter Inc.
>
> >> On Sat, Dec 5, 2009 at 3:18 AM, Julien <julien.genest...@gmail.com> wrote:
> >> > Thanks Dave,
>
> >> > I think I get it from your example... yet, in our case, we have
> >> > several thousands of keywords, and many many complex searches (with
> >> > filter:, "and", "or", near: ... and so on).
>
> >> > I keep thinking that instead of re-implementing on my side the search
> >> > engine logic that Twitter has, it would be simpler for them to also
> >> > send the matching keywords. An even more elegant solution (yet
> >> > slightly more complex) would be to be able to pass parameters along
> >> > with the search I give, such as a unique search_id (that I can store
> >> > on my side) and then, instead of giving me the matched keywords/search
> >> > terms, they could just give me back that search_id. That would be
> >> > something like this :
>
> >> > Right now it is :
> >> > POST  http://stream.twitter.com/1/statuses/filter.json
> >> > track=paris,twitter+superfeedr,"julien near:france"
>
> >> > It would be awesome if I could do :
> >> > POST  http://stream.twitter.com/1/statuses/filter.json
> >> > track={"paris":"my_search_1","twitter+superfeedr":"my_search_2","julien near:france":"my_search_3"}
>
> >> > And then, upon notifications, they would just pass me this search key
> >> > my_search_xx
>
> >> > I know and understand this implies a little bit of work for Twitter,
> >> > but it also removes the pain from each subscriber to this streaming
> >> > API who has to re-implement again and again the "search engine" from
> >> > Twitter.
>
> >> > On Dec 4, 11:33 am, Dave Sherohman <d...@fishtwits.com> wrote:
> >> > > On Thu, Dec 03, 2009 at 03:12:05PM -0800, Julien wrote:
> >> > > > Well, then I'd need some help with that...
>
> >> > > > Again, it's easy with single search keywords, but I haven't found a
> >> > > > solution for combined searches like twitter+stream or photo+Paris...
> >> > > > because I would have to compare each combination of tokens in the
> >> > > > tweet...
>
> >> > > > Can someone give more details?
>
> >> > > I don't mean to be flogging my site today, but take a look
> >> > > at http://fishtwits.com for the results I'm producing (just click the logo
> >> > > at the top of the page to view the full site without logging in):  Any
> >> > > tweets from users followed by FishTwits are scanned for fishing-related
> >> > > terms and all such terms found in the tweet are displayed below it.  At
> >> > > this moment, for instance, the first displayed tweet shows matches for
> >> > > both "Fly Fishing" and "Sole".
>
> >> > > This is accomplished with the following Perl code (edited to remove
> >> > > parts which aren't directly relevant):
>
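> >> > > # Note: $topic_regex and @topic_list are not declared in this excerpt;
> >> > > # they are assumed to be file-scoped variables from the omitted parts.
> >> > > sub load_from_text {
> >> > >   my ($class, $text) = @_;
>
> >> > >   # Build the combined regex once, on the first call, then reuse it.
> >> > >   unless($topic_regex) {
> >> > >     require Regexp::Assemble;
> >> > >     my $ra = Regexp::Assemble->new(
> >> > >                chomp => 0,
> >> > >                anchor_word_begin => 1,   # match only at word boundaries
> >> > >                anchor_word_end => 1,
> >> > >              );
> >> > >     for my $topic (@topic_list) {
> >> > >       $ra->add(lc $topic);
> >> > >     }
> >> > >     $topic_regex = $ra->re;
> >> > >   }
>
> >> > >   # Lowercase the tweet and collect every topic match.
> >> > >   $text = lc $text;
> >> > >   my @topics = $text =~ /$topic_regex/g;
>
> >> > >   return sort @topics;
>
> >> > > }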
>
> >> > > It first uses Regexp::Assemble to build a $topic_regex[1] which will
> >> > > match any of the words/phrases found in the topic table, then does a
> >> > > global match of $text (the body of the tweet being examined) against
> >> > > $topic_regex, capturing all matches into the array @topics, which is
> >> > > then sorted and returned to the caller.
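
For readers not using Perl, roughly the same idea can be sketched in Python with the standard `re` module. This is not a port of Regexp::Assemble: it builds a plain alternation rather than an optimized merged pattern, and it assumes every topic starts and ends with a word character so that `\b` anchors behave as expected. The function names are hypothetical.

```python
import re

def build_topic_regex(topics):
    """Compile one case-insensitive pattern matching any topic as a whole word or phrase."""
    # Longer topics go first so "fly fishing" wins over a shorter topic such as "fly".
    ordered = sorted({t.lower() for t in topics}, key=lambda t: (-len(t), t))
    alternation = "|".join(re.escape(t) for t in ordered)
    return re.compile(r"\b(?:" + alternation + r")\b", re.IGNORECASE)

def topics_in(text, topic_regex):
    """Return the sorted list of topics found in the text, lowercased."""
    return sorted(m.lower() for m in topic_regex.findall(text))
```

As in the Perl version, a topic that occurs twice in the text is returned twice. Wrap the result in `set()` if only distinct topics are wanted.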
>
> >> > > After the match is performed, @topics contains every search term which
> >> > > is matched, no matter how many there may be, which should fill your
> >> > > requirement for "combined searches", unless I'm misunderstanding it.
>
> >> > > If you mean you would want that "Fly Fishing", "Sole" tweet to return
> >> > > three hits rather than two ("Fly Fishing", "Sole", "Fly Fishing+Sole"),
> >> > > that's easy enough to create from @topics, just generate every
> >> > > permutation of the terms which the individual tweet matched.
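
A small Python sketch of that last step. It uses combinations rather than permutations, on the assumption that "Fly Fishing+Sole" and "Sole+Fly Fishing" should count as the same hit. The function name is made up for illustration.

```python
from itertools import combinations

def combined_hits(topics):
    """Return each matched topic plus every '+'-joined combination of two or more."""
    unique = sorted(set(topics))
    hits = list(unique)
    for size in range(2, len(unique) + 1):
        hits.extend("+".join(group) for group in combinations(unique, size))
    return hits
```

The number of hits grows as 2^n - 1 for n distinct matched topics, so this is only practical when a single tweet matches a handful of terms.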
>
> >> > > [1]  If you're only dealing with 10 or so keywords, you'd probably be
> >> > > just as well off building the regex by hand.  The main reason I'm using
> >> > > Regexp::Assemble to do it on the fly is because manually creating and
> >> > > then maintaining a regex that will efficiently match any of 1300 terms
> >> > > would be a nightmare.
>
> >> > > --
> >> > > Dave Sherohman
>
> --
>    ---Mark
>
> http://twitter.com/mccv
