[twitter-dev] Re: Track streaming : how to match tweets?

2009-12-08 Thread Julien
Thanks Mark, but as I said, we need to fetch more complex feeds too. So
we'll use the OR with the simple query, and then query the search API
with the complex query to see if a given tweet matches what we need!

Julien


[twitter-dev] Re: Track streaming : how to match tweets?

2009-12-07 Thread Julien
Hum... ok... sad, but I have an idea. Please tell me if this is
stupid.

So, for each tweet I receive, I know what searches it _may_ match.
Right?
So, with all these candidate queries, what I can do is run them
against the regular search API (as long as they're complex). If the
results include the tweet, then I know that the search
matches and I don't have to build anything on top of what you built.

Let's take an example:
- If I have a search for starbucks AND free near:94123
- I track starbucks with the streaming API
- Whenever you guys send me a tweet for this track
- I check internally all the queries that may match starbucks
- I perform them on your API
- if the tweet you sent me is in the results, then I know this tweet
is valid,
- if not, I discard it.
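
The steps above can be sketched roughly as follows. This is a minimal Python illustration, not a real client: `verify_tweet`, the tweet dictionary shape, and `search_fn` (a stand-in for whatever search API client is used) are all hypothetical names. It also assumes the tweet is already searchable at the moment the stream delivers it, which the thread does not confirm.

```python
def verify_tweet(tweet, queries_by_keyword, search_fn):
    """Return the complex queries that the search results confirm for a tweet.

    tweet: dict with "id" and "text" keys (assumed shape).
    queries_by_keyword: maps a tracked keyword to the full queries using it.
    search_fn: hypothetical callable, query string -> iterable of tweet ids.
    """
    text = tweet["text"].lower()
    confirmed = []
    for keyword, queries in queries_by_keyword.items():
        if keyword.lower() not in text:
            continue  # this keyword is not in the tweet, skip its queries
        for query in queries:
            # The tweet is valid for a query only if the search returns it.
            if tweet["id"] in set(search_fn(query)):
                confirmed.append(query)
    return confirmed


def fake_search(query):
    # Stand-in for a real search call, used only to exercise the sketch.
    return [101] if query == "starbucks AND free near:94123" else []


queries = {"starbucks": ["starbucks AND free near:94123", "starbucks AND tea"]}
tweet = {"id": 101, "text": "Free Starbucks in the Marina"}
print(verify_tweet(tweet, queries, fake_search))
```

Each candidate query costs one search request per delivered tweet, so the number of candidates per keyword is what drives usage against the hourly limit.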

My only concern here is the 20k/hour limit. I think this is still
doable, because
1) we will only make queries to the search API when we receive
notifications
2) we will only make queries to the search API for complex queries
(i.e. AND, +, or near:)
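
For scale, 20,000 requests per hour averages out to roughly 5.5 requests per second. A client could guard that budget with a counter like the one below. This is only a sketch: the `HourlyBudget` class is hypothetical, and treating the limit as a sliding one-hour window is an assumption, since the thread does not say how the limit is enforced.

```python
import time
from collections import deque


class HourlyBudget:
    """Sliding-window counter: at most `limit` calls per `window` seconds."""

    def __init__(self, limit=20000, window=3600.0, clock=time.monotonic):
        self.limit = limit
        self.window = window
        self.clock = clock
        self.calls = deque()  # timestamps of calls still inside the window

    def try_acquire(self):
        """Record a call and return True, or return False if over budget."""
        now = self.clock()
        # Drop timestamps that have aged out of the window.
        while self.calls and now - self.calls[0] >= self.window:
            self.calls.popleft()
        if len(self.calls) >= self.limit:
            return False
        self.calls.append(now)
        return True


# Average rate the limit allows, in requests per second.
print(round(20000 / 3600, 2))
```

A search request would be sent only when `try_acquire()` returns True; what to do with tweets that arrive while the budget is exhausted (queue or drop) is left open here.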

The pros:
- whenever you guys change/add stuff to your search DSL, I don't have to
change anything on my side.

How does that sound?

Thanks John anyway for your great help!

Julien


On Dec 5, 3:32 pm, John Kalucki j...@twitter.com wrote:
 This could only make sense if the Streaming API supported search engine
 logic. Currently Streaming only supports keyword matching -- you have to
 post-process to add additional predicate operators beyond OR. You can
 reproduce the keyword match in a few lines of code, and the rest is
 (currently) all up to you anyway. Just remember that a given tweet could
 have triggered multiple predicates.
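
As an illustration of how short the client-side keyword match can be, here is a minimal Python sketch. It is not Twitter's matching logic: the tokenization (lower-casing and splitting on non-word characters) and the name `matched_terms` are assumptions. It returns every tracked term found, since one tweet may have triggered several predicates.

```python
import re


def matched_terms(tweet_text, tracked_terms):
    """Return every tracked term that appears as a whole word in the tweet."""
    # Assumed tokenization: lower-case, then keep runs of word characters.
    tokens = set(re.findall(r"\w+", tweet_text.lower()))
    return sorted(term for term in tracked_terms if term.lower() in tokens)


print(matched_terms("Free coffee at Starbucks today!",
                    ["starbucks", "free", "paris"]))
# prints ['free', 'starbucks']
```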

 Beyond being a low priority feature, rendering and delivering custom
 responses per user would be a performance risk. We currently can support a
 very large number of filter clients per server, and we want to preserve this
 performance.

 -John Kalucki
 http://twitter.com/jkalucki
 Services, Twitter Inc.




Re: [twitter-dev] Re: Track streaming : how to match tweets?

2009-12-07 Thread Mark McBride
Note that search API whitelisting is different from regular API
whitelisting, and getting a 20k/hour limit there is much more
restrictive.

I still haven't seen a case where you couldn't do the matching on your
side.  As John says, with the streaming API right now you can only
match simple terms, so the complex terms aren't a factor.  In fact the
track you posted won't actually function as you intend with the
streaming API.  You could track for tweets containing starbucks or
free.  But currently that's it.  starbucks AND free is something
you'd have to implement on your side.  Same with near.
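
A client-side AND along those lines could look like the sketch below: track starbucks and free on the stream (which yields the OR), then keep only tweets containing both. This is a hedged Python illustration; `tokens_of` and `matches_all` are hypothetical helpers, the tokenization is an assumption, and near: is not covered because it would need location data.

```python
import re


def tokens_of(text):
    """Assumed tokenization: lower-case runs of word characters."""
    return set(re.findall(r"\w+", text.lower()))


def matches_all(tweet_text, required_terms):
    """True only if every required term is present (logical AND)."""
    required = {term.lower() for term in required_terms}
    return required <= tokens_of(tweet_text)


print(matches_all("Free latte at Starbucks", ["starbucks", "free"]))  # True
print(matches_all("Starbucks is crowded", ["starbucks", "free"]))     # False
```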


[twitter-dev] Re: Track streaming : how to match tweets?

2009-12-05 Thread Julien
Thanks Dave,

I think I get it from your example... yet, in our case, we have
several thousand keywords, and many, many complex searches (with
filter:, and, or, near: ... and so on).

I keep thinking that instead of re-implementing on my side the search
engine logic that Twitter has, it would be simpler for them to also
send the matching keywords. An even more elegant solution (yet
slightly more complex) would be to be able to pass parameters along
with the search I give, such as a unique search_id (that I can store
on my side) and then, instead of giving me the matched keywords/search
terms, they could just give me back that search_id. That would be
something like this:

Right now it is :
POST  http://stream.twitter.com/1/statuses/filter.json
track=paris,twitter+superfeedr,julien near:france

It would be awesome if I could do :
POST  http://stream.twitter.com/1/statuses/filter.json
track={paris:my_search_1,twitter+superfeedr:my_search_2,julien near:france:my_search_3}

And then, upon notifications, they would just pass me this search key
my_search_xx

I know and understand that this implies a little bit of work for Twitter,
but it also removes the pain from each subscriber to this streaming
API who has to re-implement again and again the search engine from
Twitter.
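
Since the search_id echo is only a proposal, the same bookkeeping can be approximated on the client. The Python sketch below is hypothetical throughout: the `SEARCHES` registry, the function name, and the reading of `+` as "all terms must appear" are assumptions, and the near:france part of the third search is not evaluated.

```python
import re

# Hypothetical registry: search_id -> terms that must all appear in a tweet.
SEARCHES = {
    "my_search_1": ["paris"],
    "my_search_2": ["twitter", "superfeedr"],
    "my_search_3": ["julien"],  # near:france is ignored in this sketch
}


def matching_search_ids(tweet_text, searches=SEARCHES):
    """Return the ids of all registered searches satisfied by the tweet."""
    tokens = set(re.findall(r"\w+", tweet_text.lower()))
    return sorted(
        search_id
        for search_id, terms in searches.items()
        if all(term in tokens for term in terms)
    )


print(matching_search_ids("Superfeedr now tracks Twitter from Paris"))
# prints ['my_search_1', 'my_search_2']
```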







Re: [twitter-dev] Re: Track streaming : how to match tweets?

2009-12-04 Thread Dave Sherohman
On Thu, Dec 03, 2009 at 03:12:05PM -0800, Julien wrote:
 Well, then I'd need some help with that...
 
 Again, it's easy with single search keywords, but I haven't found a
 solution for combined searches like twitter+stream or photo+Paris...
 because I would have to compare each combination of tokens in the
 tweet...
 
 Can someone give more details.

I don't mean to be flogging my site today, but take a look at
http://fishtwits.com for the results I'm producing (just click the logo
at the top of the page to view the full site without logging in):  Any
tweets from users followed by FishTwits are scanned for fishing-related
terms and all such terms found in the tweet are displayed below it.  At
this moment, for instance, the first displayed tweet shows matches for
both Fly Fishing and Sole.

This is accomplished with the following Perl code (edited to remove
parts which aren't directly relevant):

sub load_from_text {
  my ($class, $text) = @_;

  # Build the combined regex once and cache it in $topic_regex.
  unless ($topic_regex) {
    require Regexp::Assemble;
    my $ra = Regexp::Assemble->new(
               chomp => 0,
               anchor_word_begin => 1,
               anchor_word_end => 1,
             );
    for my $topic (@topic_list) {
      $ra->add(lc $topic);
    }
    $topic_regex = $ra->re;
  }

  # Lower-case the tweet and collect every topic that matches.
  $text = lc $text;
  my @topics = $text =~ /$topic_regex/g;

  return sort @topics;
}

It first uses Regexp::Assemble to build a $topic_regex[1] which will
match any of the words/phrases found in the topic table, then does a
global match of $text (the body of the tweet being examined) against
$topic_regex, capturing all matches into the array @topics, which is
then sorted and returned to the caller.
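
For readers not using Perl, roughly the same idea can be sketched in Python. This is an approximation rather than a port: it builds a plain alternation with word boundaries and does not attempt the pattern merging that Regexp::Assemble performs, and `build_topic_regex` and the sample topics are made up for illustration.

```python
import re


def build_topic_regex(topics):
    """Compile one alternation matching any topic as a whole word or phrase."""
    # Longest first, so a phrase wins over a shorter topic it starts with.
    ordered = sorted((t.lower() for t in topics), key=len, reverse=True)
    pattern = r"\b(?:" + "|".join(re.escape(t) for t in ordered) + r")\b"
    return re.compile(pattern)


def load_from_text(text, topic_regex):
    """Return the sorted list of topics found in the tweet text."""
    return sorted(topic_regex.findall(text.lower()))


topic_regex = build_topic_regex(["Fly Fishing", "Sole", "Trout"])
print(load_from_text("Fly fishing for sole this weekend", topic_regex))
# prints ['fly fishing', 'sole']
```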

After the match is performed, @topics contains every search term which
is matched, no matter how many there may be, which should fill your
requirement for combined searches, unless I'm misunderstanding it.

If you mean you would want that Fly Fishing, Sole tweet to return
three hits rather than two (Fly Fishing, Sole, Fly Fishing+Sole),
that's easy enough to create from @topics, just generate every
permutation of the terms which the individual tweet matched.
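
Expanding the matched terms this way can be sketched as below. Order does not matter for a hit like Fly Fishing+Sole, so the sketch uses combinations rather than permutations in the strict sense. It is a Python illustration with a hypothetical function name, not part of the FishTwits code.

```python
from itertools import combinations


def combined_hits(topics):
    """Expand matched topics into every non-empty combination, joined by '+'."""
    ordered = sorted(topics)
    hits = []
    for size in range(1, len(ordered) + 1):
        for combo in combinations(ordered, size):
            hits.append("+".join(combo))
    return hits


print(combined_hits(["Sole", "Fly Fishing"]))
# prints ['Fly Fishing', 'Sole', 'Fly Fishing+Sole']
```

The number of hits is 2^n - 1 for n matched terms, so with many matches in one tweet it would be cheaper to test only the combined searches that are actually registered.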


[1]  If you're only dealing with 10 or so keywords, you'd probably be
just as well off building the regex by hand.  The main reason I'm using
Regexp::Assemble to do it on the fly is because manually creating and
then maintaining a regex that will efficiently match any of 1300 terms
would be a nightmare.

-- 
Dave Sherohman


[twitter-dev] Re: Track streaming : how to match tweets?

2009-12-03 Thread Julien
Well, then I'd need some help with that...

Again, it's easy with single search keywords, but I haven't found a
solution for combined searches like twitter+stream or photo+Paris...
because I would have to compare each combination of tokens in the
tweet...

Can someone give more details.

I am not sure why I'd still need to match the keywords on my side
either... if you can tell me which ones it matches.

Thanks,



On Dec 3, 9:05 am, Dave Sherohman d...@fishtwits.com wrote:
 On Wed, Dec 02, 2009 at 03:15:21PM -0800, Julien wrote:
  If I get a tweet, the only way to know what keyword it matches is to
  compare all of its words to the words I'm tracking... (maybe there is
  something easier).

  That's quite hard but it becomes harder if I add operators. Say I
  have a search romeo+juliet. When I get a tweet, I need to compare it
  to all the keywords, plus all the combinations :/ Technically that is
  not even doable if I have more than 10 keywords, since there are a LOT
  of combinations possible.

 You are mistaken.  Provided you have appropriate support from your
 language or its libraries, accomplishing this is trivial.  Using Perl
 and Regexp::Assemble, FishTwits is currently tracking 1,358 words/
 phrases and, for each tweet, building a list of which words/phrases
 appear in that tweet.  It's very doable (quick, even), despite having
 far more than 10 keywords involved.

 --
 Dave Sherohman