That'd definitely be appreciated. This is where I ended up today (from
a ruby perspective) and it seems to be getting the job done.

def tokenize(tweet)
  tokens = tweet["text"].split(/\s+/).map{|w| w.downcase}
  tokens |= tokens.map{|x| x.gsub(/[^\w]+/, " ").strip.split(/\s+/)}.flatten
end
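To show where those tokens go next (the "efficient hash lookup" part), here's a quick sketch of matching them against a set of track keywords. The keyword set and sample tweet are made up for illustration; only the tokenizer itself is from above.

```ruby
require "set"

# Same tokenizer as above: whitespace split, then a second pass that
# strips punctuation so e.g. "omg!!" also yields a bare "omg" token.
def tokenize(tweet)
  tokens = tweet["text"].split(/\s+/).map { |w| w.downcase }
  tokens |= tokens.map { |x| x.gsub(/[^\w]+/, " ").strip.split(/\s+/) }.flatten
end

# Hypothetical track terms and tweet, just to exercise the lookup.
keywords = Set.new(["omg", "twitter"])
tweet    = { "text" => "OMG!! check http://twitter.com" }

# Set#include? gives O(1)-ish lookup per token.
matches = tokenize(tweet).select { |t| keywords.include?(t) }
# matches => ["omg", "twitter"]
```

Because the second pass splits URLs on punctuation, "http://twitter.com" contributes "http", "twitter", and "com" as separate tokens, which is consistent with the over-delivery behavior discussed below.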

On Jan 10, 4:31 pm, John Kalucki <j...@twitter.com> wrote:
> The broader track matching is indeed confusing. It errs on the side of
> over-delivery. The assumption is that there is post-processing on the client
> end to perform the precise filtering required. I've added a note to take
> another pass at the documentation and the filtering.
>
> -John Kalucki
> http://twitter.com/jkalucki
> Services, Twitter Inc.
>
> On Sun, Jan 10, 2010 at 12:17 PM, Damon C <d.lifehac...@gmail.com> wrote:
> > OK, so it looks like I misunderstood the docs, as it relates to the
> > punctuation.
>
> > I understood this:
> > "Terms are exact-matched, and also exact-matched ignoring
> > punctuation."
> > to mean that if I provide a keyword with punctuation, the punctuation
> > will be ignored when matching. Some testing reveals that is not the
> > case. If I provide "omg!!" as a keyword, it will exact-match omg's
> > with two exclamation marks. If I provide just "omg", it will match
> > omg's, as well as omg's with exclamation marks.
>
> > That said, I'm still confused by the fact that "twitter" will match
> > "http://twitter.com" when the docs say it won't. And I'm still
> > wondering what exactly Twitter defines as punctuation.
>
> > dpc
>
> > On Jan 10, 11:44 am, Damon C <d.lifehac...@gmail.com> wrote:
> > > This question is directed towards John, but happy to hear how other
> > > folks do it as well.
>
> > > I've got a couple questions regarding the tokenizing process on the
> > > streaming API. This would be remedied pretty easily with an example
> > > from Twitter as to their tokenizing process/regexp as I'm slightly
> > > confused what keywords would match. It would also be useful so I could
> > > know what tokens a specific status update will generate for an
> > > efficient hash lookup. i.e. if I do a simple split on [^\w]+ regexp,
> > > is that going to generate the correct set of tokens?
>
> > > In addition to the specifics of tokenizing, the docs state that the
> > > keyword Twitter will not match "twitter.com". In a quick bit of
> > > testing (with Delicious URLs, less traffic), it seems that the
> > > keyword "icio" will match http://icio.us/. I also currently have open
> > > streams just matching portions of a domain and it appears to be
> > > working. The docs make it seem as if punctuation is matched, but what
> > > is defined as punctuation (by Twitter)? And if the docs are correct,
> > > how would one match "twitter.com"?
>
> > > Confused in Seattle,
>
> > > Damon/@dacort
>