It's great to hear that someone implemented all this. There's a similar technique documented here: http://dev.twitter.com/pages/streaming_api_concepts, under "By Language and Country". My suggestion was to start with a list of stop words to build your user corpus, but I don't know how well Farsi works with track, so the random-sample method might indeed be better.
-John Kalucki
http://twitter.com/jkalucki
Infrastructure, Twitter Inc.

2010/7/3 Pascal Jürgens <lists.pascal.juerg...@googlemail.com>

> Hi Lucas,
>
> as someone who approached a similar problem, my recommendation would be to
> track users. In order to get results quickly (rather than every few hours
> via user timeline calls), you need streaming access, which is a bit more
> complicated. I implemented such a system in order to track the
> German-speaking population of Twitter users, and it works extremely well.
>
> 1) Get access to the sample stream (the 5% or 15% type). Warning: the 15%
> stream is ~10 GB+ a day.
>
> 2) Construct an efficient cascading language filter, i.e.:
> - First test the computationally cheap AND precise attributes, such as a
> list of known Farsi-only keywords or the location box.
> - If those attribute tests are negative, perform more computationally
> expensive tests.
> - If in doubt, count it as non-Farsi! False positives will kill you if you
> sample a very small population!
>
> 3) With said filter, identify the accounts using Farsi.
>
> 4) Perform a first-degree network sweep and scan all their
> friends + followers, since those have a higher likelihood of speaking Farsi
> as well.
>
> 5) Compile a list of those known users.
>
> 6) Track those users with the shadow role stream (80,000 users) or higher.
>
> If your language-detection code is not efficient enough, you might want to
> include a cheap, fast, and precise negative filter of known non-Farsi
> attributes. Test that one before all the others and you should be able to
> filter out a large part of the volume.
>
> Don't hesitate to ask for any clarification!
>
> Pascal Juergens
> Graduate Student / Mass Communication
> U of Mainz, Germany
>
> On Jul 3, 2010, at 0:36, Lucas Vickers wrote:
>
> > Hello,
> >
> > I am trying to create an app that will show tweets and trends in
> > Farsi, for native speakers.
> > I would like to somehow get a sample
> > 'garden hose' of Farsi-based tweets, but I am unable to come up with
> > an elegant solution.
> >
> > I see the following options:
> >
> > - Sample all tweets, and run a language-detection algorithm on each
> > tweet to determine which are/could be Farsi.
> >   * Problem: only a very, very small % of the tweets will be in Farsi.
> >
> > - Use the location filter to try to sample tweets from countries that
> > are known to speak Farsi, and then run a language-detection algorithm
> > on the tweets.
> >   * Problem: I seem to be limited on the size of the coordinate box I
> > can provide. I cannot even cover all of Iran, for example.
> >
> > - Filter on a standard Farsi term.
> >   * Problem: will limit my results to only tweets with this term.
> >
> > - Search for language = Farsi.
> >   * Problem: not a stream; I will need to keep searching.
> >
> > I think, of the options I mentioned, what makes the most sense is
> > to search for tweets where language=farsi, and use the since_id to
> > keep my results new. Given this method, I have three questions:
> > 1 - since_id, I imagine, is the highest tweet_id from the previous
> > result set?
> > 2 - How often can I search (given API limits, of course) in order to
> > ensure I get new data?
> > 3 - Will the language filter provide me with users whose default
> > language is Farsi, or will it actually find tweets in Farsi?
> >
> > I am aware that the user can select their native language in the user
> > profile, but I also know this is not 100% reliable.
> >
> > Can anyone think of a more elegant solution?
> > Are there any hidden/experimental language-type filters available to
> > us?
> >
> > Thanks!
> > Lucas
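For reference, Pascal's cascading filter (step 2 above) might look roughly like this in Python. This is only a sketch: the keyword list, the bounding box, and the script-ratio fallback are illustrative placeholders, and the tweet is assumed to be a simple dict with "text" and an optional (lon, lat) "coordinates" pair, not the full API payload:

```python
# Sketch of a cascading language filter: cheap, precise tests first,
# expensive tests only when those fail, and non-Farsi when in doubt.
# Keyword list and bounding box below are illustrative placeholders.

FARSI_KEYWORDS = {"سلام", "ایران"}        # known Farsi terms (placeholder list)
IRAN_BBOX = (44.0, 25.0, 63.3, 39.8)      # rough (W, S, E, N) lon/lat box
ARABIC_BLOCK = range(0x0600, 0x0700)      # Unicode block covering Farsi script


def cheap_tests(tweet):
    """Computationally cheap AND precise positive tests."""
    text = tweet.get("text", "")
    if any(kw in text for kw in FARSI_KEYWORDS):
        return True
    coords = tweet.get("coordinates")
    if coords:
        lon, lat = coords
        w, s, e, n = IRAN_BBOX
        if w <= lon <= e and s <= lat <= n:
            return True
    return False


def expensive_test(tweet):
    """Fallback: fraction of characters in the Arabic/Farsi script block.
    A real system would run a proper language detector here instead."""
    text = tweet.get("text", "")
    if not text:
        return False
    hits = sum(1 for ch in text if ord(ch) in ARABIC_BLOCK)
    return hits / len(text) > 0.5


def is_farsi(tweet):
    if cheap_tests(tweet):
        return True
    if expensive_test(tweet):
        return True
    return False  # when in doubt, count it as non-Farsi
```

Pascal's negative pre-filter would slot in as one more cheap test run before everything else, returning False early for known non-Farsi attributes.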
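The first-degree network sweep (steps 3-5) is just a one-level expansion of the seed set. A minimal sketch, where get_friend_ids and get_follower_ids are hypothetical stand-ins for the social-graph API calls:

```python
# One-level network sweep: take the accounts already identified as Farsi
# and add their friends and followers as candidates for the same language
# filter. get_friend_ids / get_follower_ids are hypothetical helpers.

def first_degree_sweep(seed_users, get_friend_ids, get_follower_ids):
    """Return the seed users plus all their first-degree neighbors."""
    candidates = set(seed_users)
    for user_id in seed_users:
        candidates.update(get_friend_ids(user_id))
        candidates.update(get_follower_ids(user_id))
    return candidates
```

The resulting candidate set would then be run through the same cascading filter before being added to the tracked-user list (step 6).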
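On Lucas's question 1: yes, the usual pattern is to carry forward the highest id seen so far and pass it as since_id on the next search call, so each request returns only tweets newer than the last batch. A sketch of that loop, with fetch_page standing in for the actual Search API call (q=..., lang=fa, since_id=...):

```python
# since_id polling pattern: remember the highest tweet id from the previous
# result set and pass it as since_id on the next request. fetch_page is a
# placeholder for the real Search API call.

def poll_new_tweets(fetch_page, since_id=0):
    """Fetch one round of results newer than since_id.
    Returns (new_tweets, new_since_id)."""
    results = fetch_page(since_id)
    if results:
        # since_id for the next round is the highest id in this batch
        since_id = max(t["id"] for t in results)
    return results, since_id
```

How often to call it is then just a matter of spacing the rounds to stay under the search rate limit; if a batch comes back empty, since_id is left unchanged for the next round.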