It's great to hear that someone implemented all this. There's a similar
technique documented here:
http://dev.twitter.com/pages/streaming_api_concepts, under "By Language and
Country." My suggestion was to start with a list of stop words to build your
user corpus -- but I don't know how well Farsi works with track, so the
random sample method might indeed be better.
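
For what it's worth, here's a minimal sketch of that bootstrap step against
statuses/filter with basic auth -- the stop-word list and credentials are
placeholders, and it assumes the Python 'requests' library:

import json
import requests

# placeholder list of common Farsi function words ("and", "in", "to", ...)
FARSI_STOP_WORDS = ["و", "در", "به", "از", "که"]

resp = requests.post(
    "http://stream.twitter.com/1/statuses/filter.json",
    data={"track": ",".join(FARSI_STOP_WORDS)},
    auth=("screen_name", "password"),  # placeholder credentials
    stream=True,
)

for line in resp.iter_lines():
    if not line:
        continue  # skip keep-alive newlines
    status = json.loads(line)
    # collect the author as a candidate for the user corpus
    print(status["user"]["screen_name"])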

-John Kalucki
http://twitter.com/jkalucki
Infrastructure, Twitter Inc.

2010/7/3 Pascal Jürgens <lists.pascal.juerg...@googlemail.com>

> Hi Lucas,
>
> As someone who has approached a similar problem, my recommendation would be
> to track users.  In order to get results quickly (rather than every few
> hours via user timeline calls), you need streaming access, which is a bit
> more complicated. I implemented such a system to track the German-speaking
> population of Twitter users, and it works extremely well.
>
> 1) get access to the sample stream (5% or 15% type) (warning: the 15%
> stream is ~10GB+ a day)
>
> 2) construct an efficient cascading language filter (see the sketch after
> this list), i.e.:
> - first test the computationally cheap AND precise attributes, such as a
> list of known Farsi-only keywords or the location box
> - if those attribute tests are negative, perform the more computationally
> expensive tests
> - if in doubt, count it as non-Farsi! False positives will kill you if you
> sample a very small population!
>
> 3) With said filter, identify the accounts using Farsi
>
> 4) Perform a first-degree network sweep and scan all their
> friends + followers, since those have a higher likelihood of speaking
> Farsi as well (a sketch of this sweep follows the list)
>
> 5) compile a list of those known users
>
> 6) track those users with the shadow role stream (80,000 users) or higher.
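>
> A rough sketch of the first-degree sweep in step 4, assuming the v1 REST
> endpoints friends/ids and followers/ids (both rate-limited, so pace your
> calls; first_degree_sweep is just an illustrative name):
>
> import requests
>
> def first_degree_sweep(user_id):
>     """Return friend and follower ids of one known-Farsi account."""
>     candidates = set()
>     for relation in ("friends", "followers"):
>         r = requests.get(
>             "http://api.twitter.com/1/%s/ids.json" % relation,
>             params={"user_id": user_id, "cursor": -1},
>         )
>         # accounts with more than one page of ids would need
>         # next_cursor pagination; omitted here for brevity
>         candidates.update(r.json()["ids"])
>     return candidates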
>
> If your language detection code is not efficient enough, you might want to
> include a cheap, fast, and precise negative filter of known non-Farsi
> attributes. Test that one before all the others and you should be able to
> filter out a large part of the volume.
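>
> As a rough illustration (not production code), the cascade from step 2 plus
> that negative test might look like this; FARSI_KEYWORDS and detect_language
> are placeholders for your own keyword list and language detection library:
>
> FARSI_KEYWORDS = []  # fill with known Farsi-only terms
>
> def detect_language(text):
>     return "fa"  # stub: swap in a real language detection library
>
> def looks_farsi(status):
>     """Cascading language filter: cheapest, most precise tests first."""
>     text = status["text"]
>     # 1) cheap negative test: a tweet with no Arabic-script code points
>     #    cannot be Farsi, so drop it immediately
>     if not any(u"\u0600" <= ch <= u"\u06ff" for ch in text):
>         return False
>     # 2) cheap positive test: known Farsi-only keywords
>     if any(kw in text for kw in FARSI_KEYWORDS):
>         return True
>     # 3) expensive test: full statistical language detection
>     if detect_language(text) == "fa":
>         return True
>     # 4) when in doubt, count it as non-Farsi: false positives hurt
>     #    more than false negatives with such a small target population
>     return False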
>
>
> Don't hesitate to ask for any clarification!
>
> Pascal Juergens
> Graduate Student / Mass Communication
> U of Mainz, Germany
>
> On Jul 3, 2010, at 0:36, Lucas Vickers wrote:
>
> > Hello,
> >
> > I am trying to create an app that will show tweets and trends in
> > Farsi, for native speakers.  I would like to somehow get a sample
> > 'garden hose' of Farsi-language tweets, but I am unable to come up with
> > an elegant solution.
> >
> > I see the following options:
> >
> > - Sample all tweets, and run a language detection algorithm on each
> > tweet to determine which are/could be Farsi.
> >  * Problem: only a very small % of tweets will be in Farsi
> >
> > - Use the location filter to try and sample tweets from countries that
> > are known to speak Farsi, and then run a language detection algorithm
> > on the tweets.
> >  * Problem: I seem to be limited in the size of the coordinate box I
> > can provide.  I cannot even cover all of Iran, for example.
> >
> > - Filter on a standard Farsi term.
> >  * Problem: will limit my results to only tweets containing this term
> >
> > - Search for language = Farsi.
> >   * Problem: Not a stream; I will need to keep searching.
> >
> > I think, of the given options I mentioned, what makes the most sense is
> > to search for tweets where language=farsi, and use since_id to
> > keep my results new.  Given this method, I have three questions:
> > 1 - since_id, I imagine, is the highest tweet_id from the previous
> > result set?
> > 2 - How often can I search (given API limits, of course) in order to
> > ensure I get new data?
> > 3 - Will the language filter provide me with users whose default
> > language is Farsi, or will it actually find tweets in Farsi?
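> >
> > A rough sketch of what I have in mind (q is a required parameter, so I
> > use a placeholder term; the endpoint and field names are as the JSON
> > search interface returns them today, as far as I can tell):
> >
> > import time
> > import requests
> >
> > since_id = None
> > while True:
> >     params = {"q": "و", "lang": "fa", "rpp": 100}  # placeholder query
> >     if since_id:
> >         params["since_id"] = since_id
> >     data = requests.get("http://search.twitter.com/search.json",
> >                         params=params).json()
> >     for tweet in data["results"]:
> >         print(tweet["text"])  # stand-in for my own handling
> >     # max_id in the response is the highest tweet id in the result set
> >     since_id = data.get("max_id") or since_id
> >     time.sleep(30)  # stay well inside the search rate limit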
> >
> > I am aware that the user can select their native language in the user
> > profile, but I also know this is not 100% reliable.
> >
> > Can anyone think of a more elegant solution?
> > Are there any hidden/experimental language type filters available to
> > us?
> >
> > Thanks!
> > Lucas
>
>
