Hi Lucas,

As someone who has approached a similar problem, my recommendation would be to 
track users.  In order to get results quickly (rather than every few hours via 
user timeline calls), you need streaming access, which is a bit more 
complicated. I implemented such a system in order to track the German-speaking 
population of Twitter users, and it works extremely well.

1) Get access to the sample stream (the 5% or 15% type). (Warning: the 15% 
stream is ~10 GB+ a day.)

2) Construct an efficient cascading language filter, i.e.:
- First test the computationally cheap AND precise attributes, such as a list 
of known Farsi-only keywords or the location box.
- If those attribute tests are negative, perform the more computationally 
expensive tests.
- If in doubt, count the tweet as non-Farsi! False positives will kill you if 
you sample a very small population!

3) With said filter, identify the accounts posting in Farsi.

4) Perform a first-degree network sweep and scan all of their friends and 
followers, since those have a higher likelihood of speaking Farsi as well.

5) Compile a list of those known users.

6) Track those users with the shadow role stream (80,000 users) or higher.
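Steps 1, 3 and 5 boil down to scanning the newline-delimited JSON statuses from the sample stream and collecting the IDs of accounts whose tweets pass the filter. A minimal offline sketch (the `is_farsi` callback is a placeholder for whatever filter you build; in production, `lines` would come from the streaming HTTP connection rather than a list):

```python
# Sketch: scan newline-delimited JSON statuses and collect the user IDs
# of accounts whose tweets pass a language filter (steps 1, 3 and 5).
# `is_farsi` is a placeholder for your cascading filter.

import json

def collect_farsi_users(lines, is_farsi):
    """Return the set of user IDs whose tweets passed the filter."""
    users = set()
    for line in lines:
        line = line.strip()
        if not line:          # the stream sends keep-alive blank lines
            continue
        try:
            status = json.loads(line)
        except ValueError:
            continue          # skip truncated or garbled records
        text = status.get('text')
        user = status.get('user', {})
        if text and is_farsi(text):
            users.add(user.get('id'))
    return users
```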

If your language detection code is not efficient enough, you might want to 
include a cheap, fast, and precise negative filter of known non-Farsi 
attributes. Test that one before all the others and you should be able to 
filter out a large part of the volume.
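Such a cascade might look like the sketch below. Everything in it is illustrative: the script ranges, the Farsi-only letters, the bounding box, and the `expensive_detect` callback are assumptions standing in for whatever tests you actually build, not a vetted classifier.

```python
# Cascading language filter: run the cheapest, most precise tests first,
# fall through to expensive detection only for the ambiguous remainder,
# and default to non-Farsi when in doubt.

import re

# Cheap negative filter: scripts that rule Farsi out immediately
# (Cyrillic, Japanese kana, CJK ideographs -- illustrative, not exhaustive).
NON_FARSI_SCRIPT = re.compile(r'[\u0400-\u04FF\u3040-\u30FF\u4E00-\u9FFF]')

# Cheap positive tests: the Arabic script block, plus letters used in
# Farsi but not in Arabic.
ARABIC_SCRIPT = re.compile(r'[\u0600-\u06FF]')
FARSI_ONLY_CHARS = set('\u067e\u0686\u0698\u06af')  # pe, che, zhe, gaf

IRAN_BBOX = (44.0, 25.0, 63.3, 39.8)  # rough lon/lat box, illustrative only

def looks_farsi(text, lon=None, lat=None, expensive_detect=None):
    """Return True only when reasonably sure; default to non-Farsi."""
    # 1) Cheap negative test: a foreign script means definitely not Farsi.
    if NON_FARSI_SCRIPT.search(text):
        return False
    # 2) Cheap positive test: Farsi-specific letters.
    if any(c in FARSI_ONLY_CHARS for c in text):
        return True
    if ARABIC_SCRIPT.search(text):
        # Arabic script but no Farsi-specific letter: still ambiguous.
        if (lon is not None and lat is not None and
                IRAN_BBOX[0] <= lon <= IRAN_BBOX[2] and
                IRAN_BBOX[1] <= lat <= IRAN_BBOX[3]):
            return True
        # 3) Expensive test only for the ambiguous remainder.
        if expensive_detect is not None:
            return expensive_detect(text) == 'fa'
    # 4) If in doubt, count it as non-Farsi.
    return False
```

The key design point is the ordering: the overwhelming majority of the sample-stream volume should be rejected by the first one or two string tests, so the expensive detector only ever sees a tiny, pre-filtered fraction.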


Don't hesitate to ask for any clarification!

Pascal Juergens
Graduate Student / Mass Communication
U of Mainz, Germany

On Jul 3, 2010, at 0:36, Lucas Vickers wrote:

> Hello,
> 
> I am trying to create an app that will show tweets and trends in
> Farsi, for native speakers.  I would like to somehow get a sample
> 'garden hose' of Farsi-based tweets, but I am unable to come up with
> an elegant solution.
> 
> I see the following options:
> 
> - Sample all tweets, and run a language detection algorithm on each
> tweet to determine which are (or could be) Farsi.
>  * Problem: only a very small % of the tweets will be in Farsi
> 
> - Use the location filter to try and sample tweets from countries that
> are known to speak Farsi, and then run a language detection algorithm
> on the tweets.
>  * Problem: I seem to be limited on the size of the coordinate box I
> can provide.  I cannot even cover all of Iran, for example.
> 
> - Filter on a standard Farsi term.
>  * Problem: will limit my results to only tweets containing this term
> 
> - Search for language = Farsi.
>   * Problem: Not a stream; I will need to keep searching.
> 
> I think that, of the options I mentioned, what makes the most sense is
> to search for tweets where language=farsi, and use the since_id to
> keep my results new.  Given this method, I have three questions:
> 1 - Is since_id the highest tweet_id from the previous
> result set?
> 2 - How often can I search (given API limits, of course) in order to
> ensure I get new data?
> 3 - Will the language filter provide me with users whose default
> language is Farsi, or will it actually find tweets in Farsi?
> 
> I am aware that the user can select their native language in the user
> profile, but I also know this is not 100% reliable.
> 
> Can anyone think of a more elegant solution?
> Are there any hidden/experimental language type filters available to
> us?
> 
> Thanks!
> Lucas
