We have implemented the Turkish version: Twitturk
http://twitturk.com/home/lang/en

We skipped the first three steps: we started with a few Turkish users,
crawled their network, and for each new user tested whether the description
or latest tweets were in Turkish.
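The crawl can be sketched as a breadth-first expansion over the follower graph (a minimal sketch, not our actual code; `fetch_friends` and `looks_turkish` are placeholders for a real Twitter API call and a language check on the profile/tweets):

```python
from collections import deque

def crawl_turkish_users(seed_ids, fetch_friends, looks_turkish, limit=100_000):
    """Breadth-first crawl: start from seed users, expand through their
    friends/followers, and keep only accounts that look Turkish."""
    found, seen = set(), set(seed_ids)
    queue = deque(seed_ids)
    while queue and len(found) < limit:
        uid = queue.popleft()
        if not looks_turkish(uid):  # check description or latest tweets
            continue
        found.add(uid)
        for fid in fetch_friends(uid):  # only expand confirmed Turkish users
            if fid not in seen:
                seen.add(fid)
                queue.append(fid)
    return found
```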

We have identified almost 100,000 Turkish users so far.

Using the streaming API we collect their tweets and find the popular people,
keywords, and top tweets (the most retweeted ones) among Turkish users.
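The popularity statistics amount to a counting pass over the collected tweets. A sketch over hypothetical tweet dicts (the field names here are illustrative, not the real API payload):

```python
from collections import Counter

def popularity_stats(tweets, top_n=3):
    """Count keyword and author frequencies, and rank tweets by retweets."""
    keywords = Counter(w for t in tweets for w in t["text"].lower().split())
    people = Counter(t["user"] for t in tweets)
    top_tweets = sorted(tweets, key=lambda t: t["retweets"], reverse=True)[:top_n]
    return keywords, people, top_tweets
```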


2010/7/3 Pascal Jürgens <lists.pascal.juerg...@googlemail.com>

> Hi Lucas,
>
> as someone who approached a similar problem, my recommendation would be to
> track users. In order to get results quickly (rather than every few hours
> via user timeline calls), you need streaming access, which is a bit more
> complicated. I implemented such a system in order to track the
> German-speaking population of Twitter users, and it works extremely well.
>
> 1) get access to the sample stream (5% or 15% type) (warning: the 15%
> stream is ~10GB+ a day)
>
> 2) construct an efficient cascading language filter, i.e.:
> - first test the computationally cheap AND precise attributes, such as a
> list of known Farsi-only keywords or the location box
> - if those attribute tests are negative, perform more computationally
> expensive tests
> - if in doubt, count it as non-Farsi! False positives will kill you if you
> sample a very small population!
>
> 3) With said filter, identify the accounts using Farsi
>
> 4) Perform a first-degree network sweep and scan all their
> friends+followers, since those have a higher likelihood of speaking Farsi as
> well
>
> 5) compile a list of those known users
>
> 6) track those users with the shadow role stream (80,000 users) or higher.
>
> If your language detection code is not efficient enough, you might want to
> include a cheap, fast, and precise negative filter of known non-Farsi
> attributes. Test that one before all the others and you should be able to
> filter out a large part of the volume.
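Pascal's cascade (cheap negative filter first, cheap positive keywords next, expensive classifier last, and "non-Farsi" whenever in doubt) could be structured like this. A sketch only: the keyword and marker sets are illustrative examples, and `detect_language` stands in for whatever expensive classifier you supply:

```python
FARSI_KEYWORDS = {"سلام", "ایران"}          # known Farsi-only tokens (examples)
NON_FARSI_MARKERS = {"the", "and", "lol"}   # cheap negative markers (examples)

def is_farsi(tweet, detect_language=None):
    """Cascading filter: cheap precise tests first, expensive ones last,
    and reject when in doubt (false positives are costly on small samples)."""
    words = set(tweet["text"].lower().split())
    if words & NON_FARSI_MARKERS:        # cheap negative filter: reject early
        return False
    if words & FARSI_KEYWORDS:           # cheap positive filter: accept early
        return True
    if detect_language is not None:      # expensive fallback classifier
        return detect_language(tweet["text"]) == "fa"
    return False                         # when in doubt, count as non-Farsi
```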
>
>
> Don't hesitate to ask for any clarification!
>
> Pascal Juergens
> Graduate Student / Mass Communication
> U of Mainz, Germany
>
> On Jul 3, 2010, at 0:36 , Lucas Vickers wrote:
>
> > Hello,
> >
> > I am trying to create an app that will show tweets and trends in
> > Farsi, for native speakers.  I would like to somehow get a sample
> > 'garden hose' of Farsi-based tweets, but I have been unable to come up
> > with an elegant solution.
> >
> > I see the following options:
> >
> > - Sample all tweets, and run a language detection algorithm on the
> > tweet to determine which are/could be Farsi.
> >  * Problem: only a very small percentage of the tweets will be in Farsi
> >
> > - Use the location filter to try and sample tweets from countries that
> > are known to speak Farsi, and then run a language detection algorithm
> > on the tweets.
> >  * Problem: I seem to be limited in the size of the coordinate box I
> > can provide; I cannot even cover all of Iran, for example.
> >
> > - Filter on a common Farsi term.
> >  * Problem: this will limit my results to only tweets containing that term
> >
> > - Search for language = Farsi.
> >   * Problem: not a stream; I will need to keep searching.
> >
> > Of the options I mentioned, I think the one that makes the most sense is
> > to search for tweets where language=farsi, and use since_id to
> > keep my results fresh.  Given this method, I have three questions:
> > 1 - since_id, I imagine, is the highest tweet_id from the previous
> > result set?
> > 2 - How often can I search (given API limits, of course) in order to
> > ensure I get new data?
> > 3 - Will the language filter give me users whose default
> > language is Farsi, or will it actually find tweets in Farsi?
> >
> > I am aware that the user can select their native language in the user
> > profile, but I also know this is not 100% reliable.
> >
> > Can anyone think of a more elegant solution?
> > Are there any hidden/experimental language type filters available to
> > us?
> >
> > Thanks!
> > Lucas
>
>
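For reference, the since_id polling plan Lucas describes above would look roughly like this (a sketch; `search_farsi` is a placeholder for a real search call that returns matching tweets, and the loop answers his first question by carrying the highest id of the previous batch forward):

```python
import time

def poll_new_tweets(search_farsi, interval=60, rounds=3):
    """Repeatedly search, passing the highest tweet id seen so far as
    since_id so each call returns only tweets newer than the last batch."""
    since_id, collected = None, []
    for _ in range(rounds):
        results = search_farsi(since_id=since_id)
        if results:
            # since_id for the next call is the highest id in this batch
            since_id = max(t["id"] for t in results)
            collected.extend(results)
        time.sleep(interval)  # pace calls to stay within API rate limits
    return collected
```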


-- 
Furkan Kuru
