Re: [twitter-dev] Farsi Twitter App

2010-07-06 Thread Lucas Vickers
Thank you everyone.

You've given me quite a few good options to look into.

Lucas

On Mon, Jul 5, 2010 at 5:57 AM, Jean-Charles Campagne wrote:
> Hello Lucas,
>
> We do not provide, yet, exactly what you are looking for, but for now
> we might help you on the language filtering part.
> We provide an API for language and location filtering for
> micro-messages (Tweets and Facebook messages, etc.).
>
> You'll find more info on the API website: http://developer.semiocast.com
>
> Regarding the feature you are looking for, we made a request to
> Twitter to be able to redistribute a "filtered API", so we will be
> able to provide something closer to what you are looking for. You can,
> more or less, achieve the same today with our current state of the API
> but it'll be more plumbing on your side.
>
>
> Best regards,
> Jean-Charles Campagne
> Semiocast
>
> On Sat, Jul 3, 2010 at 12:36 AM, Lucas Vickers  wrote:
>> Hello,
>>
>> I am trying to create an app that will show tweets and trends in
>> Farsi, for native speakers.  I would like to somehow get a sample
>> 'garden hose' of Farsi based tweets, but I am unable to come up with
>> an elegant solution.
>>
>


Re: [twitter-dev] Farsi Twitter App

2010-07-05 Thread Jean-Charles Campagne
Hello Lucas,

We do not provide, yet, exactly what you are looking for, but for now
we might help you on the language filtering part.
We provide an API for language and location filtering for
micro-messages (Tweets and Facebook messages, etc.).

You'll find more info on the API website: http://developer.semiocast.com

Regarding the feature you are looking for, we made a request to
Twitter to be able to redistribute a "filtered API", so we will be
able to provide something closer to what you are looking for. You can,
more or less, achieve the same today with our current state of the API
but it'll be more plumbing on your side.


Best regards,
Jean-Charles Campagne
Semiocast

On Sat, Jul 3, 2010 at 12:36 AM, Lucas Vickers  wrote:
> Hello,
>
> I am trying to create an app that will show tweets and trends in
> Farsi, for native speakers.  I would like to somehow get a sample
> 'garden hose' of Farsi based tweets, but I am unable to come up with
> an elegant solution.
>


Re: [twitter-dev] Farsi Twitter App

2010-07-04 Thread Furkan Kuru
You are right. Separate subpopulations are out of our reach.

Apart from following/friendship connections, we also look at mentions and
follow those as well.
If a newcomer or someone from another population mentions one of the people in
our network, the tweet will reach us, and we can test that user and add them
as well.

Thank you, I will look at the paper.


2010/7/4 Pascal Jürgens 

> Interesting. Your method is similar to the breadth-first crawl that many
> people do (for example, see the academic paper by Kwak et al. 2010).
>
> You have to keep in mind, however, that you are only crawling the giant
> component of the network, the connected part. If there are any turkish users
> who have their *separate* subpopulation, which is not connected to the rest,
> you won't find those.
>
> You could easily find those with a sample stream. Although I have to admit
> that the number of non-connected users is not so big, no one has really
> tested that so far.
>
> Pascal
>
> On Jul 3, 2010, at 20:00 , Furkan Kuru wrote:
>
> We have implemented the Turkish version: Twitturk
> http://twitturk.com/home/lang/en
>
>
> We skipped the first three steps but started with a few Turkish users and
> crawled all the network and for each new user we tested if the description
> or latest tweets are in Turkish language.
>
> We have almost 100.000 Turkish users identified so far.
>
> Using stream api we collect their tweets and we find out the popular people
> and key-words, top tweets (most retweeted ones) among Turkish people.
>
>
>


-- 
Furkan Kuru


Re: [twitter-dev] Farsi Twitter App

2010-07-04 Thread Pascal Jürgens
Interesting. Your method is similar to the breadth-first crawl that many people 
do (for example, see the academic paper by Kwak et al. 2010).

You have to keep in mind, however, that you are only crawling the giant
component of the network, the connected part. If there are any Turkish users
who form their own *separate* subpopulation, not connected to the rest,
you won't find them.

You could easily find those with a sample stream. I have to admit, though, that
while the number of non-connected users is probably not large, no one has really
tested that so far.
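
To illustrate the point about the giant component, here is a toy sketch. It assumes a tiny made-up follower graph held in a dictionary; the user names and the `reachable` helper are hypothetical and not taken from any real crawl. A breadth-first crawl started from a seed only ever visits the seed's connected component:

```python
from collections import deque

def reachable(graph, seeds):
    """Return every user reachable from the seeds by breadth-first search."""
    seen = set(seeds)
    queue = deque(seeds)
    while queue:
        user = queue.popleft()
        for neighbour in graph.get(user, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return seen

# Hypothetical graph: a-b-c are connected, x-y form a separate island.
graph = {
    "a": ["b"], "b": ["a", "c"], "c": ["b"],
    "x": ["y"], "y": ["x"],
}

found = reachable(graph, ["a"])
missed = set(graph) - found
print(sorted(found))   # ['a', 'b', 'c']
print(sorted(missed))  # ['x', 'y']
```

The island x-y would only turn up through some independent source, such as a random sample of the stream.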

Pascal

On Jul 3, 2010, at 20:00 , Furkan Kuru wrote:

> We have implemented the Turkish version: Twitturk
> http://twitturk.com/home/lang/en
> 
> We skipped the first three steps but started with a few Turkish users and 
> crawled all the network and for each new user we tested if the description or 
> latest tweets are in Turkish language.
> 
> We have almost 100.000 Turkish users identified so far.
> 
> Using stream api we collect their tweets and we find out the popular people 
> and key-words, top tweets (most retweeted ones) among Turkish people.



Re: [twitter-dev] Farsi Twitter App

2010-07-03 Thread Furkan Kuru
We have implemented the Turkish version: Twitturk
http://twitturk.com/home/lang/en

We skipped the first three steps and instead started with a few Turkish users
and crawled the whole network; for each new user we tested whether the
description or latest tweets are in Turkish.

We have almost 100.000 Turkish users identified so far.

Using the Streaming API we collect their tweets and find the popular people,
keywords, and top tweets (the most retweeted ones) among Turkish users.
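
A rough sketch of a crawl like the one described above, under several assumptions: the graph and profile texts are made-up in-memory data, `looks_turkish` is a crude stand-in for a real language test (it only checks for a few letters typical of Turkish), and expanding only users who pass the test is an assumption made here to keep the crawl bounded, not necessarily what Twitturk does.

```python
from collections import deque

# Crude stand-in for a real language test: letters typical of Turkish.
TURKISH_LETTERS = set("\u011f\u0131\u0130\u015f\u015e\u011e")  # ğ ı İ ş Ş Ğ

def looks_turkish(text):
    return any(ch in TURKISH_LETTERS for ch in text)

def crawl(seeds, neighbours, profile_text):
    """Breadth-first crawl that only expands users who pass the language test."""
    accepted = set(seeds)
    tested = set(seeds)
    queue = deque(seeds)
    while queue:
        user = queue.popleft()
        for other in neighbours(user):
            if other in tested:
                continue
            tested.add(other)
            if looks_turkish(profile_text(other)):
                accepted.add(other)
                queue.append(other)
    return accepted

# Hypothetical data standing in for API calls.
graph = {
    "ayse": ["mehmet", "john"],
    "mehmet": ["ayse", "isil"],
    "john": ["kate"],
    "isil": ["mehmet"],
    "kate": ["john"],
}
bios = {
    "ayse": "\u0130stanbul",
    "mehmet": "yaz\u0131l\u0131m geli\u015ftirici",
    "john": "coffee and code",
    "isil": "\u00f6\u011frenci",
    "kate": "runner",
}

users = crawl(["ayse"], lambda u: graph.get(u, []), lambda u: bios.get(u, ""))
print(sorted(users))  # ['ayse', 'isil', 'mehmet']
```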



-- 
Furkan Kuru


Re: [twitter-dev] Farsi Twitter App

2010-07-03 Thread Pascal Jürgens
John,

yes, thanks a lot for the design proposal - that is what inspired my own 
system. I am not primarily filtering by language, however, but by country, so 
I'm using time zone and location data together with a list of cities from 
http://www.geonames.org/

The manual cross-check in my thesis shows that this gets you close to 1 in 
specificity and above .7 in sensitivity.
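
For readers unfamiliar with the two measures, a small sketch of how they are computed from a manual cross-check. The counts below are invented purely for illustration and are not the figures from the thesis:

```python
def sensitivity(tp, fn):
    """Share of true members of the target group that the filter catches."""
    return tp / (tp + fn)

def specificity(tn, fp):
    """Share of non-members that the filter correctly rejects."""
    return tn / (tn + fp)

# Hypothetical cross-check counts: true/false positives and negatives.
tp, fn, tn, fp = 70, 30, 990, 10

print(sensitivity(tp, fn))  # 0.7
print(specificity(tn, fp))  # 0.99
```

High specificity matters most when the target population is a small fraction of the stream, since even a low false-positive rate applied to the huge non-target volume can swamp the true hits.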

From my experience, the key is to develop efficient language-specific tests 
with as low an error rate as possible (this, sadly, largely excludes 
conventional SVM, HMM models etc, because tweets are so short and full of weird 
punctuation).

Pascal

On Jul 3, 2010, at 15:26 , John Kalucki wrote:

> It's great to hear that someone implemented all this. There's a similar 
> technique documented here: 
> http://dev.twitter.com/pages/streaming_api_concepts, under By Language and 
> Country. My suggestion was to start with a list of stop words to build your 
> user corpus -- but I don't know how well Farsi works with track, so random 
> sample method might indeed be better.
> 
> -John Kalucki
> http://twitter.com/jkalucki
> Infrastructure, Twitter Inc.



Re: [twitter-dev] Farsi Twitter App

2010-07-03 Thread John Kalucki
It's great to hear that someone implemented all this. There's a similar
technique documented here:
http://dev.twitter.com/pages/streaming_api_concepts, under By Language and
Country. My suggestion was to start with a list of stop words to build your
user corpus -- but I don't know how well Farsi works with track, so the random
sample method might indeed be better.
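
As a sketch of the stop-word idea, the snippet below only builds the form-encoded `track` value (a comma-separated list of terms) from a word list. The `track_body` helper and the three Turkish stop words are illustrative assumptions; whether Farsi terms match well with track is exactly the open question above.

```python
from urllib.parse import urlencode

def track_body(stop_words):
    """Build a form-encoded 'track' parameter from a list of stop words."""
    cleaned = sorted({w.strip() for w in stop_words if w.strip()})
    return urlencode({"track": ",".join(cleaned)})

# Hypothetical stop-word list; blanks and surrounding spaces are dropped.
print(track_body(["ve", "bir", " bu ", ""]))  # track=bir%2Cbu%2Cve
```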

-John Kalucki
http://twitter.com/jkalucki
Infrastructure, Twitter Inc.






Re: [twitter-dev] Farsi Twitter App

2010-07-03 Thread Pascal Jürgens
Hi Lucas,

as someone who has approached a similar problem, my recommendation would be to
track users. In order to get results quickly (rather than every few hours via
user timeline calls), you need streaming access, which is a bit more
complicated. I implemented such a system in order to track the German-speaking
population of Twitter users, and it works extremely well.

1) Get access to the sample stream (5% or 15% type) (warning: the 15% stream is
~10GB+ a day)

2) Construct an efficient cascading language filter, i.e.:
- first test the computationally cheap AND precise attributes, such as a list
of known Farsi-only keywords or the location box
- if those attribute tests are negative, perform more computationally expensive
tests
- if in doubt, count it as non-Farsi! False positives will kill you if you
sample a very small population!

3) With said filter, identify the accounts using Farsi

4) Perform a first-degree network sweep and scan all their friends+followers,
since those have a higher likelihood to speak Farsi as well

5) Compile a list of those known users

6) Track those users with the shadow role stream (80.000 users) or higher.

If your language detection code is not efficient enough, you might want to
include a cheap, fast and precise negative filter of known non-Farsi
attributes. Test that one before all the others and you should be able to
filter out a large part of the volume.
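
The cascade in step 2, together with the negative pre-filter, could be sketched roughly as follows. Everything specific here is an assumption for illustration: the two-word keyword list is a placeholder, the four letters (pe, che, zhe, gaf) are used in Persian but also in other languages such as Urdu, and `expensive_test` stands in for whatever heavier detector you have. By default it answers "no", following the "if in doubt, count it as non-Farsi" rule.

```python
import re

# Stage 0 (negative filter): no Arabic-script character at all means non-Farsi.
ARABIC_SCRIPT = re.compile("[\u0600-\u06ff]")

# Stage 1 (cheap positive tests). Both lists are illustrative placeholders.
FARSI_KEYWORDS = {
    "\u0627\u0633\u062a",        # "ast" (is)
    "\u0628\u0631\u0627\u06cc",  # "baraye" (for)
}
PERSIAN_LETTERS = set("\u067e\u0686\u0698\u06af")  # pe, che, zhe, gaf

def is_farsi(text, expensive_test=lambda t: False):
    """Cascading filter: cheapest tests first, the expensive test last."""
    if not ARABIC_SCRIPT.search(text):
        return False  # cheap negative filter removes most of the volume
    if set(text.split()) & FARSI_KEYWORDS:
        return True
    if any(ch in PERSIAN_LETTERS for ch in text):
        return True
    # Only ambiguous Arabic-script text reaches the expensive stage.
    return expensive_test(text)

print(is_farsi("hello world"))                     # False
print(is_farsi("\u0686\u0627\u06cc"))              # True  (contains che)
print(is_farsi("\u0645\u0631\u062d\u0628\u0627"))  # False (Arabic "marhaba")
```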


Don't hesitate to ask for any clarification!

Pascal Juergens
Graduate Student / Mass Communication
U of Mainz, Germany




[twitter-dev] Farsi Twitter App

2010-07-02 Thread Lucas Vickers
Hello,

I am trying to create an app that will show tweets and trends in
Farsi, for native speakers.  I would like to somehow get a sample
'garden hose' of Farsi based tweets, but I am unable to come up with
an elegant solution.

I see the following options:

- Sample all tweets, and run a language detection algorithm on each
tweet to determine which are/could be Farsi.
  * Problem: only a very small % of the tweets will be in Farsi

- Use the location filter to try and sample tweets from countries where
Farsi is known to be spoken, and then run a language detection algorithm
on the tweets.
  * Problem: I seem to be limited on the size of the coordinate box I
can provide.  I cannot even cover all of Iran, for example.

- Filter on a standard Farsi term.
  * Problem: will limit my results to only tweets with this term

- Search for language = farsi
  * Problem: Not a stream, I will need to keep searching.

Of the options I mentioned, I think what makes the most sense is
to search for tweets where language=farsi, and use the since_id to
keep my results new.  Given this method, I have three questions:
1 - I imagine since_id is the highest tweet_id from the previous
result set?
2 - How often can I search (given API limits of course) in order to
ensure I get new data?
3 - Will the language filter provide me with users whose default
language is Farsi, or will it actually find tweets in Farsi?
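
To make the polling idea concrete, here is a minimal sketch of the bookkeeping. It assumes, as question 1 supposes, that since_id should be the highest tweet id seen so far. The `search` argument is a hypothetical callable standing in for the real search request, and `fake_search` with its three tweets is made-up test data:

```python
def poll_once(search, since_id):
    """Run one search and return (new_tweets, updated_since_id)."""
    tweets = search(since_id)
    if not tweets:
        return [], since_id
    newest = max(t["id"] for t in tweets)
    return tweets, max(newest, since_id)

# Hypothetical stand-in for the search call: returns tweets newer than since_id.
def fake_search(since_id):
    data = [{"id": 101}, {"id": 105}, {"id": 103}]
    return [t for t in data if t["id"] > since_id]

tweets, since_id = poll_once(fake_search, 0)
print(len(tweets), since_id)  # 3 105

tweets, since_id = poll_once(fake_search, since_id)
print(len(tweets), since_id)  # 0 105
```

How often to call `poll_once` in a loop depends on the rate limits asked about in question 2.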

I am aware that the user can select their native language in the user
profile, but I also know this is not 100% reliable.

Can anyone think of a more elegant solution?
Are there any hidden/experimental language type filters available to
us?

Thanks!
Lucas