Yes, each "bigram" in that demo is only two characters long, which could
separate different character sets. -Xiangrui

On Wed, Oct 1, 2014 at 2:54 PM, Liquan Pei <liquan...@gmail.com> wrote:
> The program computes hashed bigram frequencies, normalized by the total
> number of bigrams, and then filters out the zero values. Hashing is an
> effective trick for vectorizing features. Take a look at
> http://en.wikipedia.org/wiki/Feature_hashing
>
> Liquan
>
> On Wed, Oct 1, 2014 at 2:18 PM, Soumya Simanta <soumya.sima...@gmail.com>
> wrote:
>>
>> I'm trying to understand the intuition behind the features method that
>> Aaron used in one of his demos. I believe this feature will just work for
>> detecting the character set (i.e., language used).
>>
>> Can someone help?
>>
>>
>> def featurize(s: String): Vector = {
>>   val n = 1000
>>   val result = new Array[Double](n)
>>   val bigrams = s.sliding(2).toArray
>>
>>   for (h <- bigrams.map(_.hashCode % n)) {
>>     result(h) += 1.0 / bigrams.length
>>   }
>>
>>   Vectors.sparse(n, result.zipWithIndex.filter(_._1 != 0).map(_.swap))
>> }
>>
>>
>>
>
>
>
> --
> Liquan Pei
> Department of Physics
> University of Massachusetts Amherst
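
For anyone following along without a Spark setup, here is a minimal sketch of
the same hashed-bigram idea using only the Scala standard library. The names
BigramHashing and hashedBigrams are hypothetical, and returning a Map instead
of an MLlib sparse vector is an assumption made to keep the example
self-contained. It also swaps the plain % for floorMod, which is not in the
original demo: two-character strings always hash to a non-negative value, but
String.hashCode can be negative for longer n-grams.

```scala
// Hypothetical standalone variant of the featurize function quoted above.
object BigramHashing {
  // Maps each non-empty bucket index to its share of the string's bigrams.
  def hashedBigrams(s: String, n: Int = 1000): Map[Int, Double] = {
    // Every window of two adjacent characters, e.g. "abab" -> ab, ba, ab.
    val bigrams = s.sliding(2).toArray
    bigrams
      // Hash each bigram into one of n buckets; floorMod keeps the index
      // non-negative even if hashCode overflows to a negative value.
      .map(b => java.lang.Math.floorMod(b.hashCode, n))
      // Count hits per bucket, then normalize by the number of bigrams.
      .groupBy(identity)
      .map { case (bucket, hits) => bucket -> hits.length.toDouble / bigrams.length }
  }
}
```

Calling BigramHashing.hashedBigrams("abab") yields two buckets, one holding
2/3 of the weight (for "ab") and one holding 1/3 (for "ba"). Buckets that are
never hit simply do not appear in the map, which plays the same role as the
filter on zero values in the original. Bigrams from different scripts tend to
land in different buckets, though distinct bigrams can still collide.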

