Yes, each "bigram" in that demo is just two consecutive characters, which could be enough to separate different character sets.

-Xiangrui
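To make the point concrete, here is a minimal sketch of the same idea in plain Scala, so it runs without Spark. It is not the demo's code. The object name `BigramHashingSketch`, the `n` parameter default, the use of `floorMod`, and the list-of-pairs return type (instead of an MLlib sparse vector) are all assumptions made for illustration.

```scala
object BigramHashingSketch {
  // Plain-Scala stand-in for the quoted featurize (an assumption, not the demo's code):
  // returns (bucket, weight) pairs instead of an MLlib sparse vector.
  def featurize(s: String, n: Int = 1000): List[(Int, Double)] = {
    // Every window of two consecutive characters is one "bigram".
    val bigrams = s.sliding(2).toArray
    val result = new Array[Double](n)

    // Hash each bigram into one of n buckets; floorMod keeps the index
    // non-negative for any hash value.
    for (h <- bigrams.map(b => java.lang.Math.floorMod(b.hashCode, n))) {
      // Each occurrence adds 1 / (number of bigrams), so the weights sum to 1.
      result(h) += 1.0 / bigrams.length
    }

    // Keep only the non-zero buckets as (index, weight) pairs.
    result.zipWithIndex.filter(_._1 != 0).map(_.swap).toList
  }

  def main(args: Array[String]): Unit = {
    println(featurize("ab")) // List((105,1.0)): "ab".hashCode is 97 * 31 + 98 = 3105

    val english = featurize("hello world")
    val japanese = featurize("こんにちは世界")
    // Buckets hit by both strings; hash collisions across scripts are still possible.
    println(english.map(_._1).toSet intersect japanese.map(_._1).toSet)
  }
}
```

Text in different scripts is made of characters from different code-point ranges, so their bigrams tend to land in mostly different buckets. That is the intuition for why such a simple feature can tell character sets apart, although with only 1000 buckets some collisions are to be expected.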
On Wed, Oct 1, 2014 at 2:54 PM, Liquan Pei <liquan...@gmail.com> wrote:
> The program computes hashed bigram frequencies, normalized by the total
> number of bigrams, then filters out zero values. Hashing is an effective
> trick for vectorizing features. Take a look at
> http://en.wikipedia.org/wiki/Feature_hashing
>
> Liquan
>
> On Wed, Oct 1, 2014 at 2:18 PM, Soumya Simanta <soumya.sima...@gmail.com>
> wrote:
>>
>> I'm trying to understand the intuition behind the featurize method that
>> Aaron used in one of his demos. I believe this feature will only work for
>> detecting the character set (i.e., the language used).
>>
>> Can someone help?
>>
>>
>> def featurize(s: String): Vector = {
>>   val n = 1000
>>   val result = new Array[Double](n)
>>   val bigrams = s.sliding(2).toArray
>>
>>   for (h <- bigrams.map(_.hashCode % n)) {
>>     result(h) += 1.0 / bigrams.length
>>   }
>>
>>   Vectors.sparse(n, result.zipWithIndex.filter(_._1 != 0).map(_.swap))
>> }
>>
>
> --
> Liquan Pei
> Department of Physics
> University of Massachusetts Amherst