To add to Ted's reply: Mahout has traditionally offered bigram/trigram analysis as part of its TF-IDF conversion. This is a step away from the bag-of-words model: order-sensitive, statistically stable combinations of two or three words are collapsed into terms of their own. However, this has not been ported to the Spark/H2O/Flink engines and is available only as a legacy MapReduce algorithm.
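To make the n-gram idea concrete, here is a minimal Python sketch of one way to find "statistically stable" bigrams and fold them into single terms. It is not Mahout's code. It assumes the stability test is a log-likelihood ratio over a 2x2 contingency table of ordered word pairs, and the function names (`llr`, `bigram_scores`, `merge_bigrams`), the underscore joiner, and the toy documents are all made up for illustration.

```python
import math
from collections import Counter


def x_log_x(x):
    # x * ln(x), with the usual convention that 0 * ln(0) = 0.
    return x * math.log(x) if x > 0 else 0.0


def entropy(*counts):
    # Unnormalized entropy of a list of counts.
    return x_log_x(sum(counts)) - sum(x_log_x(c) for c in counts)


def llr(k11, k12, k21, k22):
    # Log-likelihood ratio for a 2x2 table:
    # k11 = count(a followed by b), k12 = count(a followed by not-b),
    # k21 = count(not-a followed by b), k22 = everything else.
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    return max(0.0, 2.0 * (row + col - mat))


def bigram_scores(docs):
    # docs: list of token lists. Pairs are ordered, so (a, b) != (b, a).
    first, second, pairs = Counter(), Counter(), Counter()
    for tokens in docs:
        for a, b in zip(tokens, tokens[1:]):
            pairs[(a, b)] += 1
            first[a] += 1
            second[b] += 1
    total = sum(pairs.values())
    scores = {}
    for (a, b), k11 in pairs.items():
        k12 = first[a] - k11
        k21 = second[b] - k11
        k22 = total - k11 - k12 - k21
        scores[(a, b)] = llr(k11, k12, k21, k22)
    return scores


def merge_bigrams(tokens, keep):
    # Replace each kept bigram with a single joined term (left to right).
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in keep:
            out.append(tokens[i] + "_" + tokens[i + 1])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out


if __name__ == "__main__":
    docs = [
        ["new", "york", "is", "big"],
        ["i", "love", "new", "york"],
        ["big", "new", "york", "city"],
    ]
    scores = bigram_scores(docs)
    best = max(scores, key=scores.get)
    print(best, merge_bigrams(docs[0], {best}))
```

In a real pipeline you would keep every pair scoring above some threshold rather than only the best one, and then feed the merged token stream into the usual TF-IDF weighting.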
On Sat, Jun 4, 2016 at 2:14 AM, forme book <forbookm...@gmail.com> wrote:
> Hi,
>
> I'm starting to study text processing, and I see that to compare two texts
> it is possible to obtain a vector model through the TF-IDF technique.
>
> With Mahout it is possible to create vectors from text using
> lucene.vector which, if I have understood correctly, takes a Lucene index
> and maps it to TF-IDF vectors.
>
> Lucene already has this implemented by default. What I struggle to
> understand is the advantage of having lucene.vector in Mahout when Lucene
> offers that feature out of the box.
>
> Maybe I'm missing something big, but what is the connection between them?
> Could you please explain a possible use case?
>
> Thanks for the help
>
> Richard
>