Hello Pratik,

We would indeed use ShingleFilter for this. If you only want bigrams/shingles, don't forget to disable outputUnigrams and set both shingle size limits (minShingleSize and maxShingleSize) to 2.
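
To illustrate what that configuration would emit for the example sentence, here is a minimal plain-Python sketch of word-level shingling. It is not Solr's or Lucene's implementation: the function name `word_shingles` and its parameters are hypothetical, chosen only to mirror the two shingle size limits and the outputUnigrams switch, and it tokenizes on whitespace only.

```python
def word_shingles(text, min_size=2, max_size=2, output_unigrams=False):
    """Return word-level shingles (n-grams across terms) for a text.

    Illustrative only: a hypothetical helper, not the Solr ShingleFilter.
    Tokens are produced by a plain whitespace split.
    """
    tokens = text.split()
    shingles = []
    for start in range(len(tokens)):
        # Optionally emit the single token itself (analogous to outputUnigrams).
        if output_unigrams:
            shingles.append(tokens[start])
        # Emit every shingle of an allowed size that starts at this position.
        for size in range(min_size, max_size + 1):
            if start + size <= len(tokens):
                shingles.append(" ".join(tokens[start:start + size]))
    return shingles


if __name__ == "__main__":
    # Bigrams only: both size limits set to 2, unigrams disabled.
    print(word_shingles("I like machine learning"))
```

With both limits at 2 and unigrams disabled, only the two-word phrases are produced, which is the behaviour wanted for the phrase field.
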

Regards,
Markus

-----Original message-----
> From:Pratik Patel <[email protected]>
> Sent: Thursday 15th November 2018 17:00
> To: [email protected]
> Subject: Extracting important multi term phrases from the text
>
> Hello Everyone,
>
> Standard way of tokenizing in solr would divide the text by white space in
> solr.
>
> Is there a way by which we can index multi-term phrases like "Machine
> Learning" instead of "Machine", "Learning"?
> Is it possible to create a specific field type for such phrases which has
> its own indexing pipeline? I am open to storing n-grams but these n-grams
> would be across terms and not just one term? In other words, I don't want
> to store n-grams of the term "machine", I want to store n-grams for a
> sentence like below.
>
> "I like machine learning" --> "I like", "like machine", "machine learning"
> and so on.....
>
> It seems like Shingle Filter (
> https://lucene.apache.org/solr/guide/6_6/filter-descriptions.html#FilterDescriptions-ShingleFilter)
> may be used for this. Is there a better alternative?
>
> I want to use this field as an input to Semantic Knowledge Graph. The
> plugin works great for words. But now I want to use it for phrases. Any
> idea around this would be really helpful.
>
> Thanks a lot!
>
> - Pratik
>
