Hello Pratik,

We would indeed use ShingleFilter for this. If you only want bigram shingles,
don't forget to set outputUnigrams to false and set both shingle size limits
(minShingleSize and maxShingleSize) to 2.
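
As a rough sketch only: the field type name "text_shingles" and the choice of
StandardTokenizerFactory plus LowerCaseFilterFactory are assumptions for
illustration, so adapt them to your own schema. Such a field type could look
something like this:

```xml
<!-- Hypothetical field type; the name and the tokenizer/lowercase choices are assumptions -->
<fieldType name="text_shingles" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Split the text into word tokens first -->
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- Optional: normalize case before building shingles -->
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Emit only two-word shingles, without the single-word tokens -->
    <filter class="solr.ShingleFilterFactory"
            minShingleSize="2"
            maxShingleSize="2"
            outputUnigrams="false"/>
  </analyzer>
</fieldType>
```

With an analyzer like this, "I like machine learning" should be indexed as
"i like", "like machine" and "machine learning".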

Regards,
Markus

-----Original message-----
> From: Pratik Patel <[email protected]>
> Sent: Thursday 15th November 2018 17:00
> To: [email protected]
> Subject: Extracting important multi term phrases from the text
> 
> Hello Everyone,
> 
> The standard way of tokenizing in Solr divides the text on whitespace.
> 
> Is there a way by which we can index multi-term phrases like "Machine
> Learning" instead of "Machine", "Learning"?
> Is it possible to create a specific field type for such phrases which has
> its own indexing pipeline? I am open to storing n-grams, but these n-grams
> would have to span multiple terms rather than the characters of a single
> term. In other words, I don't want to store n-grams of the term "machine";
> I want to store word-level n-grams for a sentence like the one below.
> 
> "I like machine learning" --> "I like", "like machine", "machine learning"
> and so on.....
> 
> It seems like Shingle Filter (
> https://lucene.apache.org/solr/guide/6_6/filter-descriptions.html#FilterDescriptions-ShingleFilter)
> may be used for this. Is there a better alternative?
> 
> I want to use this field as an input to Semantic Knowledge Graph. The
> plugin works great for words. But now I want to use it for phrases. Any
> idea around this would be really helpful.
> 
> Thanks a lot!
> 
> - Pratik
> 
