I think the underscore actually comes from ShingleFilter (its fillerToken
parameter). Have you tried setting it to an empty string?
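
Something along these lines might work. This is only an untested sketch: the
field type name "text_shingles", the tokenizer choice and the stopwords.txt
file are placeholders I made up, not something from your setup. It also
applies Markus's advice (no unigrams, both shingle sizes set to 2).

```xml
<!-- Hypothetical field type; the name and the stop word file are assumptions. -->
<fieldType name="text_shingles" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Removed stop words leave position gaps behind. -->
    <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
    <!-- Bigrams only; fillerToken="" replaces the default "_" used for gaps. -->
    <filter class="solr.ShingleFilterFactory"
            minShingleSize="2" maxShingleSize="2"
            outputUnigrams="false" fillerToken=""/>
    <!-- Assumption: shingles next to a gap may keep a stray separator space. -->
    <filter class="solr.TrimFilterFactory"/>
  </analyzer>
</fieldType>
```

With an empty filler, a shingle that spans a removed stop word will probably
collapse to a single word after trimming, so it is worth checking the output
on the Analysis screen before reindexing.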

Regards,
   Alex.
On Thu, 15 Nov 2018 at 17:16, Pratik Patel <[email protected]> wrote:
>
> Hi Markus,
>
> Thanks for the reply. I tried using ShingleFilter and it seems to
> be working. However, I am hitting an issue when it is used with
> StopWordFilter. StopWordFilter leaves an underscore "_" for removed words
> and it kind of screws up the data in index.
>
> I tried setting enablePositionIncrements="false" for stop word filter but
> that parameter only works for lucene version 4.3 or earlier. Looks like
> it's an open issue in lucene
> https://issues.apache.org/jira/browse/LUCENE-4065
>
> For now, I am trying to find a workaround using PatternReplaceFilterFactory.
>
> Regards,
> Pratik
>
> On Thu, Nov 15, 2018 at 4:15 PM Markus Jelsma <[email protected]>
> wrote:
>
> > Hello Pratik,
> >
> > We would use ShingleFilter for this indeed. If you only want
> > bigrams/shingles, don't forget to disable outputUnigrams and set both
> > shingle size limits to 2.
> >
> > Regards,
> > Markus
> >
> > -----Original message-----
> > > From:Pratik Patel <[email protected]>
> > > Sent: Thursday 15th November 2018 17:00
> > > To: [email protected]
> > > Subject: Extracting important multi term phrases from the text
> > >
> > > Hello Everyone,
> > >
> > > The standard way of tokenizing in Solr divides the text on white
> > > space.
> > >
> > > Is there a way by which we can index multi-term phrases like "Machine
> > > Learning" instead of "Machine", "Learning"?
> > > Is it possible to create a specific field type for such phrases which has
> > > its own indexing pipeline? I am open to storing n-grams but these n-grams
> > > would be across terms and not just one term? In other words, I don't want
> > > to store n-grams of the term "machine", I want to store n-grams for a
> > > sentence like below.
> > >
> > > "I like machine learning" --> "I like", "like machine",
> > > "machine learning" and so on.....
> > >
> > > It seems like Shingle Filter
> > > (https://lucene.apache.org/solr/guide/6_6/filter-descriptions.html#FilterDescriptions-ShingleFilter)
> > > may be used for this. Is there a better alternative?
> > >
> > > I want to use this field as an input to Semantic Knowledge Graph. The
> > > plugin works great for words. But now I want to use it for phrases. Any
> > > idea around this would be really helpful.
> > >
> > > Thanks a lot!
> > >
> > > - Pratik
> > >
> >
