Re: Extracting important multi term phrases from the text

David Hastings Fri, 16 Nov 2018 05:37:08 -0800

Which function of the SKG are you using?  significantTerms?

On Thu, Nov 15, 2018 at 7:09 PM Alexandre Rafalovitch <[email protected]>
wrote:


> I think the underscore actually comes from the Shingles (parameter
> fillerToken). Have you tried setting it to empty string?
>
> Regards,
>    Alex.
> On Thu, 15 Nov 2018 at 17:16, Pratik Patel <[email protected]> wrote:
> >
> > Hi Markus,
> >
> > Thanks for the reply. I tried using ShingleFilter and it seems to
> > be working. However, I am hitting an issue when it is used with
> > StopWordFilter. StopWordFilter leaves an underscore "_" for removed words
> > and it kind of screws up the data in index.
> >
> > I tried setting enablePositionIncrements="false" for stop word filter but
> > that parameter only works for lucene version 4.3 or earlier. Looks like
> > it's an open issue in lucene
> > https://issues.apache.org/jira/browse/LUCENE-4065
> >
> > For now, I am trying to find a workaround using
> > PatternReplaceFilterFactory.
> >
> > Regards,
> > Pratik
> >
> > On Thu, Nov 15, 2018 at 4:15 PM Markus Jelsma <[email protected]>
> > wrote:
> >
> > > Hello Pratik,
> > >
> > > We would use ShingleFilter for this indeed. If you only want
> > > bigrams/shingles, don't forget to disable outputUnigrams and set both
> > > shingle size limits to 2.
> > >
> > > Regards,
> > > Markus
> > >
> > > -----Original message-----
> > > > From:Pratik Patel <[email protected]>
> > > > Sent: Thursday 15th November 2018 17:00
> > > > To: [email protected]
> > > > Subject: Extracting important multi term phrases from the text
> > > >
> > > > Hello Everyone,
> > > >
> > > > The standard way of tokenizing in Solr divides the text on
> > > > whitespace.
> > > >
> > > > Is there a way by which we can index multi-term phrases like "Machine
> > > > Learning" instead of "Machine", "Learning"?
> > > > Is it possible to create a specific field type for such phrases
> > > > which has its own indexing pipeline? I am open to storing n-grams,
> > > > but these n-grams would span terms rather than stay within one
> > > > term. In other words, I don't want to store n-grams of the term
> > > > "machine"; I want to store n-grams for a sentence like the one
> > > > below.
> > > >
> > > > "I like machine learning" --> "I like", "like machine",
> > > > "machine learning" and so on.
> > > >
> > > > It seems like Shingle Filter
> > > > (https://lucene.apache.org/solr/guide/6_6/filter-descriptions.html#FilterDescriptions-ShingleFilter)
> > > > may be used for this. Is there a better alternative?
> > > >
> > > > I want to use this field as an input to Semantic Knowledge Graph. The
> > > > plugin works great for words. But now I want to use it for phrases.
> > > > Any idea around this would be really helpful.
> > > >
> > > > Thanks a lot!
> > > >
> > > > - Pratik
> > > >
> > >
>
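
For anyone following the thread, here is a rough Python sketch of the
behaviour being discussed: word-level bigrams ("shingles") and the "_"
filler that appears where a stop word was removed. It is only a simplified
model written for illustration, not Lucene's ShingleFilter code. The stop
list and the function names (remove_stop_words, shingles) are made up for
this example, and skipping all-filler windows is a simplification.

```python
from typing import List, Optional

# Hypothetical stop list, for illustration only.
STOP_WORDS = {"the", "a", "an", "of"}


def remove_stop_words(tokens: List[str]) -> List[Optional[str]]:
    """Replace stop words with None to model the position gap they leave."""
    return [None if t.lower() in STOP_WORDS else t for t in tokens]


def shingles(tokens: List[Optional[str]], size: int = 2,
             filler: str = "_", sep: str = " ") -> List[str]:
    """Build word n-grams of `size`, writing `filler` wherever a gap occurs."""
    out = []
    for i in range(len(tokens) - size + 1):
        window = tokens[i:i + size]
        if all(t is None for t in window):
            continue  # simplification: drop windows made only of gaps
        out.append(sep.join(filler if t is None else t for t in window))
    return out


plain = shingles("I like machine learning".split())
gapped = remove_stop_words("I like the machine learning".split())
with_filler = shingles(gapped)
empty_filler = shingles(gapped, filler="")

print(plain)         # bigrams with no stop words involved
print(with_filler)   # "_" marks the removed stop word
print(empty_filler)  # empty filler still leaves the separator behind
```

In this model, an empty filler removes the underscore but leaves a leading
or trailing separator on the affected shingles, so some trimming or pattern
replacement step may still be needed afterwards.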
