RE: Excluding individual pages?

2019-10-10 Thread Markus Jelsma
Hello Dave,

If you have just one specific page you do not want Nutch to index, or Solr to
show, you can either create a custom IndexingFilter that returns null
(rejecting the document) for that URL, or add a filter query to Solr,
fq=-id:<url>, filtering the specific URL out of the results.
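
A minimal sketch of such a filter (the class name and URL are placeholders;
the plugin also needs its plugin.xml descriptor and an entry in
plugin.includes):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.Text;
    import org.apache.nutch.crawl.CrawlDatum;
    import org.apache.nutch.crawl.Inlinks;
    import org.apache.nutch.indexer.IndexingException;
    import org.apache.nutch.indexer.IndexingFilter;
    import org.apache.nutch.indexer.NutchDocument;
    import org.apache.nutch.parse.Parse;

    public class ExcludeUrlIndexingFilter implements IndexingFilter {

      // Placeholder URL; in practice you would read this from the configuration.
      private static final String EXCLUDED_URL = "https://www.example.com/seeds.html";

      private Configuration conf;

      @Override
      public NutchDocument filter(NutchDocument doc, Parse parse, Text url,
          CrawlDatum datum, Inlinks inlinks) throws IndexingException {
        if (EXCLUDED_URL.equals(url.toString())) {
          return null; // returning null rejects the page from the index
        }
        return doc;
      }

      @Override
      public void setConf(Configuration conf) { this.conf = conf; }

      @Override
      public Configuration getConf() { return conf; }
    }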

If there are more than a few URLs you want to exclude from indexing, and they
have a pattern, you can use regular expressions in the IndexingFilter or in
the Solr filter query.
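
A sketch of the pattern variant, replacing the equality check in the filter
above (the pattern is illustrative only; on the Solr side, string fields also
accept regular-expression filter queries such as fq=-id:/.*seeds.*/):

    import java.util.regex.Pattern;

    // Illustrative pattern: reject every URL under /seeds/ on this host.
    private static final Pattern EXCLUDED =
        Pattern.compile("^https?://www\\.example\\.com/seeds/.*");

    // Inside filter(): reject matching URLs.
    if (EXCLUDED.matcher(url.toString()).matches()) {
      return null;
    }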

Both approaches are manual interventions, and only practical if your set of
pages is small and does not change frequently. If that is not the case, you
need more rigorous tools to detect and reject what we call hub pages or
overview pages.

Regards,
Markus
 
-Original message-
> From:Dave Beckstrom 
> Sent: Thursday 10th October 2019 22:34
> To: user@nutch.apache.org
> Subject: Excluding individual pages?
> 
> Hi Everyone,
> 
> I searched and didn't find an answer.
> 
> Nutch is indexing the content of the page that has the seed URLs in it, and
> then that page shows up in the Solr search results. We don't want that to
> happen.
> 
> Is there a way to have Nutch crawl the seed URL page but not push that page
> into Solr? If not, is there a way to have a particular page excluded from
> the Solr search results? Either way, I'm trying to keep that page out of the
> search results.
> 
> Thank you!
> 
> Dave
> 


Excluding individual pages?

2019-10-10 Thread Dave Beckstrom
Hi Everyone,

I searched and didn't find an answer.

Nutch is indexing the content of the page that has the seed URLs in it, and
then that page shows up in the Solr search results. We don't want that to
happen.

Is there a way to have Nutch crawl the seed URL page but not push that page
into Solr? If not, is there a way to have a particular page excluded from the
Solr search results? Either way, I'm trying to keep that page out of the
search results.

Thank you!

Dave


Re: Nutch excludeNodes Patch

2019-10-10 Thread Dave Beckstrom
Markus,

Thank you so much for the reply!

I made the change to parse-plugins.xml and the plugin is being called now.
That plugin didn't work, so I switched to the blacklist-whitelist plugin, and
I've got it working thanks to your help!
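
For reference, the relevant parse-plugins.xml entries look roughly like this
(only the html/xhtml mappings shown; the alias id follows the default Nutch
configuration):

    <parse-plugins>
      <mimeType name="text/html">
        <plugin id="parse-html" />
      </mimeType>
      <mimeType name="application/xhtml+xml">
        <plugin id="parse-html" />
      </mimeType>
      <aliases>
        <alias name="parse-html"
               extension-id="org.apache.nutch.parse.html.HtmlParser" />
      </aliases>
    </parse-plugins>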

 Dave

On Wed, Oct 9, 2019 at 4:00 PM Markus Jelsma 
wrote:

> Hello Dave,
>
> You have both TikaParser and HtmlParser enabled. This probably means you
> never use HtmlParser but always TikaParser. You can instruct Nutch via
> parse-plugins.xml which Parser impl. to choose based on MIME-type. If you
> select HtmlParser for html and xhtml, Nutch should use HtmlParser instead.
>
> Regards,
> Markus
>
> -Original message-
> > From:Dave Beckstrom 
> > Sent: Wednesday 9th October 2019 22:10
> > To: user@nutch.apache.org
> > Subject: Nutch excludeNodes Patch
> >
> > Hi Everyone!
> >
> >
> > We are running Nutch 1.15.
> >
> > We are trying to implement the nutch-585-excludeNodes.patch described on:
> > https://issues.apache.org/jira/browse/NUTCH-585
> >
> > It's acting like it's not running. We don't get an error when the crawl
> > runs, and no errors in the Hadoop logs; it just doesn't exclude the content
> > from the page.
> >
> > We installed it in the plugins/parse-html directory.
> >
> > We added the following to our nutch-site.xml to exclude div id=sidebar
> >
> > <property>
> >   <name>parser.html.NodesToExclude</name>
> >   <value>div;id;sidebar</value>
> >   <description>
> >   A list of nodes whose content will not be indexed, separated by "|". Use
> >   this to tell the HTML parser to ignore, for example, site navigation text.
> >   Each node has three elements: the first one is the tag name, the second
> >   one the attribute name, the third one the value of the attribute.
> >   Note that nodes with these attributes, and their children, will be
> >   silently ignored by the parser, so verify the indexed content with Luke
> >   to confirm results.
> >   </description>
> > </property>
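> >
> > For illustration, with that setting the parser should skip markup like this
> > (the div contents are a placeholder):
> >
> >   <div id="sidebar">
> >     ... site navigation text we do not want indexed ...
> >   </div>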
> >
> > Here is our plugin.includes property from nutch-site.xml
> >
> > <property>
> >   <name>plugin.includes</name>
> >   <value>exchange-jexl|protocol-http|urlfilter-regex|parse-(html|tika|metatags)|index-(basic|anchor|metadata)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|indexer-solr|urlnormalizer-(pass|regex|basic)</value>
> >   <description>plugins</description>
> > </property>
> >
> > One question I have is: would having Tika configured in nutch-site.xml like
> > the following cause any problems with the parse-html plugin not running?
> >
> > <property>
> >   <name>tika.extractor</name>
> >   <value>boilerpipe</value>
> >   <description>
> >   Which text extraction algorithm to use. Valid values are: boilerpipe or
> >   none.
> >   </description>
> > </property>
> >
> > <property>
> >   <name>tika.extractor.boilerpipe.algorithm</name>
> >   <value>ArticleExtractor</value>
> >   <description>
> >   Which Boilerpipe algorithm to use. Valid values are: DefaultExtractor,
> >   ArticleExtractor or CanolaExtractor.
> >   </description>
> > </property>
> >
> > We don't have a lot to go on to debug the issue.  The plugin has logic to
> > enable logging:
> >
> > if (LOG.isTraceEnabled())
> >   LOG.trace("Stripping " + pNode.getNodeName() + "#" + idNode.getNodeValue());
> >
> > But nothing shows in the log files when we crawl. I updated
> > log4j.properties, setting these two values to TRACE, thinking I had to
> > enable trace before the logging would work:
> >
> >  log4j.logger.org.apache.nutch.crawl.Crawl=TRACE,cmdstdout
> >  log4j.logger.org.apache.nutch.parse.html=TRACE,cmdstdout
> >
> > I reran the crawl and no logging occurred, and of course the content we
> > didn't want crawled and indexed is still showing up in Solr.
> >
> > I could really use some help and suggestions!
> >
> > Thank you!
> >
> > Dave Beckstrom
> >
>
