Nutch failing on SOLR text field

2019-03-26 Thread Dave Beckstrom
Hi Everyone,

This is probably more of a SOLR question but I'm hoping someone might be
able to help.  I'm  using Nutch to crawl and index some content.  It failed
on a SOLR field defined as a text field when it was trying to insert the
following value for the field:

33011-54192-EWHServer1234-3BA9D1CA-05B6-42BA-9D88-BAD970CAEEC6

The field was defined in the schema.xml as:



The error message said it was a RemoteSolrException from the server and
that it was an error adding the field.  I'm pretty certain the issue was
the value being inserted as it worked fine for 100's of pages and then
failed on the one page that had data formatted differently than on other
pages.

From what I was able to find searching, it doesn't look like the length of
the data would be any issue at all for a text field.  I am wondering if the
problem is the dashes (hyphens) in the data?

Any suggestions on how to fix this?  I can delete the collection and
redefine it with a field other than text, if that is the answer.

Thank you!

Dave

-- 
*Fig Leaf Software, Inc.* 
https://www.figleaf.com/ 
  

Full-Service Solutions Integrator








Re: Nutch failing on SOLR text field

2019-03-26 Thread Dave Beckstrom
Hi Jorge,

I'm running Solr 7.3.1 which is compatible with the version of Nutch I'm
running.


Field is defined as:



I think this is the relevant part from the stack trace:

at
org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:626)
at java.lang.Thread.run(Unknown Source)
Caused by: java.lang.NumberFormatException: For input string:
"myfieldname:docid:33011-54192-XXHServer-3BA9D1CA-05B6-42BA-9D88-BAD970CAEEC6"
at java.lang.NumberFormatException.forInputString(Unknown Source)
at java.lang.Long.parseLong(Unknown Source)
at java.lang.Long.parseLong(Unknown Source)
at
org.apache.solr.schema.LongPointField.createField(LongPointField.java:154)
at org.apache.solr.schema.PointField.createFields(PointField.java:250)
at org.apache.solr.update.DocumentBuilder.addField(DocumentBuilder.java:66)
at
org.apache.solr.update.DocumentBuilder.toDocument(DocumentBuilder.java:159)

It seems to be treating it as a number?
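
If so, that points at the schema rather than the data: the LongPointField in the
trace suggests the field Nutch is writing to resolves to a long/point type in the
live schema (via a dynamic rule or field guessing) rather than to the text type I
defined.  A minimal sketch of the direction I'd try in schema.xml -- "docid" is
just a placeholder name here, and it assumes the stock string/text_general field
types are defined:

  <!-- accept arbitrary ID-like values such as
       33011-54192-EWHServer1234-3BA9D1CA-05B6-42BA-9D88-BAD970CAEEC6 -->
  <field name="docid" type="string" stored="true" indexed="true"/>

  <!-- or, if the values should be tokenized for search: -->
  <field name="docid" type="text_general" stored="true" indexed="true"/>

Either way the collection would need to be reloaded and the pages re-indexed for
the change to take effect.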



Best,

Dave

On Tue, Mar 26, 2019 at 5:06 PM Jorge Betancourt 
wrote:

> Hi Dave,
>
> Can you check the Solr logs and post the relevant exception? Also, it would
> be helpful if you attach the definition of the text field in your Solr
> collection.
>
> Best regards,
> Jorge
>
> On Tue, Mar 26, 2019 at 9:41 PM Dave Beckstrom 
> wrote:
>
> > Hi Everyone,
> >
> > This is probably more of a SOLR question but I'm hoping someone might be
> > able to help.  I'm  using Nutch to crawl and index some content.  It
> failed
> > on a SOLR field defined as a text field when it was trying to insert the
> > following value for the field:
> >
> > 33011-54192-EWHServer1234-3BA9D1CA-05B6-42BA-9D88-BAD970CAEEC6
> >
> > The field was defined in the schema.xml as:
> >
> >  > indexed="true"/>
> >
> > The error message said it was a RemoteSolrException from the server and
> > that it was an error adding the field.  I'm pretty certain the issue was
> > the value being inserted as it worked fine for 100's of pages and then
> > failed on the one page that had data formatted differently than on other
> > pages.
> >
> > From what I was able to find searching, it doesn't look like the length
> of
> > the data would be any issue at all for a text field.  I am wondering if
> the
> > problem is the dashes (hyphens) in the data?
> >
> > Any suggestions on how to fix this?  I can delete the collection and
> > redefine it with a field other than text, if that is the answer.
> >
> > Thank you!
> >
> > Dave
> >
> > --
> > *Fig Leaf Software, Inc.*
> > https://www.figleaf.com/
> > <https://www.figleaf.com/>
> >
> > Full-Service Solutions Integrator
> >
> >
> >
> >
> >
> >
> >
>

-- 
*Fig Leaf Software, Inc.* 
https://www.figleaf.com/ 
<https://www.figleaf.com/>  

Full-Service Solutions Integrator








Error Updating Solr

2019-02-28 Thread Dave Beckstrom
I'm getting much closer to getting Nutch and SOLR to play well together.
(Ryan - thanks for your help on my last question.  Your suggestion fixed
that issue)

What is happening now is that Nutch finishes crawling, then calls the
index-writer to update solr.  The SOLR update fails with this message:

2019-02-28 17:34:33,742 WARN  mapred.LocalJobRunner -
job_local966037581_0001
java.lang.Exception:
org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error
from server at http://127.0.0.1:8981/solr/: copyField dest
:'metatag.description_str' is not an explicit field and doesn't match a
dynamicField.

The mapping section of the index writer originally had this value:



I have removed everything from the mapping section of the index-writer and
cycled the SOLR service.  My mapping section now looks like this:


  
  
  

  


I cannot find any reference to "metatag.description_str" in any of the
nutch xml files or the SOLR files.

Any idea of how I can fix this issue?
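
The two directions I'm considering -- assuming the copyField is being generated by
Solr's add-unknown-fields/schemaless handling rather than by Nutch itself, which I
haven't confirmed -- are to give it a matching target or to stop it being generated:

  <!-- in the collection's schema: add a catch-all dynamic field so that
       *_str copyField targets like metatag.description_str resolve -->
  <dynamicField name="*_str" type="string" multiValued="true" indexed="true" stored="false"/>

  <!-- or turn off the add-unknown-fields update chain (the default solrconfig.xml
       gates it on update.autoCreateFields) so Solr stops creating those copyFields -->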

Thank you!

-- 
*Fig Leaf Software, Inc.* 
https://www.figleaf.com/ 
  

Full-Service Solutions Integrator








Configuring Nutch to work with Solr?

2019-02-27 Thread Dave Beckstrom
Hi Everyone,

I'm a developer and I am installing Nutch with Solr for a client.  I've
been reading everything I can get my hands on and I am just not finding the
answers to some questions.   I'm really hoping you can help!

I have Nutch 1.15 and Solr 7.3.1 installed on a Windows server.   Those
appeared to be the most current compatible versions where the Nutch
binaries were available.

The first question I have is regarding the "/conf/schema.xml"  that ships
with Nutch.  As I understand it, that file needs to be copied over to Solr
for use with the collection on Solr.

Is the schema.xml file used when the new collection is created on Solr?
In other words, should the Nutch schema.xml be copied to Solr first and the
collection created afterwards, or should the collection be created with the
default schema.xml that ships with Solr and then have its schema.xml replaced
with the Nutch version?

If I try and copy the nutch schema.xml over to SOLR first and then create a
collection it throws the following error:

  fieldType 'pdates' not found in the schema
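
The only workaround I've found so far -- I haven't confirmed it's the recommended
approach -- is to add the missing date point types to the Nutch schema.xml before
creating the collection, along the lines of the definitions in the stock Solr 7
configsets:

  <fieldType name="pdate"  class="solr.DatePointField" docValues="true"/>
  <fieldType name="pdates" class="solr.DatePointField" docValues="true" multiValued="true"/>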

Thanks!

Best,

Dave

-- 
*Fig Leaf Software, Inc.* 
https://www.figleaf.com/ 
  

Full-Service Solutions Integrator








Configuring Exchanges

2019-03-04 Thread Dave Beckstrom
Apparently exchanges.xml uses the JEXL language.  I am trying to get it
configured to choose the writer based on URL criteria.

For example, let's say that I have a page that is located at the following
url:

http://www.somedomain.com/somedir/index.html

I want to select writer "indexer_solr_1" if the url for the page contains
the text "somedir" in the path.  I tried the following and it doesn't
work.  I also tried with "url" instead of "host" as the field being checked.


  
  


  
    

Any suggestions?

Thanks!




Best,

Dave Beckstrom
*Fig Leaf Software* <http://www.figleaf.com/> | "We've Got You Covered"
*Service-Disabled Veteran-Owned Small Business (SDVOSB)*
763-323-3499
dbeckst...@figleaf.com

-- 
*Fig Leaf Software, Inc.* 
https://www.figleaf.com/ 
<https://www.figleaf.com/>  

Full-Service Solutions Integrator








JEXL and Exchanges

2019-03-05 Thread Dave Beckstrom
Ryan and Roannel,

Thank you guys so much for your replies.  I didn't realize it but I was not
seeing all of the emails from you.

Roannel you sent some really helpful replies that never came in as an
email.  I found your replies when I browsed the web-based archives on the
apache site.   I wanted to make sure I thanked you for your help!!!

I can't find one example of an exchanges.xml other than what ships with
Nutch.   I'm really flying blind trying to get the exchanges to work.  I
believe this may be the last item I need help with and then I'll have Nutch
working the way I need it to.  Any help you can offer would be GREATLY
appreciated.

Let's say I have a document that was crawled and the URL for the document
was as follows:

http://www.somedomain.com/news/englishnews/2018/this-is-my-news-article.cfm

Here is the expression I have coded in exchanges.xml:



That expression is not triggering.  As near as I can tell, "=~" is the
"contains" operator: the idea being that if the url contains "englishnews"
then this expression should trigger.  I believe the slashes around
"englishnews" make it function as a regular expression, which should
evaluate to true, rather than a string compare.

If anyone can help get me past this final road block I would greatly
appreciate the help!  I spent an entire day on this yesterday and got
nowhere.

Thank you!

Dave

-- 
*Fig Leaf Software, Inc.* 
https://www.figleaf.com/ 
  

Full-Service Solutions Integrator








Re: JEXL and Exchanges

2019-03-05 Thread Dave Beckstrom
Hi Sebastian,

Thank you sir!

Two things you provided solved the problem for me!  One was the correct
syntax for the regex, and the other was the info on the indexchecker
command.  Part of what I was dealing with was not having much to go on when
debugging, and that command helped a lot!

In addition, the following line gave me an important clue:

-Dplugin.includes='exchange-jexl|protocol-okhttp|parse-html|indexer-solr|index-(basic|more)'

I realized that I did not have exchange-jexl listed as a plug-in to
include via my nutch-site.xml config file.  I'd never have figured that out
without the clue you provided.
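
In case it helps anyone else, the fix boils down to making sure exchange-jexl
appears in the plugin.includes regex in nutch-site.xml.  The value below just
mirrors the plugin list from the indexchecker example; the real list in your
config will likely be longer:

<property>
  <name>plugin.includes</name>
  <!-- exact plugin list depends on your setup; the key part is exchange-jexl -->
  <value>exchange-jexl|protocol-okhttp|parse-html|indexer-solr|index-(basic|more)</value>
</property>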

The exchanges are working, content is going into the right collections,
life is good!
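
For the archives, the working exchange ends up shaped like the sample
exchanges.xml that ships with Nutch.  I'm reproducing it from memory with the
namespace attributes trimmed, so double-check element and class names against
the stock file; only the expr param and the writer id are specific to my setup:

<exchanges>
  <exchange id="exchange_jexl_1" class="org.apache.nutch.exchange.jexl.JexlExchange">
    <writers>
      <writer id="indexer_solr_1" />
    </writers>
    <params>
      <param name="expr" value="doc.getFieldValue('url')=~'.*/englishnews/.*'" />
    </params>
  </exchange>
</exchanges>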

Thank you again!

Best,

Dave Beckstrom
*Fig Leaf Software* <http://www.figleaf.com/> | "We've Got You Covered"
*Service-Disabled Veteran-Owned Small Business (SDVOSB)*
763-323-3499
dbeckst...@figleaf.com


On Tue, Mar 5, 2019 at 12:44 PM Sebastian Nagel
 wrote:

> Hi Dave,
>
> I'm by no means an expert on the JEXL syntax (cf.
> http://commons.apache.org/proper/commons-jexl/reference/syntax.html)
> but after a few trials the expression must be
>
>  doc.getFieldValue('url')=~'.*/englishnews/.*'
>
> It's easy to test using the indexchecker, e.g.
>  % bin/nutch indexchecker
>
> -Dplugin.includes='exchange-jexl|protocol-okhttp|parse-html|indexer-solr|index-(basic|more)'
> -DdoIndex=true   http://...
>
> If you want to improve the Wiki page
>https://wiki.apache.org/nutch/Exchanges
> we're happy to grant you write access to the wiki, see
>https://wiki.apache.org/nutch/
>
> Best,
> Sebastian
>
>
> On 3/5/19 4:06 PM, Dave Beckstrom wrote:
> > Ryan and Roannel,
> >
> > Thank you guys so much for your replies.  I didn't realize it but I was
> not
> > seeing all of the emails from you.
> >
> > Roannel you sent some really helpful replies that never came in as an
> > email.  I found your replies when I browsed the web-based archives on the
> > apache site.   I wanted to make sure I thanked you for your help!!!
> >
> > I can't find one example of an exchanges.xml other than what ships with
> > Nutch.   I'm really in the blind trying to get the exchanges to work.  I
> > believe this may be the last item I need help with and then I'll have
> Nutch
> > working the way I need it to.  Any help you can offer would be GREATLY
> > appreciated.
> >
> > Let's say I have a document that was crawled and the URL for the document
> > was as follows:
> >
> >
> http://www.somedomain.com/news/englishnews/2018/this-is-my-news-article.cfm
> >
> > Here is the expression I have coded in exchanges.xml:
> >
> > 
> >
> > That expression is not triggering.  As near as I can tell the "=~" is the
> > "contains" expression.  The idea being if the url contains "englishnews"
> > then this expression should trigger.  I believe the slashes around
> > "englishnews" makes it function as a regular expression, which should
> > evaluate to true, rather then a string compare.
> >
> > If anyone can help get me past this final road block I would greatly
> > appreciate the help!  I spent an entire day on this yesterday and got
> > nowhere.
> >
> > Thank you!
> >
> > Dave
> >
>
>

-- 
*Fig Leaf Software, Inc.* 
https://www.figleaf.com/ 
<https://www.figleaf.com/>  

Full-Service Solutions Integrator








parser.html.NodesToExclude

2019-09-12 Thread Dave Beckstrom
Hi All,

I'm running NUTCH 1.15.

In my nutch-site.xml I configured the parameters below.  Under
parser.html.NodesToExclude I'm telling it not to index "div id=sidebar" or
"div id=footer", and yet it continues to index those regions of the page.

Does anyone have suggestions on why this isn't working and what I should do
to resolve this?

Thank you!





<property>
  <name>tika.extractor</name>
  <value>boilerpipe</value>
  <description>
  Which text extraction algorithm to use. Valid values are: boilerpipe or none.
  </description>
</property>

<property>
  <name>tika.extractor.boilerpipe.algorithm</name>
  <value>ArticleExtractor</value>
  <description>
  Which Boilerpipe algorithm to use. Valid values are: DefaultExtractor,
  ArticleExtractor or CanolaExtractor.
  </description>
</property>

<property>
  <name>parser.html.NodesToExclude</name>
  <value>div;id;sidebar|div;id;footer</value>
  <description>
  A list of nodes whose content will not be indexed separated by "|".
  Use this to tell the HTML parser to ignore, for example, site navigation text.

  Each node has three elements, separated by semi-colon:
  the first one is the tag name,
  the second one the attribute name,
  the third one the value of the attribute.

  Example: table;summary;header|div;id;navigation

  Note that nodes with these attributes, and their children, will be
  silently ignored by the parser so verify the indexed content
  with Luke to confirm results.
  </description>
</property>




Regards,

Dave Beckstrom
Technical Delivery Manager / Senior Developer
em: dbeckst...@collectivefls.com 
ph: 763.323.3499

-- 
*Fig Leaf Software is now Collective FLS, Inc.*
*Collective FLS, Inc.* 

https://www.collectivefls.com/ <https://www.collectivefls.com/> 





Re: Injection from webservice

2019-09-16 Thread Dave Beckstrom
Or use a scheduled wget job to pull them from the remote server and store
them on a path that Nutch can access locally.
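
Something along these lines from a scheduled job would do it (the URL and
paths here are made up for illustration):

  # pull the remote seed list onto local disk, then inject from that directory
  wget -q -O /opt/nutch/urls/seed.txt https://example.org/seed
  bin/nutch inject crawl/crawldb /opt/nutch/urls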

Regards,

Dave Beckstrom
Technical Delivery Manager / Senior Developer
em: dbeckst...@collectivefls.com 
ph: 763.323.3499


On Mon, Sep 16, 2019 at 12:14 PM Jorge Betancourt <
betancourt.jo...@gmail.com> wrote:

> Hi Roannel,
>
> The current implementation of the injector only accepts a path (actually an
> org.apache.hadoop.fs.Path) this means that there is no way to feed an URL
> directly unless you download the content first.
>
> If you use the REST API you can send the seed file using the API endpoint.
> Otherwise, you could write your own injector with the proper logic to deal
> with a list of URLs coming from an URL.
>
> The REST API implementation just writes the content in the expected format
> (
>
> https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/service/resources/SeedResource.java#L92-L113
> )
>
> Best Regards,
> Jorge
>
> On Mon, Sep 16, 2019 at 4:59 PM Roannel Fernandez Hernandez <
> roan...@uci.cu>
> wrote:
>
> > Hi folks,
> >
> > Is there any way in Nutch 1.15 to inject a remote seed file (accessible
> > via http or https)?
> >
> > I mean this, for instance:
> >
> > bin/nutch inject crawl http://example.org/seed
> >
> > Regards
> > 1519-2019: Aniversario 500 de la Villa de San Cristóbal de La Habana
> > Por La Habana, lo más grande. #Habana500 #UCIxHabana500
> >
> >
>

-- 
*Fig Leaf Software is now Collective FLS, Inc.*
*Collective FLS, Inc.* 

https://www.collectivefls.com/ <https://www.collectivefls.com/> 





Nutch not crawling all pages

2019-10-30 Thread Dave Beckstrom
Hi Everyone,

I googled and researched and I am not finding any solutions.  I'm hoping
someone here can help.

I have txt files with about 50,000 seed urls that are fed to Nutch for
crawling and then indexing in SOLR.  However, it will not index more than
about 39,000 pages no matter what I do.   The robots.txt file gives Nutch
access to the entire site.

This is a snippet of the last Nutch run:

Generator: starting at 2019-10-30 14:44:38
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: false
Generator: normalizing: true
Generator: topN: 8
Generator: 0 records selected for fetching, exiting ...
Generate returned 1 (no new segments created)
Escaping loop: no more URLs to fetch now

I ran that crawl about 5 or 6 times.  It seems to index about 6,000 pages
per run.  I planned to keep running it until it hit the 50,000+ page mark,
which would indicate that all of the pages were indexed.  On that last run it
just ended without crawling anything more.

Below are some of the potentially relevant config settings.  I removed the
"description" for brevity.


<property>
  <name>http.content.limit</name>
  <value>-1</value>
</property>

<property>
  <name>db.ignore.external.links</name>
  <value>true</value>
</property>

<property>
  <name>db.ignore.external.links.mode</name>
  <value>byDomain</value>
</property>

<property>
  <name>db.ignore.internal.links</name>
  <value>false</value>
</property>

<property>
  <name>db.update.additions.allowed</name>
  <value>true</value>
</property>

<property>
  <name>db.max.outlinks.per.page</name>
  <value>-1</value>
</property>

<property>
  <name>db.injector.overwrite</name>
  <value>true</value>
</property>

Anyone have any suggestions?  It's odd that when you give Nutch a specific
list of urls to be crawled, it wouldn't crawl all of them.

I appreciate any help you can offer.   Thank you!

-- 
*Fig Leaf Software is now Collective FLS, Inc.*
*Collective FLS, Inc.* 

https://www.collectivefls.com/  





Re: Nutch not crawling all pages

2019-10-30 Thread Dave Beckstrom
You guys were right.  We have one seed URL file which lists urls to 10
pages.  Each of those 10 pages has roughly 5,000 urls to be crawled.

The links to 3 out of the 10 pages were wrong (missing) -- which accounts
for roughly 15,000+ urls that were missing.  I didn't catch it because
there are multiple servers involved and everything was correct on the
server I was working on but on one of the other servers they were wrong.

I didn't set that part up so that was a mistake someone made before my
time.  But you guys clued me in to it.

Thank you!




On Wed, Oct 30, 2019 at 6:11 PM Markus Jelsma 
wrote:

> Hello,
>
> The CrawlDB does not lie, but you are two pages short of being indexed.
> That can happen for various different reasons and is hard to debug. But
> Bruno's point is valid. If you inject 50k but end up with 39k in the DB,
> this means some are filtered or multiple URLs were normalized back to the
> same.
>
> My experience with websites generating valid URLs only, is that this
> assumption is almost never true. In our case, out of the thousands of sites
> maybe only a few of those with just a dozen URLs are free from errors, e.g.
> not having ambiguous URLs, redirects or 404s or otherwise bogus entries.
>
> Markus
>
>
> -Original message-
> > From:Bruno Osiek 
> > Sent: Wednesday 30th October 2019 23:51
> > To: user@nutch.apache.org
> > Subject: Re: Nutch not crawling all pages
> >
> > What is the output of the inject command, i.e., when you inject the 5
> > seeds just before generating the first segment?
> >
> > On Wed, Oct 30, 2019 at 3:18 PM Dave Beckstrom <
> dbeckst...@collectivefls.com>
> > wrote:
> >
> > > Hi Markus,
> > >
> > > Thank you so much for the reply and the help!  The seed URL list is
> > > generated from a CMS.  I'm doubtful that many of the urls would be for
> > > redirects or missing pages as the CMS only writes out the urls for
> valid
> > > pages.  It's got me stumped!
> > >
> > > Here is the result of the readdb.  Not sure why the dates are wonky.
> The
> > > date on the server is correct.  SOLR shows 39148 pages.
> > >
> > > TOTAL urls: 39164
> > > shortest fetch interval:30 days, 00:00:00
> > > avg fetch interval: 30 days, 00:07:10
> > > longest fetch interval: 45 days, 00:00:00
> > > earliest fetch time:Mon Nov 25 07:08:00 EST 2019
> > > avg of fetch times: Wed Nov 27 18:46:00 EST 2019
> > > latest fetch time:  Sat Dec 14 08:18:00 EST 2019
> > > retry 0:39164
> > > score quantile 0.01:1.8460402498021722E-4
> > > score quantile 0.05:1.8460402498021722E-4
> > > score quantile 0.1: 1.8460402498021722E-4
> > > score quantile 0.2: 1.8642803479451686E-4
> > > score quantile 0.25:1.8642803479451686E-4
> > > score quantile 0.3: 1.960784284165129E-4
> > > score quantile 0.4: 1.9663813566079454E-4
> > > score quantile 0.5: 2.0251113164704293E-4
> > > score quantile 0.6: 2.037905069300905E-4
> > > score quantile 0.7: 2.1473052038345486E-4
> > > score quantile 0.75:2.1473052038345486E-4
> > > score quantile 0.8: 2.172968233935535E-4
> > > score quantile 0.9: 2.429802336152917E-4
> > > score quantile 0.95:2.4354603374376893E-4
> > > score quantile 0.99:2.542474209925616E-4
> > > min score:  3.0443254217971116E-5
> > > avg score:  7.001118352666182E-4
> > > max score:  1.3120110034942627
> > > status 2 (db_fetched):  39150
> > > status 3 (db_gone): 13
> > > status 4 (db_redir_temp):   1
> > > CrawlDb statistics: done
> > >
> > >
> > >
> > > On Wed, Oct 30, 2019 at 4:01 PM Markus Jelsma <
> markus.jel...@openindex.io>
> > > wrote:
> > >
> > > > Hello Dave,
> > > >
> > > > First you should check the CrawlDB using readdb -stats. My bet is
> that
> > > > your set contains some redirects and gone (404), or transient
> errors. The
> > > > number for fetched and notModified added up should be about the same
> as
> > > the
> > > > number of documents indexed.
> > > >
> > > > Regards,
> > > > Markus
> > > >
> > > >
> > > >
> > > > -Original message-
> > > > > From:Dave Beckstrom 
> > > > > Sent: Wednesday 30th October 2019 20:00
> > > > > To: user@nutch.apache.org
> > > > > Subject:

Re: Nutch not crawling all pages

2019-10-30 Thread Dave Beckstrom
Hi Markus,

Thank you so much for the reply and the help!  The seed URL list is
generated from a CMS.  I'm doubtful that many of the urls would be for
redirects or missing pages as the CMS only writes out the urls for valid
pages.  It's got me stumped!

Here is the result of the readdb.  Not sure why the dates are wonky.  The
date on the server is correct.  SOLR shows 39148 pages.

TOTAL urls: 39164
shortest fetch interval:30 days, 00:00:00
avg fetch interval: 30 days, 00:07:10
longest fetch interval: 45 days, 00:00:00
earliest fetch time:Mon Nov 25 07:08:00 EST 2019
avg of fetch times: Wed Nov 27 18:46:00 EST 2019
latest fetch time:  Sat Dec 14 08:18:00 EST 2019
retry 0:39164
score quantile 0.01:1.8460402498021722E-4
score quantile 0.05:1.8460402498021722E-4
score quantile 0.1: 1.8460402498021722E-4
score quantile 0.2: 1.8642803479451686E-4
score quantile 0.25:1.8642803479451686E-4
score quantile 0.3: 1.960784284165129E-4
score quantile 0.4: 1.9663813566079454E-4
score quantile 0.5: 2.0251113164704293E-4
score quantile 0.6: 2.037905069300905E-4
score quantile 0.7: 2.1473052038345486E-4
score quantile 0.75:2.1473052038345486E-4
score quantile 0.8: 2.172968233935535E-4
score quantile 0.9: 2.429802336152917E-4
score quantile 0.95:2.4354603374376893E-4
score quantile 0.99:2.542474209925616E-4
min score:  3.0443254217971116E-5
avg score:  7.001118352666182E-4
max score:  1.3120110034942627
status 2 (db_fetched):  39150
status 3 (db_gone): 13
status 4 (db_redir_temp):   1
CrawlDb statistics: done
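
In case it matters, that output came from the CrawlDb stats dump -- the crawldb
path here is just how it's laid out on my server:

  bin/nutch readdb crawl/crawldb -stats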



On Wed, Oct 30, 2019 at 4:01 PM Markus Jelsma 
wrote:

> Hello Dave,
>
> First you should check the CrawlDB using readdb -stats. My bet is that
> your set contains some redirects and gone (404), or transient errors. The
> number for fetched and notModified added up should be about the same as the
> number of documents indexed.
>
> Regards,
> Markus
>
>
>
> -Original message-
> > From:Dave Beckstrom 
> > Sent: Wednesday 30th October 2019 20:00
> > To: user@nutch.apache.org
> > Subject: Nutch not crawling all pages
> >
> > Hi Everyone,
> >
> > I googled and researched and I am not finding any solutions.  I'm hoping
> > someone here can help.
> >
> > I have txt files with about 50,000 seed urls that are fed to Nutch for
> > crawling and then indexing in SOLR.  However, it will not index more than
> > about 39,000 pages no matter what I do.   The robots.txt file gives Nutch
> > access to the entire site.
> >
> > This is a snippet of the last Nutch run:
> >
> > Generator: starting at 2019-10-30 14:44:38
> > Generator: Selecting best-scoring urls due for fetch.
> > Generator: filtering: false
> > Generator: normalizing: true
> > Generator: topN: 8
> > Generator: 0 records selected for fetching, exiting ...
> > Generate returned 1 (no new segments created)
> > Escaping loop: no more URLs to fetch now
> >
> > I ran that crawl about 5 or 6  times.  It seems to index about 6,000
> pages
> > per run.  I planned to keep running it until it hit the 50,000+ page mark
> > which would indicate that all of the pages were indexed.  That last run
> it
> > just ended without crawling anything more.
> >
> > Below are some of the potentially relevant config settings.  I removed
> the
> > "description" for brevity.
> >
> > 
> >   http.content.limit
> >   -1
> > 
> > 
> >  db.ignore.external.links
> >  true
> > 
> > 
> >  db.ignore.external.links.mode
> >  byDomain
> > 
> > 
> >   db.ignore.internal.links
> >   false
> > 
> > 
> >   db.update.additions.allowed
> >   true
> >  
> >  
> >  db.max.outlinks.per.page
> >   -1
> >  
> >  
> >   db.injector.overwrite
> >   true
> >  
> >
> > Anyone have any suggestions?  Its odd that when you give nutch a specific
> > list of urls to be crawled that it wouldn't crawl all of them.
> >
> > I appreciate any help you can offer.   Thank you!
> >
> > --
> > *Fig Leaf Software is now Collective FLS, Inc.*
> > *
> > *
> > *Collective FLS, Inc.*
> >
> > https://www.collectivefls.com/ 
> >
> >
> >
> >
>

-- 
*Fig Leaf Software is now Collective FLS, Inc.*
*Collective FLS, Inc.* 

https://www.collectivefls.com/  





Crawl Command Question

2019-10-19 Thread Dave Beckstrom
Hi Everyone,

Reading the help for the nutch crawl script, I have a question.  If I run
the crawl script without the -i parameter, does that mean the crawl will
run and complete without updating SOLR?  I need to crawl pages without
updating SOLR.  Then I'll use solrindex to push the crawled content into
SOLR later, when I'm ready.



Usage: crawl [-i|--index] [-D "key=value"] [-s <seed_dir>] <crawl_dir> <num_rounds>
  -i|--index     Indexes crawl results into a configured indexer
  -D...          A Java property to pass to Nutch calls
  -s <seed_dir>  Directory in which to look for a seeds file
  <crawl_dir>    Directory where the crawl/link/segments dirs are saved
  <num_rounds>   The number of rounds to run this crawl for
  Example: bin/crawl -i -s urls/ TestCrawl/ 2
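
To spell out what I'm planning, in case my reading is wrong -- the paths are just
examples, and in 1.15 the Solr URL comes from the indexer-solr writer configuration
(index-writers.xml) rather than the command line:

  # crawl only: no -i, so no indexing step runs
  bin/crawl -s urls/ TestCrawl/ 2

  # later, push the crawled segments into Solr
  bin/nutch index TestCrawl/crawldb -linkdb TestCrawl/linkdb TestCrawl/segments/*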

-- 
*Fig Leaf Software is now Collective FLS, Inc.*
*Collective FLS, Inc.* 

https://www.collectivefls.com/  





Excluding individual pages?

2019-10-10 Thread Dave Beckstrom
Hi Everyone,

I searched and didn't find an answer.

Nutch is indexing the content of the page that has the seed urls in it and
then that page shows up in the SOLR search results.   We don't want that to
happen.

Is there a way to have nutch crawl the seed url page but not push that page
into SOLR?  If not, is there a way to have a particular page excluded from
the SOLR search results?  Either way I'm trying to not have that page show
in search results.
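
If there is no clean way to do it in Nutch, the fallback I'm considering is simply
deleting that one document from Solr after each crawl, along these lines (the host,
collection name and seed page URL are placeholders, and it assumes the stock Nutch
"url" field):

  curl 'http://localhost:8983/solr/mycollection/update?commit=true' \
    -H 'Content-Type: text/xml' \
    --data-binary '<delete><query>url:"http://www.example.com/seedlist.html"</query></delete>'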

Thank you!

Dave

-- 
*Fig Leaf Software is now Collective FLS, Inc.*
*Collective FLS, Inc.* 

https://www.collectivefls.com/  





Re: Nutch excludeNodes Patch

2019-10-10 Thread Dave Beckstrom
Markus,

Thank you so much for the reply!

I made the change to parse-plugins.xml and the parse-html plug-in is being
called now.  The excludeNodes patch still didn't do what I needed, so I
switched to the blacklist-whitelist plug-in, and I've got it working thanks
to your help!
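
For anyone searching the archives later, the parse-plugins.xml change was along
these lines -- paraphrased from the stock file, so check yours before copying:

  <!-- route (x)html to parse-html instead of parse-tika -->
  <mimeType name="text/html">
    <plugin id="parse-html" />
  </mimeType>
  <mimeType name="application/xhtml+xml">
    <plugin id="parse-html" />
  </mimeType>

  <aliases>
    <alias name="parse-html"
           extension-id="org.apache.nutch.parse.html.HtmlParser" />
  </aliases>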

 Dave

On Wed, Oct 9, 2019 at 4:00 PM Markus Jelsma 
wrote:

> Hello Dave,
>
> You have both TikaParser and HtmlParser enabled. This probably means you
> never use HtmlParser but always TikaParser. You can instruct Nutch via
> parse-plugins.xml which Parser impl. to choose based on MIME-type. If you
> select HtmlParser for html and xhtml, Nutch should use HtmlParser instead.
>
> Regards,
> Markus
>
> -Original message-
> > From:Dave Beckstrom 
> > Sent: Wednesday 9th October 2019 22:10
> > To: user@nutch.apache.org
> > Subject: Nutch excludeNodes Patch
> >
> > Hi Everyone!
> >
> >
> > We are running Nutch 1.15.
> >
> > We are trying to implement the nutch-585-excludeNodes.patch described on:
> > https://issues.apache.org/jira/browse/NUTCH-585
> >
> > It's acting like it's not running.  We don't get an error when the crawl
> > runs, no errors in the hadoop logs, it just doesn't exclude the content
> > from the page.
> >
> > We installed it in the directory plugins>parse-html
> >
> > We added the following to our nutch-site.xml to exclude div id=sidebar
> >
> > 
> >   parser.html.NodesToExclude
> >   div;id;sidebar
> >   
> >   A list of nodes whose content will not be indexed separated by "|".
> Use
> > this to tell
> >   the HTML parser to ignore, for example, site navigation text.
> >   Each node has three elements: the first one is the tag name, the second
> > one the
> >   attribute name, the third one the value of the attribute.
> >   Note that nodes with these attributes, and their children, will be
> > silently ignored by the parser
> >   so verify the indexed content with Luke to confirm results.
> >   
> > 
> >
> > Here is our plugin.includes property from nutch-site.xml
> >
> >  
> >   plugin.includes
> >
> >
> exchange-jexl|protocol-http|urlfilter-regex|parse-(html|tika|metatags)|index-(basic|anchor|metadata)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|indexer-solr|urlnormalizer-(pass|regex|basic)
> >plugins
> >   
> >  
> >
> > One question I have is  would having Tika configured in nutch-site.xml
> like
> > the following  cause any problems with the parse-html plugin not running?
> >
> > 
> >   tika.extractor
> >   boilerpipe
> >   
> >   Which text extraction algorithm to use. Valid values are: boilerpipe or
> > none.
> >   
> > 
> >  
> > 
> >   tika.extractor.boilerpipe.algorithm
> >   ArticleExtractor
> >   
> >   Which Boilerpipe algorithm to use. Valid values are: DefaultExtractor,
> > ArticleExtractor
> >   or CanolaExtractor.
> >   
> > 
> >
> > We don't have a lot to go on to debug the issue.  The plugin has logic to
> > enable logging:
> >
> > if (LOG.isTraceEnabled())
> > +LOG.trace("Stripping " + pNode.getNodeName() + "#" +
> > idNode.getNodeValue());
> >
> > But nothing shows in the log files when we crawl. I
> > updated log4j.properties setting these two values to TRACE thinking I had
> > to enable trace before the logging would work:
> >
> >  log4j.logger.org.apache.nutch.crawl.Crawl=TRACE,cmdstdout
> >  log4j.logger.org.apache.nutch.parse.html=TRACE,cmdstdout
> >
> > I reran the crawl and no logging occurred and of course the content  we
> > didn't want crawled and indexed is still showing up in SOLR.
> >
> > I could really use some help and suggestions!
> >
> > Thank you!
> >
> > Dave Beckstrom
> >
> > --
> > *Fig Leaf Software is now Collective FLS, Inc.*
> > *
> > *
> > *Collective FLS, Inc.*
> >
> > https://www.collectivefls.com/ <https://www.collectivefls.com/>
> >
> >
> >
> >
>

-- 
*Fig Leaf Software is now Collective FLS, Inc.*
*Collective FLS, Inc.* 

https://www.collectivefls.com/ <https://www.collectivefls.com/> 





Nutch excludeNodes Patch

2019-10-09 Thread Dave Beckstrom
Hi Everyone!


We are running Nutch 1.15.

We are trying to implement the nutch-585-excludeNodes.patch described on:
https://issues.apache.org/jira/browse/NUTCH-585

It's acting like it's not running.  We don't get an error when the crawl
runs, no errors in the hadoop logs, it just doesn't exclude the content
from the page.

We installed it in the plugins/parse-html directory.

We added the following to our nutch-site.xml to exclude div id=sidebar


<property>
  <name>parser.html.NodesToExclude</name>
  <value>div;id;sidebar</value>
  <description>
  A list of nodes whose content will not be indexed separated by "|".  Use this to tell
  the HTML parser to ignore, for example, site navigation text.
  Each node has three elements: the first one is the tag name, the second one the
  attribute name, the third one the value of the attribute.
  Note that nodes with these attributes, and their children, will be silently ignored
  by the parser so verify the indexed content with Luke to confirm results.
  </description>
</property>

Here is our plugin.includes property from nutch-site.xml

 
<property>
  <name>plugin.includes</name>
  <value>exchange-jexl|protocol-http|urlfilter-regex|parse-(html|tika|metatags)|index-(basic|anchor|metadata)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|indexer-solr|urlnormalizer-(pass|regex|basic)</value>
</property>

One question I have is  would having Tika configured in nutch-site.xml like
the following  cause any problems with the parse-html plugin not running?


<property>
  <name>tika.extractor</name>
  <value>boilerpipe</value>
  <description>
  Which text extraction algorithm to use. Valid values are: boilerpipe or none.
  </description>
</property>

<property>
  <name>tika.extractor.boilerpipe.algorithm</name>
  <value>ArticleExtractor</value>
  <description>
  Which Boilerpipe algorithm to use. Valid values are: DefaultExtractor,
  ArticleExtractor or CanolaExtractor.
  </description>
</property>


We don't have a lot to go on to debug the issue.  The plugin has logic to
enable logging:

if (LOG.isTraceEnabled())
  LOG.trace("Stripping " + pNode.getNodeName() + "#" + idNode.getNodeValue());

But nothing shows in the log files when we crawl. I
updated log4j.properties setting these two values to TRACE thinking I had
to enable trace before the logging would work:

 log4j.logger.org.apache.nutch.crawl.Crawl=TRACE,cmdstdout
 log4j.logger.org.apache.nutch.parse.html=TRACE,cmdstdout

I reran the crawl and no logging occurred and of course the content  we
didn't want crawled and indexed is still showing up in SOLR.

I could really use some help and suggestions!

Thank you!

Dave Beckstrom

-- 
*Fig Leaf Software is now Collective FLS, Inc.*
*Collective FLS, Inc.* 

https://www.collectivefls.com/ <https://www.collectivefls.com/> 





metatags missing with parse-html

2019-10-11 Thread Dave Beckstrom
Hi Everyone,

It seems like I take 1 step forward and 2 steps backwards.

I was using parse-tika and I needed to change to parse-html in order to use
a plug-in for excluding content such as headers and footers.

I have the excludes working with the plug-in.  But now I see that all of
the metatags are missing from solr.  The metatag fields are defined in SOLR
but not populated.

Metatags were working prior to the change to parse-html.  What would
explain the metatags not being indexed when the configuration
parameters didn't change?  Is there some other setting for parse-html that
I need to look into?
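
In the meantime I'm going to sanity-check what the parse and index filters actually
produce with the indexchecker tool mentioned earlier on the list, something like
this (plugin list abbreviated, URL is a placeholder):

  bin/nutch indexchecker \
    -Dplugin.includes='protocol-http|parse-(html|metatags)|index-(basic|metadata)' \
    -Dmetatags.names='*' \
    -Dindex.parse.md='metatag.language,metatag.subject,metatag.category' \
    http://www.example.com/somepage.html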

Thanks!


 
<property>
  <name>plugin.includes</name>
  <value>exchange-jexl|protocol-http|urlfilter-regex|parse-(html|tika|metatags)|index-(basic|anchor|metadata)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|indexer-solr|urlnormalizer-(pass|regex|basic)|index-blacklist-whitelist</value>
</property>

<property>
  <name>metatags.names</name>
  <value>*</value>
</property>

<property>
  <name>index.parse.md</name>
  <value>metatag.language,metatag.subject,metatag.category</value>
</property>


-- 
*Fig Leaf Software is now Collective FLS, Inc.*
*Collective FLS, Inc.* 

https://www.collectivefls.com/