Nutch failing on SOLR text field

2019-03-26 Thread Dave Beckstrom
Hi Everyone, This is probably more of a SOLR question but I'm hoping someone might be able to help. I'm using Nutch to crawl and index some content. It failed on a SOLR field defined as a text field when it was trying to insert the following value for the field:

Re: Nutch failing on SOLR text field

2019-03-26 Thread Dave Beckstrom
your Solr collection. Best regards, Jorge. On Tue, Mar 26, 2019 at 9:41 PM Dave Beckstrom wrote: Hi Everyone, This is probably more of a SOLR question but I'm hoping someone might be able to help. I'm using Nutch to

Error Updating Solr

2019-02-28 Thread Dave Beckstrom
I'm getting much closer to getting Nutch and SOLR to play well together. (Ryan - thanks for your help on my last question. Your suggestion fixed that issue) What is happening now is that Nutch finishes crawling, then calls the index-writer to update solr. The SOLR update fails with this

Configuring Nutch to work with Solr?

2019-02-27 Thread Dave Beckstrom
Hi Everyone, I'm a developer and I am installing Nutch with Solr for a client. I've been reading everything I can get my hands on and I am just not finding the answers to some questions. I'm really hoping you can help! I have Nutch 1.15 and Solr 7.3.1 installed on a Windows server. Those
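For readers following this thread: in Nutch 1.15 the Solr connection is configured in conf/index-writers.xml rather than through a single solr.server.url property. A minimal sketch of a writer entry, assuming a Solr core named "nutch" on localhost (the id, URL, and core name are placeholders, not a verified config):

```xml
<writer id="indexer_solr_1"
        class="org.apache.nutch.indexwriter.solr.SolrIndexWriter">
  <parameters>
    <!-- "http" talks to a standalone Solr; "cloud" would target SolrCloud -->
    <param name="type" value="http"/>
    <param name="url" value="http://localhost:8983/solr/nutch"/>
    <param name="commitSize" value="1000"/>
  </parameters>
  <mapping>
    <copy/>
    <rename/>
    <remove/>
  </mapping>
</writer>
```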

Configuring Exchanges

2019-03-04 Thread Dave Beckstrom
olr_1" if the url for the page contains the text "somedir" in the path. I tried the following and it doesn't work. I also tried with "url" instead of "host" as the field being checked. Any suggestions? Thanks! Best,
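For anyone hitting the same problem: in Nutch 1.15, routing documents to a particular index writer is done in conf/exchanges.xml, where the default exchange class evaluates a JEXL expression against each document. A sketch under those assumptions (the ids and the "somedir" test are illustrative; the writer id must match one defined in index-writers.xml):

```xml
<exchanges>
  <exchange id="exchange_solr_1" class="default">
    <writers>
      <!-- send matching documents to this writer from index-writers.xml -->
      <writer id="indexer_solr_1"/>
    </writers>
    <params>
      <!-- JEXL: route documents whose url field contains "somedir" -->
      <param name="expr" value="doc.getFieldValue('url').contains('somedir')"/>
    </params>
  </exchange>
</exchanges>
```

Note the expression tests the document's "url" field, not "host", which may explain why a host-based check did not match a path segment.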

JEXL and Exchanges

2019-03-05 Thread Dave Beckstrom
Ryan and Roannel, Thank you guys so much for your replies. I didn't realize it but I was not seeing all of the emails from you. Roannel you sent some really helpful replies that never came in as an email. I found your replies when I browsed the web-based archives on the apache site. I wanted

Re: JEXL and Exchanges

2019-03-05 Thread Dave Beckstrom
figured that out without the clue you provided. The exchanges are working, content is going into the right collections, life is good! Thank you again! Best, Dave Beckstrom *Fig Leaf Software* <http://www.figleaf.com/> | "We've Got You Covered" *Service-Disabled Veteran-Owned Small

parser.html.NodesToExclud

2019-09-12 Thread Dave Beckstrom
at nodes with these attributes, and their children, will be silently ignored by the parser so verify the indexed content with Luke to confirm results. Regards, Dave Beckstrom Technical Delivery Manager / Senior Developer em: dbeckst...@collectivefls.com ph: 763.323.3499

Re: Injection from webservice

2019-09-16 Thread Dave Beckstrom
Or use a scheduled wget job to pull them from the remote server and store them on a path that Nutch can access locally. On Mon, Sep 16, 2019 at 12:14 PM Jorge Betancourt
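The scheduled-wget approach above can be sketched as a crontab entry; the URL and paths below are placeholders for wherever the CMS exports the seed list and wherever Nutch expects it:

```shell
# Hypothetical crontab entry: every night at 2am, fetch the exported seed
# list and overwrite the local copy that Nutch's inject step reads.
0 2 * * * wget -q -O /opt/nutch/urls/seed.txt https://cms.example.com/export/seeds.txt
```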

Nutch not crawling all pages

2019-10-30 Thread Dave Beckstrom
Hi Everyone, I googled and researched and I am not finding any solutions. I'm hoping someone here can help. I have txt files with about 50,000 seed urls that are fed to Nutch for crawling and then indexing in SOLR. However, it will not index more than about 39,000 pages no matter what I do.

Re: Nutch not crawling all pages

2019-10-30 Thread Dave Beckstrom
What is the output of the inject command, i.e., when you inject the 5 seeds just before generating the first segment? On Wed, Oct 30, 2019 at 3:18 PM Dave Beckstrom <dbeckst...@collectivefls.com> wrote:

Re: Nutch not crawling all pages

2019-10-30 Thread Dave Beckstrom
Hi Markus, Thank you so much for the reply and the help! The seed URL list is generated from a CMS. I'm doubtful that many of the urls would be for redirects or missing pages as the CMS only writes out the urls for valid pages. It's got me stumped! Here is the result of the readdb. Not sure
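The readdb output referenced here comes from the CrawlDb reader; a sketch of the commands involved, assuming the crawl directory layout from the standard crawl script (paths are placeholders):

```shell
# Summarize how the crawldb classified URLs (fetched, gone, redirected, ...)
bin/nutch readdb crawl/crawldb -stats

# Dump only the URLs in a suspect state, e.g. permanently missing pages,
# to see which of the ~50,000 seeds never made it into the index.
bin/nutch readdb crawl/crawldb -dump dump_gone -status db_gone
```

Comparing the db_fetched count against the seed count is usually the quickest way to see where the missing ~11,000 pages went.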

Crawl Command Question

2019-10-19 Thread Dave Beckstrom
Hi Everyone, Reading the help for the nutch crawl script, I have a question. If I run the crawl script without the -i parameter, does that mean the crawl will run and complete without updating SOLR? I need to crawl pages without updating SOLR. Then I'll use solrindex to push the crawled
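For the record, the crawl script only pushes to Solr when -i (--index) is passed, so omitting it does crawl without updating SOLR; indexing can then be run as a separate step. A sketch, with placeholder paths and a two-round crawl (in Nutch 1.15 the Solr target comes from index-writers.xml rather than a command-line URL):

```shell
# Crawl only: fetch, parse, and update the crawldb, but do not index.
bin/crawl -s urls/ crawl/ 2

# Later, index the crawled segments into Solr in a separate step.
bin/nutch index crawl/crawldb -linkdb crawl/linkdb -dir crawl/segments
```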

Excluding individual pages?

2019-10-10 Thread Dave Beckstrom
Hi Everyone, I searched and didn't find an answer. Nutch is indexing the content of the page that has the seed urls in it and then that page shows up in the SOLR search results. We don't want that to happen. Is there a way to have nutch crawl the seed url page but not push that page into

Re: Nutch excludeNodes Patch

2019-10-10 Thread Dave Beckstrom
don't have a lot to go on to debug the issue. The plugin has logic to enable logging: if (LOG.isTraceEnabled()) LOG.trace("Stripping " + pNode.getNodeName() + "#" + idNode.getNodeValue()); But nothing shows in the log files whe

Nutch excludeNodes Patch

2019-10-09 Thread Dave Beckstrom
LR. I could really use some help and suggestions! Thank you! Dave Beckstrom -- *Fig Leaf Software is now Collective FLS, Inc.* https://www.collectivefls.com/

metatags missing with parse-html

2019-10-11 Thread Dave Beckstrom
Hi Everyone, It seems like I take 1 step forward and 2 steps backwards. I was using parse-tika and I needed to change to parse-html in order to use a plug-in for excluding content such as headers and footers. I have the excludes working with the plug-in. But now I see that all of the metatags