Hi Everyone,
This is probably more of a SOLR question but I'm hoping someone might be
able to help. I'm using Nutch to crawl and index some content. It failed
on a SOLR field defined as a text field when it was trying to insert the
following value for the field:
> your Solr collection.
>
> Best regards,
> Jorge
>
> On Tue, Mar 26, 2019 at 9:41 PM Dave Beckstrom
> wrote:
>
> > Hi Everyone,
> >
> > This is probably more of a SOLR question but I'm hoping someone might be
> > able to help. I'm using Nutch to
I'm getting much closer to getting Nutch and SOLR to play well together.
(Ryan - thanks for your help on my last question. Your suggestion fixed
that issue)
What is happening now is that Nutch finishes crawling, then calls the
index-writer to update solr. The SOLR update fails with this
Hi Everyone,
I'm a developer and I am installing Nutch with Solr for a client. I've
been reading everything I can get my hands on and I am just not finding the
answers to some questions. I'm really hoping you can help!
I have Nutch 1.15 and Solr 7.3.1 installed on a Windows server. Those
olr_1" if the url for the page contains
the text "somedir" in the path. I tried the following and it doesn't
work. I also tried with "url" instead of "host" as the field being checked.
Any suggestions?
Thanks!
Best,
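For routing documents to a collection based on the URL, newer Nutch releases support an exchanges.xml next to index-writers.xml. Something along these lines might work, though the element names, the "jexl" class id, and the expression syntax here are from memory and should be checked against the exchanges documentation for your version:

```xml
<exchanges>
  <!-- Route documents whose URL contains "somedir" to the writer with
       id "indexer_solr_1" (assumed to match an id in index-writers.xml). -->
  <exchange id="exchange_somedir" class="jexl">
    <writers>
      <writer id="indexer_solr_1" />
    </writers>
    <params>
      <!-- JEXL expression; the exact substring-match syntax may differ. -->
      <param name="expr" value="doc.getFieldValue('url')=~'.*somedir.*'" />
    </params>
  </exchange>
</exchanges>
```

Matching on the full "url" field rather than "host" is what you want here, since "somedir" is part of the path, not the hostname.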
Ryan and Roannel,
Thank you guys so much for your replies. I didn't realize it but I was not
seeing all of the emails from you.
Roannel, you sent some really helpful replies that never came in as an
email. I found your replies when I browsed the web-based archives on the
Apache site. I wanted
figured
that out without the clue you provided.
The exchanges are working, content is going into the right collections,
life is good!
Thank you again!
Best,
Dave Beckstrom
*Fig Leaf Software* <http://www.figleaf.com/> | "We've Got You Covered"
*Service-Disabled Veteran-Owned Small
at nodes with these attributes, and their children, will be
silently ignored by the parser so verify the indexed content
with Luke to confirm results.
Regards,
Dave Beckstrom
Technical Delivery Manager / Senior Developer
em: dbeckst...@collectivefls.com
ph: 763.323.3499
Or use a scheduled wget job to pull them from the remote server and store
them on a path that Nutch can access locally.
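As a sketch of that scheduled-wget approach (the host, schedule, and local path below are made up; the wget options should be tuned to the site being mirrored):

```
# crontab entry: mirror the remote pages nightly at 02:00 into a local path
# /data/mirror is a hypothetical directory that the Nutch seed list points at
0 2 * * * wget --mirror --no-parent --quiet --directory-prefix=/data/mirror https://docs.example.com/
```

The seed URLs then reference the local copy (or a local web server fronting it) instead of the remote server.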
On Mon, Sep 16, 2019 at 12:14 PM Jorge Betancourt
Hi Everyone,
I googled and researched and I am not finding any solutions. I'm hoping
someone here can help.
I have txt files with about 50,000 seed urls that are fed to Nutch for
crawling and then indexing in SOLR. However, it will not index more than
about 39,000 pages no matter what I do.
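One common cause of a hard ceiling like that is the generator's topN limit multiplied by the number of crawl rounds (along with properties such as db.max.outlinks.per.page). The figures below are hypothetical, but the arithmetic shows how a cap in that range can arise:

```shell
# Hypothetical settings: 20 crawl rounds, 2000 URLs generated per round
rounds=20
topn=2000
# Upper bound on pages fetched across the whole crawl:
echo $((rounds * topn))
# Compare against the real counts with:
#   bin/nutch readdb crawl/crawldb -stats
```

If db_fetched stalls near rounds x topN, raising -topN on the crawl script (or adding rounds) is the first thing to try.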
>
> > What is the output of the inject command, i.e., when you inject the 5
> > seeds just before generating the first segment?
> >
> > On Wed, Oct 30, 2019 at 3:18 PM Dave Beckstrom <
> dbeckst...@collectivefls.com>
> > wrote:
> >
Hi Markus,
Thank you so much for the reply and the help! The seed URL list is
generated from a CMS. I'm doubtful that many of the urls would be for
redirects or missing pages as the CMS only writes out the urls for valid
pages. It's got me stumped!
Here is the result of the readdb. Not sure
Hi Everyone,
Reading the help for the nutch crawl script, I have a question. If I run
the crawl script without the -i parameter, does that mean the crawl will
run and complete without updating SOLR? I need to crawl pages without
updating SOLR. Then I'll use solrindex to push the crawled
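That is the usual pattern. As a sketch (the paths and round count are placeholders, and on recent 1.x releases the standalone indexing step is `bin/nutch index`, with `solrindex` kept only as a deprecated alias -- check the usage output of `bin/nutch` for your version):

```
# Crawl only: without -i/--index the script never invokes the index writers
bin/crawl -s urls/ crawl/ 3

# Later, push the crawled segments to Solr in a separate step
bin/nutch index crawl/crawldb -linkdb crawl/linkdb -dir crawl/segments/
```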
Hi Everyone,
I searched and didn't find an answer.
Nutch is indexing the content of the page that has the seed urls in it and
then that page shows up in the SOLR search results. We don't want that to
happen.
Is there a way to have nutch crawl the seed url page but not push that page
into
don't have a lot to go on to debug the issue. The plugin has logic to
> > enable logging:
> >
> > if (LOG.isTraceEnabled())
> >   LOG.trace("Stripping " + pNode.getNodeName() + "#" + idNode.getNodeValue());
> >
> > But nothing shows in the log files whe
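Trace output guarded by LOG.isTraceEnabled() only appears once the logger for that class is set to TRACE; in Nutch 1.x that is configured in conf/log4j.properties. The package name below is a placeholder for whatever package the plugin's class actually lives in:

```
# conf/log4j.properties -- enable TRACE for the stripping plugin
# (replace org.example.nutch.myplugin with the plugin's real package)
log4j.logger.org.example.nutch.myplugin=TRACE,cmdstdout
```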
LR.
I could really use some help and suggestions!
Thank you!
Dave Beckstrom
--
*Fig Leaf Software is now Collective FLS, Inc.*
*Collective FLS, Inc.*
https://www.collectivefls.com/ <https://www.collectivefls.com/>
Hi Everyone,
It seems like I take 1 step forward and 2 steps backwards.
I was using parse-tika and I needed to change to parse-html in order to use
a plug-in for excluding content such as headers and footers.
I have the excludes working with the plug-in. But now I see that all of
the metatags
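parse-html on its own does not carry metatags through to the index; in Nutch 1.x they are usually picked up by also enabling the parse-metatags and index-metadata plugins in nutch-site.xml. A sketch, with the property names as I recall them from the 1.x docs (verify against nutch-default.xml in your install):

```xml
<!-- nutch-site.xml: keep parse-html and add metatag extraction/indexing -->
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html|metatags)|index-(basic|anchor|metadata)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
<property>
  <name>metatags.names</name>
  <value>description,keywords</value>
</property>
<property>
  <name>index.parse.md</name>
  <value>metatag.description,metatag.keywords</value>
</property>
```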