You can create a core manually in the file system, in the specific place where
Solr looks for cores when it starts up. I have mine in /opt/solr/server/sol. It
works, at least in Solr 5.4.1 (I haven't tried others).
The core needs a conf dir and a properties file. The properties file should
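A minimal, manually created core typically looks like this (a sketch based on standard Solr core discovery; the directory and core name are only examples):

  <solr home>/nutch/
    core.properties      # a single line such as: name=nutch
    conf/
      schema.xml         # the schema shipped with Nutch (or managed-schema)
      solrconfig.xml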
Hello nutchers!
I am trying to compute linkrank scores without spending excessive time on the
task. My version of the crawl script contains the following line, which is
similar to a commented-out line in the bin/crawl script in the 1.12
distribution.
__bin_nutch webgraph $commonOptions -filter
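For context, the commented-out WebGraph/LinkRank block in the 1.12 bin/crawl looks roughly like this (a sketch; check the script in your distribution for the exact options):

  __bin_nutch webgraph $commonOptions -filter -normalize -segmentDir "$CRAWL_PATH"/segments -webgraphdb "$CRAWL_PATH"
  __bin_nutch linkrank $commonOptions -webgraphdb "$CRAWL_PATH"
  __bin_nutch scoreupdater $commonOptions -crawldb "$CRAWL_PATH"/crawldb -webgraphdb "$CRAWL_PATH"

The scoreupdater step is what pushes the LinkRank scores back into the crawldb.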
Yes, after a couple of rounds there are many, many hosts in the crawldb. Here
are statistics after a bunch of rounds. It seems like we should be able to have
a bunch of threads going.
16/11/05 06:38:45 INFO crawl.CrawlDbReader: Statistics for CrawlDb:
/orgs/data/crawldb
16/11/05 06:38:45 INFO
Can anyone point me to some good information on how to optimize crawling speed
while maintaining politeness?
My current situation is that Nutch is running reliably for me on a single
hadoop node. Before bringing up additional nodes, I want to make it go
reasonably fast on this one node. At the
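A few properties commonly tuned for this speed-vs-politeness trade-off (values are only illustrative; set them in nutch-site.xml):

  fetcher.threads.fetch        e.g. 50    total fetcher threads per fetch task
  fetcher.threads.per.queue    e.g. 1     threads per host queue; >1 is impolite unless the host allows it
  fetcher.server.delay         e.g. 5.0   seconds between requests to the same host
  generate.max.count           e.g. 100   cap on URLs per host/domain in a segment
  generate.count.mode          byHost     or byDomain, controls how the cap is applied

With a polite per-host delay, throughput mostly comes from fetching many different hosts in parallel, so the number of distinct hosts per segment usually matters more than the raw thread count.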
Sent: Monday, November 21, 2016 10:34 AM
Subject: Re: indexing to Solr
Hi Michael,
On Sat, Nov 19, 2016 at 8:09 AM, <user-digest-h...@nutch.apache.org> wrote:
> From: Michael Coffey <mcof...@yahoo.com.invalid>
> To: "user@nutch.apache.org" <user@nutch.apache.org
What is the best version of Solr to use with Nutch 1.12? Does the 6.3.0 version
work well?
Where can I find up-to-date information on indexing to Solr? When I search the
web, I find tutorials that use the deprecated solrindex command. I also find
questions where people want to know why it doesn't work.
I have a good nutch 1.12 installation on a working hadoop cluster and a Solr
6.3.0
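For what it's worth, the non-deprecated route in 1.12 is the generic index command with the indexer-solr plugin enabled in plugin.includes; a typical invocation is sketched below (paths and URL are examples, and solr.server.url can also live in nutch-site.xml):

  bin/nutch index -D solr.server.url=http://localhost:8983/solr/nutch \
    crawl/crawldb -linkdb crawl/linkdb -dir crawl/segments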
I decided to plunge ahead with Solr indexing, but so far it doesn't work. The
first error I got is listed below. Could it be that I am running JDK 7 on the
Nutch server and JDK 8 on the Solr server? As far as I know, Nutch 1.x won't
work with JDK 8, and Solr 6.3 won't work with a JDK older than 8.
November 15, 2016 12:09 AM
Subject: Re: How can I Score?
Hi Michael,
Replies inline
On Sat, Nov 12, 2016 at 7:10 PM, <user-digest-h...@nutch.apache.org> wrote:
> From: Michael Coffey <mcof...@yahoo.com.invalid>
> To: "user@nutch.apache.org" <user@nutch.apache.or
When the generator is used with -topN, it is supposed to choose the
highest-scoring urls. In my case, all the urls in my db have a score of zero,
except the ones injected.
How can I cause scores to be computed and stored? I am using the standard crawl
script. Do I need to enable the various
I think this is what Lewis and Furkan know as NUTCH-2267. I get the same
problem with Solr 5.5.3.
I really would like to know which versions of nutch/solr work together "out of
the box".
From: Michael Coffey <mcof...@yahoo.com.INVALID>
To: "user@nutch.apache.org&qu
What is the best version of Hadoop to use with Nutch 2.3.1? I see that the
runtime/local/lib contains .jar files for hadoop 2.5.2. Does that mean I should
use 2.5.2, or would a newer version like 2.6.5 be even better?
When you say that 1.x is more stable, what does that mean?
From: Markus Jelsma
To: "user@nutch.apache.org"
Sent: Monday, October 31, 2016 9:39 AM
Subject: RE: Nutch 1.x or 2.x
Hello - if you want to crawl big, performance is not
found it yet.
From: Julien Nioche <lists.digitalpeb...@gmail.com>
To: "user@nutch.apache.org" <user@nutch.apache.org>; Michael Coffey
<mcof...@yahoo.com>
Sent: Wednesday, November 2, 2016 9:51 AM
Subject: Re: Nutch 1.x on hadoop
Michael,
See
http://
Does db.ignore.external.links accept only relative urls? I am crawling a site,
let's call it http://www.xyz.com. It contains things like http://www.xyz.com/business.html.
Those urls don't end up in the crawldb, but ones with relative urls do. Is this
normal, or am I confused?
That makes a lot of sense. I had a problem with the tracking UI that I had to
solve by disabling IPv6 on my machine. Now it is working better!
From: Julien Nioche <lists.digitalpeb...@gmail.com>
To: "user@nutch.apache.org" <user@nutch.apache.org>; Michael Coff
I'm having trouble trying to get Nutch 1.12 to run on hadoop 2.7.3.
I get a class not found exception for org.apache.nutch.crawl.Crawl, as in the
following attempt.
$HADOOP_HOME/bin/hadoop jar
"/home/mjc/apache-nutch-1.12/runtime/deploy/apache-nutch-1.12.job"
org.apache.nutch.crawl.Crawl seed
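For anyone hitting the same ClassNotFoundException: org.apache.nutch.crawl.Crawl was removed from the 1.x codebase some releases ago, so the 1.12 job file genuinely does not contain it. The replacement is the bin/crawl wrapper script, roughly (a sketch; older copies of the script may not accept -D, in which case the Solr URL goes into nutch-site.xml):

  runtime/deploy/bin/crawl -i -D solr.server.url=http://localhost:8983/solr/nutch seed/ crawl/ 2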
Newbie question: I am trying to decide between Nutch 1.x or 2.x. The
application is to crawl a large portion of the www using a massive number
(thousands) of small machines (<= 2GB RAM each). I like the idea of the simpler
architecture and pluggable storage backend of 2.x. However, I am
Sebastian
On 12/09/2016 02:15 AM, Michael Coffey wrote:
> I sometimes get a bunch of warning messages that say Thread #x hung while
> processing
> Is this just a normal thing to see occasionally, or should I look to find
> some resolution? I do have an example where the same hos
y sometimes?
Thanks,
Sebastian
On 12/09/2016 04:58 PM, Michael Coffey wrote:
> The property fetcher.parse is false and I pass -noParsing to the fetch
> command. What other post-fetch actions are there?
>
>
> From: Sebastian Nagel <wastl.na...@googlemail.com>
4 AM
Subject: Re: indexing to Solr
Hi Michael,
On Sat, Nov 19, 2016 at 8:09 AM, <user-digest-h...@nutch.apache.org> wrote:
> From: Michael Coffey <mcof...@yahoo.com.invalid>
> To: "user@nutch.apache.org" <user@nutch.apache.org>
> Cc:
> Date: Fri, 18 Nov 2
mcgibbney <lewi...@apache.org>
To: "user@nutch.apache.org" <user@nutch.apache.org>
Sent: Monday, November 21, 2016 10:34 AM
Subject: Re: indexing to Solr
Hi Michael,
On Sat, Nov 19, 2016 at 8:09 AM, <user-digest-h...@nutch.apache.org> wrote:
> From: Michael Coffe
I sometimes get a bunch of warning messages that say Thread #x hung while
processing
Is this just a normal thing to see occasionally, or should I look to find some
resolution? I do have an example where the same host shows up on a multitude of
these messages, which puzzles me. I think there
CloudSolrServer, ConcurrentUpdateSolrServer,
HttpSolrServer or LBHttpSolrServer respectively.
solr.server.url
http://solr5-00:8983/solr/nutch-0
Defines the Solr URL into which data should be indexed using the
indexer-solr plugin.
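In nutch-site.xml form, using the same URL, that is:

  <property>
    <name>solr.server.url</name>
    <value>http://solr5-00:8983/solr/nutch-0</value>
  </property>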
From: Michael Coffey <m
in the master branch.
I am willing to work on some Java code, if necessary, to help resolve this. At
this point, I don't know what to try next, other than switching to
ElasticSearch.
From: Michael Coffey <mcof...@yahoo.com.INVALID>
To: "user@nutch.apache.org" <user@nutch.apache.
Is it possible to get around this problem by using an older version of Solr or
Nutch or both?
From: Michael Coffey <mcof...@yahoo.com.INVALID>
To: "user@nutch.apache.org" <user@nutch.apache.org>; Furkan KAMACI
<furkankam...@gmail.com>; Michael Coffey <mco
/server/solr-webapp/webapp/WEB-INF/lib/httpclient-4.4.1.jar
thanks again
From: Furkan KAMACI <furkankam...@gmail.com>
To: Michael Coffey <mcof...@yahoo.com>
Cc: "user@nutch.apache.org" <user@nutch.apache.org>
Sent: Thursday, December 22, 2016 10:29 AM
Subject: R
ent versions of Solr has not helped (6.3.0,
5.5.3, 5.4.1). FWIW, I have the same version of Java on both machines.
OpenJDK Runtime Environment (IcedTea 2.6.8) (7u121-2.6.8-1ubuntu0.14.04.1)
OpenJDK 64-Bit Server VM (build 24.121-b00, mixed mode)
From: Michael Coffey <mcof...@yahoo.
kam...@gmail.com>
To: Michael Coffey <mcof...@yahoo.com>; user@nutch.apache.org
Sent: Monday, December 19, 2016 4:13 PM
Subject: Re: nutch 1.12 and Solr 5.4.1
Hi Michael,
Could you check the version of solrj at your Nutch and compare it with version
of Solr at your server?
Kind Re
maller, esp. if the linkdb also includes internal
links. Best,
Sebastian
On 04/03/2017 02:08 AM, Michael Coffey wrote:
> In my situation, I find that linkdb merge takes much more time than fetch and
> parse combined, even though fetch is fully polite.
>
> What is the standard advice
In my situation, I find that linkdb merge takes much more time than fetch and
parse combined, even though fetch is fully polite.
What is the standard advice for making linkdb-merge go faster?
I call invertlinks like this:
__bin_nutch invertlinks "$CRAWL_PATH"/linkdb
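For reference, the full usage is roughly invertlinks <linkdb> (-dir <segmentsDir> | <segment> ...) [-force] [-noNormalize] [-noFilter]; a sketch of an invocation that skips per-URL normalization and filtering, which can shave some time off the job:

  __bin_nutch invertlinks "$CRAWL_PATH"/linkdb -dir "$CRAWL_PATH"/segments -noNormalize -noFilter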
I want to find out what the crawldb knows about some specific urls. According
to the nutch wiki, I should use nutch readdb with the -url option. But when I
do a command like the following, I get nasty "can't find class" exceptions.
$NUTCH_HOME/runtime/deploy/bin/nutch readdb
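For reference, the expected form of the command is (a sketch; the URL must match exactly as stored in the crawldb, trailing slash included where applicable):

  $NUTCH_HOME/runtime/deploy/bin/nutch readdb crawl/crawldb -url http://www.example.com/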
In the standard crawl script, there is a _bin_nutch updatedb command and, soon
after that, a _bin_nutch dedup command. Both of them launch hadoop jobs with
"crawldb /path/to/crawl/db" in their names (in addition to the actual
deduplication job).
In my situation, the "crawldb" job launched by
I know this might be more of a SOLR question, but I bet some of you know the
answer.
I've been using Nutch 1.12 + SOLR 5.4.1 successfully for several weeks, but
suddenly I am having frequent problems. My recent changes have been (1)
indexing two segments at a time, instead of one, and (2)
er running the dedup job with the common options fixes your problem.
That's easily done: just edit src/bin/crawl and run "ant runtime". Thanks,
Sebastian
On 04/28/2017 02:54 AM, Michael Coffey wrote:
> In the standard crawl script, there is a _bin_nutch updatedb command and,
> soon afte
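Concretely, the edit amounts to adding $commonOptions to the dedup line in src/bin/crawl, roughly (a sketch):

  __bin_nutch dedup $commonOptions "$CRAWL_PATH"/crawldb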
ming
solr for the misleading messages!
Hi Michael,
What do you have in your Solr logs?
Kind Regards,
Furkan KAMACI
On Tuesday, 2 May 2017 at 02:45, Michael Coffey
<mcof...@yahoo.com.invalid> wrote:
> I know this might be more of a SOLR question, but I bet some of you know
>
as shown by the Hadoop resource manager webapp, see screenshot. It also shows
where each configuration property is set from.
Best,
Sebastian
On 05/02/2017 12:57 AM, Michael Coffey wrote:
> Thanks, I will do some testing with $commonOptions applied to dedup. I
> suspect that the dedup-
I am looking for a methodology for making the crawler cycle go faster. I had
expected the run-time to be dominated by fetcher performance but, instead, the
greater bulk of the time is taken by linkdb-merge + indexer + crawldb-update +
generate-select.
Can anyone provide an outline of such a
I really need overlapping crawl cycles to take advantage of a non-standard
hardware platform I am required to use.
I will rephrase my question: If I set generate.update.crawldb=true, will I be
able to call the generator more than once without explicitly calling the
crawldb-update in between those
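For reference, the property in question (nutch-site.xml):

  <property>
    <name>generate.update.crawldb</name>
    <value>true</value>
    <!-- when true, the generator marks the selected URLs in the crawldb so that
         a second generate run does not pick them again before the next updatedb -->
  </property>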
In search of more effective parallelism, I have been experimenting with
different schemes for organizing the nutch jobs. I would like to know if the
Generator can work in a way that supports what I'm trying to do.
Here is a pseudocode description of one approach. I use variables named curSegs
Yes, it's true, it certainly does make things more complicated.
For example, now that I turn on generate.update.crawldb, suddenly there is a
permissions problem where it tries to create a temp file directory
mapred/temp/generate-temp-. Is that a known bug?
Hi,
Yes, enable
I am trying to develop a news crawler and I want to prohibit it from wandering
too far away from the seed list that I provide.
It seems like I should use the DepthScoringFilter, but I am having trouble
getting it to work. After a few crawl cycles, all the _depth_ metadata say
either 1 or 1000.
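For anyone trying the same thing, the filter lives in the scoring-depth plugin, which must be listed in plugin.includes, and the cut-off comes from scoring.depth.max; a sketch (the plugin list is only an example, keep whatever else you already use):

  <property>
    <name>plugin.includes</name>
    <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-depth|urlnormalizer-(pass|regex|basic)</value>
  </property>
  <property>
    <name>scoring.depth.max</name>
    <value>3</value>
  </property>

The default for scoring.depth.max is 1000, which would explain the 1000 values showing up in the _depth_ metadata.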
van Hemert | alterNET internet BV <ji...@alternet.nl>
To: user <user@nutch.apache.org>
Sent: Tuesday, September 19, 2017 11:43 PM
Subject: Re: depth scoring filter
Hi,
On 20 September 2017 at 06:36, Michael Coffey <mcof...@yahoo.com.invalid>
wrote:
> I am trying do de
So, I had these numbers in my index:
Num Docs: 189550, Max Docs: 285531, Deleted Docs: 95981
Then I did a crawl and index, which told me: indexed (add/update): 13,423
And now I have these numbers in my index:
Num Docs: 190785, Max Docs: 223339, Deleted Docs: 32554
So, I am completely confused. I don't
Does anybody have any thoughts on this? It seems similar to the NUTCH-1016 bug
that was fixed in version 1.4.
Some more bits of information: the indexer job rarely fails (only 1 of the last
99 segments) but the cleaning job fails every time now. Once again, this is
Nutch 1.12 and Solr 5.4.1. I
Hello Nutchians,
I need to be able to query a (nutch 1.x) crawldb for read-only
search/sort/summarize purposes, based on combinations of status, fetch_time,
score, and things like that. What is a good tool or process for doing such
things?
Up until now, I've been doing readdb-dump and then
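One approach, for what it's worth, is to dump the crawldb to CSV and query it with external tools; a sketch:

  bin/nutch readdb crawl/crawldb -dump dump_out -format csv
  # then load the CSV into Hive, Spark, or a plain SQL database and
  # filter/sort on status, fetch time, score, etc.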
09/22/2017 04:57 AM, Michael Coffey wrote:
> I am still having trouble with the depth scoring filter, and now I have a
> simpler test case. It does work, somewhat, when I give it a list of 50 seed
> URLs, but when I give it a very short list, it fails.
> I have tried depth.max values i
If the Inject command does filtering, then the documentation should say so. The
page https://wiki.apache.org/nutch/bin/nutch%20inject does not mention any
filtering or normalization. I find it very counter-intuitive that an injection
operation would delete existing data.
Should I edit that
Perhaps my strangest question yet!
Why does Inject delete URLs from the crawldb and how can I prevent it?
I was trying to add 2 new sites to an existing crawldb. According to readdb
stats, about 10% of my URLs disappeared in the process.
(before injecting) 17/09/27 19:22:33 INFO
With my new news crawl, I would like to keep web pages in the index, even after
they have disappeared from the web, so I can continue using them in
machine-learning processes. I thought I could achieve this by avoiding running
cleaning jobs. However, I still notice increasing numbers of
Lately, I have seen many tasks and jobs fail in Solr when doing nutch index and
nutch clean.
Messages during indexing look like this.
17/08/24 19:18:59 INFO mapreduce.Job: map 100% reduce 99%
17/08/24 19:19:36 INFO mapreduce.Job: Task Id :
attempt_1502929850483_1329_r_07_2, Status : FAILED
8 chars when indexing or cleaning
>
> From the logs looks like the error is coming from the Solr side, do you
> mind checking/sharing the logs on your Solr server? Can you pin point which
> URL is causing the issue?
> Best Regards, Jorge
>
> On Tue, Aug 29, 2017 at 9:2
I think I have an instance of the known bug
https://issues.apache.org/jira/browse/NUTCH-2186
I need to keep raw html in my Solr index (or somewhere) so that an external
tool can access it and parse it. So, I added addBinaryContent and base64 to my
indexing command. On the very first segment, I
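For reference, the flags in question are the indexer's -addBinaryContent and -base64 options; the invocation looks roughly like this (a sketch, the segment path is a placeholder):

  bin/nutch index crawl/crawldb -linkdb crawl/linkdb crawl/segments/$SEGMENT -addBinaryContent -base64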
Thanks for the reply!
I'm not sure the best way to illustrate the issue, as I struggle with solr log
management within docker. However, here are a few URLs that have exhibited the
problem. In each case, Solr complains "Error adding field 'binaryContent'" ...
"msg=String length must be a
I am curious, is it possible to send boilerpipe output to Solr as a separate
"plaintext" field, in addition to the usual "content" field (rather than
replacing it)? If so, would someone please give an overview of how to do it?
Also, try the boilerpipe demo online at https://boilerpipe-web.appspot.com/
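For reference, boilerpipe extraction in the Tika parser is switched on through these properties (a sketch; getting the raw content into a second field on top of that would still need a custom indexing filter, which is not shown here):

  <property>
    <name>tika.extractor</name>
    <value>boilerpipe</value>
  </property>
  <property>
    <name>tika.extractor.boilerpipe.algorithm</name>
    <value>ArticleExtractor</value>
  </property>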
From: Markus Jelsma
To: "user@nutch.apache.org"
Sent: Wednesday, November 15, 2017 2:06 PM
Subject: RE: [MASSMAIL]RE: Removing
I found a lot of detail about the boilerpipe algorithm in
http://www.l3s.de/~kohlschuetter/publications/wsdm187-kohlschuetter.pdf
Seems like very short paragraphs can be a problem, since one of the primary
features used for determining boilerplate is the length of a given text block.
I would
Greetings Nutchlings,
I have been using readseg-dump successfully to retrieve content crawled by
nutch, but I have one significant problem: many non-ASCII characters appear as
'???' in the dumped text file. This happens fairly frequently in the headlines
of news sites that I crawl, for things
TML encoding (the
code is available
in Nutch) and then convert the byte[] content using the right encoding.
Best,
Sebastian
On 11/15/2017 02:20 AM, Michael Coffey wrote:
> Greetings Nutchlings,
> I have been using readseg-dump successfully to retrieve content crawled by
> nutch, but I
intervals and scores
in the crawl db.
From: Michael Coffey <mcof...@yahoo.com.INVALID>
To: User <user@nutch.apache.org>
Sent: Friday, November 24, 2017 3:13 PM
Subject: need to override refetch intervals
In order to achieve the most timely crawling o
In order to achieve the most timely crawling of news sites, I want to be able
to manipulate the refetch intervals and scores in the crawl db. I thought I
could accomplish that by re-injecting the urls that should be re-fetched most
often. According to the documentation, it seems I should be
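The injector's behaviour toward URLs that already exist in the crawldb is controlled by two properties, so re-injecting with these set is one way to push new scores and intervals onto existing entries (a sketch; both default to false):

  <property>
    <name>db.injector.update</name>
    <value>true</value>
    <!-- merge the injected score/metadata into existing crawldb entries -->
  </property>
  <property>
    <name>db.injector.overwrite</name>
    <value>false</value>
    <!-- if true, injected entries replace existing ones entirely -->
  </property>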
I bet that problem affects a lot of people. It certainly has affected me.
Why isn't essential filtering ON by default?
The bin/crawl script doesn't even have a way for the operator to specify any
filltering. And nowhere, in the tutorial, is it mentioned that you need to
specify "-filter" to
follows redirects)
It's most efficient not to filter the CrawlDb. It's costly to apply the filters
again and again: the CrawlDb might be huge (up to billions of URLs), and/or the
filter rules can be complex. The default does what is necessary but avoids
unnecessary work.
Best,
Sebastian
On 11/29/2017 05:07 P
Is it possible to purge low-scoring urls from the crawldb? My news crawl has
many thousands of zero-scoring urls and also many thousands of urls with scores
less than 0.03. These urls will never be fetched because they will never make
it into the generator's topN by score. So, all they do is
07.
4. a more reliable solution would require detecting the HTML encoding (the code
is available
in Nutch) and then convert the byte[] content using the right encoding.
Best,
Sebastian
On 11/15/2017 02:20 AM, Michael Coffey wrote:
> Greetings Nutchlings,
> I have been using readseg-du
That is a very interesting note. I have been wanting something like that. I use
the python-based "newspaper" package but it is not directly compatible with the
nutch/hadoop infrastructure.
From: Jorge Betancourt
To: user@nutch.apache.org
Cc:
I want to blacklist certain top-level domains for a very large web crawl. I
tried using the domainblacklist urlfilter in Nutch 1.12, but that doesn't seem
to work.
My domainblacklist-urlfilter.txt contains lines like the following.
cn
jp
line.me
albooked.com
booked.co.il
The TLDs do not get
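One thing worth checking: the plugin only runs if it is listed in plugin.includes, and the rules file name must match urlfilter.domainblacklist.file; a sketch (the plugin list is abbreviated):

  <property>
    <name>plugin.includes</name>
    <value>...|urlfilter-domainblacklist|...</value>
  </property>
  <property>
    <name>urlfilter.domainblacklist.file</name>
    <value>domainblacklist-urlfilter.txt</value>
  </property>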
Just to clarify: .99 does NOT work fine. It should have rejected most of the
records when I specified "((Math.random())>=.99)".
I have used expressions not involving Math.random. For example, I can extract
records above a specific score with "score>1.0". But the random thing doesn't
work even
I want to extract a random sample of URLS from my big crawldb. I think I should
be able to do this using readdb -dump with a Jexl expression, but I haven't
been able to get it to work.
I have tried several variations of the following command.
$NUTCH_HOME/runtime/deploy/bin/nutch readdb
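For reference, the form being attempted is roughly the following (a sketch; -expr support depends on the Nutch version, and the JEXL expression is evaluated once per CrawlDatum):

  $NUTCH_HOME/runtime/deploy/bin/nutch readdb crawl/crawldb -dump sample_out -format csv -expr "score > 0.03"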
I guess there is no solution or workaround for the addBinaryContent bug, so I
have to write code to read directly from segment data. If not writing Java, I
guess I have to do readseg-dump and then parse the output text file.
-- original message --
I think I have an instance of the known bug
I am having a problem crawling some sites that seem to be transitioning to
https. All their links contain http urls and the fetcher gets response code 301
and content that says "the document has moved" because the actual content is
accessible only via https. This has been happening for a few
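One knob that matters here: by default the fetcher records a redirect for a later round instead of following it, which is governed by http.redirect.max (a sketch; the value is an example):

  <property>
    <name>http.redirect.max</name>
    <value>2</value>
    <!-- follow up to N redirects during the fetch itself instead of
         queueing the target for a later round; the default is 0 -->
  </property>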
ter which may cause the redirect targets to be filtered?
On 03/09/2018 08:39 PM, Michael Coffey wrote:
> I am having a problem crawling some sites that seem to be transitioning to
> https. All their links contain http urls and the fetcher gets response code
> 301 and content that says &quo
Greetings Nutchlings,
I would like to make my generate jobs go faster, and I see that the reducer
spills a lot of records.
Here are the numbers for a typical long-running reduce task of the
generate-select job: 100 million spilled records, 255K input records, 90k
output records, 13G file bytes
temporary) is on
SSDs
try different compression settings (CrawlDb and temporary data), see
mapreduce.output.fileoutputformat.compress.codec
mapreduce.map.output.compress
mapreduce.map.output.compress.codec
Best,
Sebastian
On 04/13/2018 02:52 AM, Michael Coffey wrote:
> Greetings Nutchlings,
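Those settings can go into nutch-site.xml (or mapred-site.xml) or be passed per job, e.g. (a sketch; the codec choices are examples):

  -D mapreduce.map.output.compress=true \
  -D mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.SnappyCodec \
  -D mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.GzipCodec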
I think you will find that you need different rules for each website and that
some amount of maintenance will be needed as the websites change their
practices.
Greetings Nutchlings,
How can I identify segments that are no longer useful, now that I have been
using AdaptiveFetchSchedule for several months?
I have db.fetch.interval.max = 31536000 (365 days), but I know that tons of
pages get re-fetched every 30-60 days because I have
But all the old segment data is still sitting there in hdfs.
On Friday, March 23, 2018, 1:34:21 PM PDT, Sebastian Nagel <> wrote:
Hi Michael,
when segments are merged only the most recent record of one URL is kept.
Sebastian
On 03/23/2018 09:25 PM, Michael Coffey wrote:
>
Do current 1.x versions of Nutch (1.18, and trunk/master) work with versions of
Hadoop greater than 3.1.3? I ask because Hadoop 3.1.3 is from October 2019, and
there are many newer versions available. For example, 3.1.4 came out in 2020,
and there are 3.2.x and 3.3.x versions that came out this