Hi Sachin,
practically every Nutch tool (inject, generate, fetch, parse, update, index)
can filter (and normalize) URLs. Because filtering and normalizing are expensive,
only the steps which add new URLs (inject and parse) do this by default (see
bin/crawl).
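If needed, filtering and normalization can also be switched on explicitly for
other steps. A rough sketch (untested; the segment path is just a placeholder,
-filter and -normalize are options of the updatedb tool):

  bin/nutch updatedb crawl/crawldb crawl/segments/20191001123456 -filter -normalize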
For your use case you might instead
Hi Markus,
I've tested in pseudo-distributed mode with Hadoop 3.2.1,
including indexing into Solr. It worked.
Could be a dependency version issue similar to that
causing NUTCH-2706. But that's only an assumption.
Since IndexWriters.describe() is only used for the help output,
I would just deactivate this
Hi Dave,
could you share an example document? Which Nutch version are you using?
I tried to reproduce the problem without success using Nutch v1.16:
- example document:
Test metatags
test for metatag extraction
- using parse-html (works)
> bin/nutch indexchecker -Dmetatags.names='*' \
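The command above is cut off; a fuller invocation might look like the following
(plugin list, metadata keys and test URL are assumptions, not from the original
mail):

  bin/nutch indexchecker -Dmetatags.names='*' \
    -Dplugin.includes='protocol-http|parse-(html|metatags)|index-(basic|metadata)' \
    -Dindex.parse.md='metatag.description,metatag.keywords' \
    https://www.example.com/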
Hi folks!
The Apache Nutch [0] Project Management Committee are pleased to announce
the immediate release of Apache Nutch v1.16. We advise all current users
and developers to upgrade to this release.
Nutch is a well-matured, production-ready Web crawler. Nutch 1.x enables
fine-grained
Betancourt Gonzalez *
Sebastian Nagel *
[0] -1 Do not release this package because ...
* Nutch PMC
The VOTE passes with 6 binding votes from Nutch PMC members.
I'll continue and publish the release packages. Tomorrow, after the
packages have been propagated to all mirrors, I'll send
Hi Folks,
thanks to everyone who was able to review the release candidate!
72 hours have passed, please see below for vote results.
[4] +1 Release this package as Apache Nutch 2.4
Lewis John McGibbney *
Jorge Luis Betancourt Gonzalez *
Furkan Kamaci *
Sebastian Nagel *
[0] -1 Do
Hi Markus,
> 2019-10-03 12:48:49,696 INFO crawl.Generator - Generator: number of items
> rejected during selection:
> 2019-10-03 12:48:49,698 INFO crawl.Generator - Generator: 1
> SCHEDULE_REJECTED
see NUTCH-2737 Generator: count and log reason of rejections during selection
- useful
Hi Folks,
A first candidate for the Nutch 1.16 release is available at:
https://dist.apache.org/repos/dist/dev/nutch/1.16/
The release candidate is a zip and tar.gz archive of the binary and sources in:
https://github.com/apache/nutch/tree/release-1.16
In addition, a staged maven
is loaded including
the version number. I've opened
https://issues.apache.org/jira/browse/NUTCH-2741
to remove it.
Best,
Sebastian
On 28.09.19 17:54, lewis john mcgibbney wrote:
> Hi Seb,
>
> On Thu, Sep 26, 2019 at 4:37 AM wrote:
>
>> From: Sebastian Nagel
>> To: user@n
Hi Folks,
A first candidate for the Nutch 2.4 release is available at:
https://dist.apache.org/repos/dist/dev/nutch/2.4/
The release candidate is a zip and tar.gz archive of sources in:
https://github.com/apache/nutch/tree/release-2.4
In addition, a staged maven repository is available
Hi Dave,
the boilerplate removal (boilerpipe) works if parse-tika is used for parsing,
but the parser.html.NodesToExclude property belongs to a feature which never
made it into the code base, see
https://issues.apache.org/jira/browse/NUTCH-585
Or do you work with a patched version?
Best,
Hi all,
more than 90 issues are fixed now:
https://issues.apache.org/jira/projects/NUTCH/versions/12343430
The last release (1.15) was already more than one year ago (July 25, 2018).
It's time! Of course, we'll check for all remaining issues whether they should
be fixed now or can be moved to be
log of any issues that need to be resolved for the wiki?
>
> Regards,
> Sid
>
> -Original Message-
> From: Sebastian Nagel
> Sent: August 10, 2019 2:43 AM
> To: user@nutch.apache.org
> Subject: Re: Few inner links are not openi
, Sadiki Latty wrote:
> Hey Sebastian,
>
> I have signed up for an account I will try to help out where/when I can. Is
> there a list/backlog of any issues that need to be resolved for the wiki?
>
> Regards,
> Sid
>
> -Original Message-----
> From: Sebastian Nagel
&
Thanks, it's fixed now. The wiki has been migrated recently and it looks like
the inner links haven't been properly converted.
If anybody is eager to help us and improve the Nutch wiki - you're welcome!
Please apply for an account in the wiki. Nutch is a community project and
we need your help.
s,
> Furkan KAMACI
>
> On Fri, Jul 26, 2019 at 12:39 PM Sebastian Nagel
> wrote:
>
> Hi all,
>
> the Nutch wiki has been migrated from MoinMoin to Confluence.
>
> You'll find it now on
> https://cwiki.apache.org/confluence/display/NUTCH/Hom
Hi all,
the Nutch wiki has been migrated from MoinMoin to Confluence.
You'll find it now on
https://cwiki.apache.org/confluence/display/NUTCH/Home
Work on improving the Wiki - updating information and moving outdated stuff
into "Archive and Legacy" - is ongoing. Help is welcome, if you want
Hi,
if server S3 has Solr running, this is a simple configuration change:
- (Nutch 1.14) just change the property solr.server.url (see the sketch below)
- (Nutch 1.15) see https://wiki.apache.org/nutch/IndexWriters
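A minimal sketch for nutch-site.xml in the 1.14 case (host and core name are
placeholders, not taken from the original thread):

  <property>
    <name>solr.server.url</name>
    <value>http://s3.example.com:8983/solr/nutch</value>
  </property>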
Best,
Sebastian
On 7/22/19 5:30 PM, Rushi wrote:
> Hi All,
> I need some help on this ,I have two different
Hi Ryan,
could be caused by the managed schema. Note that for Solr 7.x updating the
schema.xml alone may not be sufficient, see
https://wiki.apache.org/nutch/NutchTutorial#Setup_Solr_for_search
Let us know whether this works. Thanks!
And we'll then update the wiki page, i.e. the corresponding page in the new wiki:
Let me try that.
>>
>>
>> On Tue, Jul 9, 2019 at 10:15 AM Sebastian Nagel
>> wrote:
>>
>>> Hi Ryan,
>>>
>>> there is one:
>>>
>>> >> action="/user/login"
>>> method="post" id="
value="spid3r_us"/>
>>
>>
>>> value="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_3)
>> AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100
>> Safari/537.36"
>> />
&
Hi,
the error message is quite clear:
> 2019-07-02 10:36:59,202 DEBUG httpclient.HttpFormAuthentication - No form
> element found with 'id' = user-login-form, trying 'name'.
> 2019-07-02 10:36:59,205 DEBUG httpclient.HttpFormAuthentication - No form
> element found with 'name' = user-login-form
Hi Gajanan,
> Can the *scoring-similarity plugin* for Nutch 1.x be *modified* to run with
> nutch 2.3.1? if yes, how?
Possibly, yes. Have a look at the differences of another scoring filter plugin
between 1.x and 2.x, and try to apply those to scoring-similarity.
> Can somebody guide me on
leaving the rest in a common directory does the trick!
> Being able to configure the file names would sure be nice but for now I don't
> mind having separate directories.
>
> Felix
>
>> Von: Sebastian Nagel
>>
>> Hi Felix,
>>
>> assumed that every t
ta)
>
> the parse metadata only contains "metatag.robots" while with this setup
>
> protocol-httpclient|parse-(html|metatags)|index-(metadata)
>
> the parse metadata contains both "metatag.robots" and "robots".
>
> Felix
>
Hi Felix,
I tried to reproduce the problem. The parse-metatags plugin only duplicates the
"robots" metatag, adding it also as "metatag.robots" while keeping the original
"robots".
That is the case using the current master:
- with parse-metatags and metatags.names="robots" the ParseData object
Hi Felix,
assuming that every test crawl runs on its own, not sharing resources with other
test crawls (except the Nutch packages): you may just write a separate
index-writers.xml for every test, place it in a separate directory, and point
NUTCH_CONF_DIR to this directory.
This works only in local
Hi Michael,
can you provide a patch or pull request for the upgrade?
There is an issue that has been open for a long time [1], but the available
patches are reported to raise further issues (see issue comments).
The challenge is indeed to test all the authentication options
supported by protocol-httpclient
Hi Ryan,
you may have a look at the plugin scoring-depth.
It tracks the depth (links away from one of the seeds)
of a crawled page and could be modified to also write
the parents (maybe only the first one) into the CrawlDatum
metadata.
Best,
Sebastian
On 4/9/19 9:08 PM, Ryan Suarez wrote:
>
Hi,
in deploy mode there are usually also jars from the Hadoop installation in the
classpath.
That might cause the issue. Because the Hadoop job client communicates via HTTP
with the other Hadoop components these conflicts are not easy to fix.
You could try to build Nutch yourself adding
e referring to?
>
> Thanks
> Srini
>
> On Thu, Mar 14, 2019 at 1:06 PM Sebastian Nagel <mailto:wastl.na...@googlemail.com>> wrote:
>
> > remove from index, but later we found that some valid pages (when we
> curl
> > them we get 200) are al
Hi,
if running in local mode, it's better to pass it via environment variables to bin/nutch, cf.
# Environment Variables
#
# NUTCH_JAVA_HOME The java implementation to use. Overrides JAVA_HOME.
#
# NUTCH_HEAPSIZE The maximum amount of heap to use, in MB.
# Default is 1000.
#
# NUTCH_OPTS
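For example (values are only illustrations, not recommendations):

  export NUTCH_HEAPSIZE=4096          # 4 GB heap for the local Nutch JVM
  export NUTCH_OPTS="-XX:+UseG1GC"    # additional JVM options, if needed
  bin/nutch updatedb crawl/crawldb crawl/segments/20190301000000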
Hi,
> Can Nutch index custom HTTP headers?
Nutch stores the HTTP response headers if the property
`store.http.headers` is true. The headers are saved as a single
string, concatenated with `\r\n`, under the key
`_response.headers_` in the content metadata.
You can send the entire HTTP headers to the indexer
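A sketch for nutch-site.xml (assuming the index-metadata plugin is enabled and
its index.content.md property is used to pick up the content metadata key;
please verify the property names against your Nutch version):

  <property>
    <name>store.http.headers</name>
    <value>true</value>
  </property>
  <property>
    <name>index.content.md</name>
    <value>_response.headers_</value>
  </property>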
*Service-Disabled Veteran-Owned Small Business (SDVOSB)*
> 763-323-3499
> dbeckst...@figleaf.com
>
>
> On Tue, Mar 5, 2019 at 12:44 PM Sebastian Nagel
> wrote:
>
>> Hi Dave,
>>
>> I'm by now means an expert of the JEXL syntax (cf.
>> (http:
Hi Dave,
I'm by no means an expert on the JEXL syntax (cf.
http://commons.apache.org/proper/commons-jexl/reference/syntax.html),
but after a few trials the expression must be
doc.getFieldValue('url')=~'.*/englishnews/.*'
It's easy to test using the indexchecker, e.g.
% bin/nutch indexchecker
eleted the
> lock file, and changed
> the permissions to 755. Still getting on error (image attached).
>
> ----
> *From:* Sebastian Nagel
> *Sent:* Wednesday, February 20, 2019 3:57 PM
>
Hi,
> "chmod 655 "
Shouldn't it be "755"? Otherwise the user is not allowed to list the
contents of the directory, which will definitely cause an error.
The user running Nutch is required to have "rwx" permissions in the
"crawldb" folder and all its subfolders.
>
o use the Nutch server and monitor the jobs
> and their statuses? I will then delete the failed ones.
>
>
> Regards
> Ameer
>
>
>
> On Wed, Feb 20, 2019 at 8:58 PM Sebastian Nagel
> wrote:
>
>> Hi Ameer,
>>
>> (bringing this back to user@nutch -
Hi Suraj,
the correct syntax would be:
__bin_nutch dedup -Dmapreduce.job.reduces=32 "$CRAWL_PATH"/crawldb
Hadoop configuration properties must be passed before the remaining arguments,
and you need to pass them as -Dname=value.
To confirm: I routinely run the dedup job with 1200 reducers on a CrawlDb
s
> being created in the
> *tmp* directory. It also seems slow to me.
>
> Regards
> Ameer
>
>
>
> On Wed, Feb 20, 2019 at 6:10 AM Sebastian Nagel <mailto:wastl.na...@googlemail.com>> wrote:
>
> Hi Ameer,
>
> yes, you're correct. If lau
d. Thanks again.
>
> Thanks & Regards
> Venkata MR
> +91 98455 77125
>
> -Original Message-
> From: Venkata MR
> Sent: 18 December 2018 16:40
> To: 'Sebastian Nagel'
> Cc: user@nutch.apache.org
> Subject: RE: Apache Nutch 2.3.1 not able to fetch conten
Hi,
Nutch loads all configuration files from the Java class path and picks the first
file found on the class path (and ignores other files with the same name).
If there are multiple crawls with different configurations, just place a
crawl-specific
configuration directory in front of the
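In local mode this is typically done via NUTCH_CONF_DIR, which bin/nutch uses
as the configuration directory on the class path. A sketch (all paths are
placeholders):

  export NUTCH_CONF_DIR=/data/crawlA/conf
  bin/nutch inject /data/crawlA/crawldb /data/crawlA/seeds.txt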
Yes. They don't get updated, stay in status db_unfetched,
and will be generated again in the next cycle.
On 12/18/18 5:01 PM, Suraj Singh wrote:
> Hello,
>
> I want to understand what happens to the URLs which remains Unfetched due to
> fetch time limit.
> Are they fetched in the subsequent
Hi,
> protocol-httpclient (as the websites are with https).
With Nutch 1.15 protocol-selenium supports https. If protocol-httpclient
is also active, it may be used instead of protocol-selenium. There is
no need to activate it; the description in nutch-default.xml needs to
be fixed, see
Hi,
yes, of course, the comment just one line above even encourages you to do so:
# note that some of the options listed here could be set in the
# corresponding hadoop site xml param file
For most use cases this value is ok. Only if you're using a parsing fetcher
with many threads you
may
Hi,
the pattern should work. Of course, you need to make sure that
- there are no other patterns coming earlier in regex-urlfilter.txt
  which cause the URL to be rejected
- no other active URL filter plugin rejects the URL
- the folder containing the regex-urlfilter.txt you're editing
;
> bin/nutch parsechecker -dumpText
> http://www.vialucy.nl/
> Parse Metadata:
>
> So, default one provides empty metadata and no error messages. This is a bit
> confusing.
>
> Thanks.
>
&
Hi Yossi,
> I think in the case that you interrupt the fetcher, you'll have the problem
> that URLs
> that were scheduled to be fetched on the interrupted cycle will never be
> fetched
> (because of NUTCH-1842).
Yes, but only if generate.update.crawldb is true which is not the case by
> Is there any reasons to keep the default HTML plugin there? only for
> maintenance ?
>
> Semyon.
>
> Sent: Thursday, November 15, 2018 at 2:23 PM
> From: "Sebastian Nagel"
> To: user@nutch.apache.org
> Subject: Re: Quality problems of crawling. Parsing(Mi
Hi Semyon,
I've tried to reproduce your problems using the recent Nutch master (upcoming
1.16).
I cannot see any issues, except that JavaScript is not executed, but that's
expected.
Of course, you are free to use parse-tika instead of parse-html which is legacy.
See results below.
Best,
Sebastian
Hi Nicholas,
looks like it's the user-agent string sent in the HTTP header
which makes the server return no/empty content.
bin/nutch parsechecker \
-Dhttp.agent.name="mytestbot" \
-Dhttp.agent.version=3.0 \
-Dhttp.agent.url=http://example.com/ https://whatdavidread.ca/
Obviously, the
Hi,
thanks for the problem report. However, I would argue not to handle such
specific cases inside Nutch: it makes the Nutch code extremely complex and
requires extra effort to stay portable across operating systems.
Why not just make the file invisible again?
Or if this isn't possible:
- write
Hi Marco,
did you increase
http.content.limit
The default is 64 kB; saturn.de pages are much larger, and it may happen that
the first 64 kB always contain the same set of navigation links (linking to
product categories here).
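A sketch for nutch-site.xml (the value is only an example, here 1 MB):

  <property>
    <name>http.content.limit</name>
    <value>1048576</value>
  </property>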
Feel free to open an issue on
Hi Timeka,
> because Solr is missing the
> files from its packet for it to work.
There are many Solr versions available and it may easily happen that the
description in the Wiki is outdated or not applicable for your combination
of Nutch and Solr.
Please try to give as much information as
Hi Amarnath,
the only possibility is that https://www.abc.com/ is skipped
- by another rule in regex-urlfilter.txt
- or another URL filter plugin
Please check your configuration carefully. You may also use the tool
bin/nutch filterchecker
to test the filters beforehand: every active filter
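For example (reading the URL from stdin; the -stdin flag exists in recent 1.x
versions, older ones read stdin without it):

  echo "https://www.abc.com/" | bin/nutch filterchecker -stdin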
Wiki
> where is says to install Solr I don't understand the directions given that
> lead up to creating a nutch core..how do I copy resources and manage
> schema,etc..the breakdown confuses me.. Thank you again
>
> Timeka
>
> On Mon, Oct 1, 2018, 7:12 AM Sebastian Nagel
>
Hi Timeka,
well, the really short answer is: Nutch sends "documents" to Solr using
the SolrJ client library. A "document" is a single web page fetched, parsed
and split into indexable fields, e.g., "title", "keywords", "content".
For further information you may look into
Hi,
could you explain in detail what is meant by "parent URL"?
- the page the PDF document is linked from
- a redirect pointing to the PDF doc
- the "directory" of the PDF URL (clip URL after last "/")
- ...
Nutch indexes all successfully fetched pages but not redirects,
404s, etc. Of course,
Hi,
crawling and indexing Office documents should work out-of-the-box without any
configuration changes; the plugin parse-tika is enabled by default in recent
Nutch versions. The only recommended change is to increase the content limit:
http.content.limit
65536
The length limit for
Hi Yossi, hi Lewis,
actually, this is caused by a change of the IndexWriter interface as part of
NUTCH-1480 (multiple
index writers of same type). It's reported as a breaking change, but it only
changes the way the index writers are configured. Sorry, we missed adding a note
Hi,
please also note that the way the index writer plugins are configured has
changed with 1.15,
see release notes and https://wiki.apache.org/nutch/bin/nutch%20index.
The Solr URL can no longer be passed via -Dsolr.server.url=...
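Instead, the URL goes into conf/index-writers.xml. A minimal sketch (parameter
names follow the 1.15 defaults as far as I remember them; please check the
shipped index-writers.xml):

  <writer id="indexer_solr_1"
          class="org.apache.nutch.indexwriter.solr.SolrIndexWriter">
    <parameters>
      <param name="type" value="http"/>
      <param name="url" value="http://localhost:8983/solr/nutch"/>
      <param name="commitSize" value="1000"/>
    </parameters>
    <!-- keep the <mapping> section from the default index-writers.xml -->
  </writer>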
I'll update the bin/crawl wiki page.
Thanks,
Sebastian
On
The Apache Nutch [0] Project Management Committee are pleased to announce
the immediate release of Apache Nutch v1.15. We advise all current users
and developers of the 1.X series to upgrade to this release.
Nutch is a well-matured, production-ready Web crawler. Nutch 1.x enables
fine-grained
Hi Folks,
thanks to everyone who was able to review the release candidate!
72 hours have passed, please see below for vote results.
[4] +1 Release this package as Apache Nutch 1.15
Roannel Fernández Hernández *
Govind Nitk
Markus Jelsma *
Sebastian Nagel *
[0] -1 Do not release
Hi Markus
> 2018-08-01 11:42:10,660 INFO fetcher.FetcherThread - FetcherThread 47
> fetching
https://en.wikipedia.org/wiki/Special:RecentChanges (queue crawl delay=5000ms)
Ok, non-blocking because of:
User-agent: *
Disallow: /wiki/Special:
> 2018-08-01 11:42:10,660 INFO fetcher.FetcherThread
Hi Fred,
as soon as you generate the fetch list (if you call bin/crawl this is done)
and the CrawlDb contains at this time items with a (re)fetch date in the past,
you'll get a non-empty fetch list and Nutch will (re)fetch those pages.
You always have to call bin/crawl explicitly. Of course,
Hi Folks,
A first candidate for the Nutch 1.15 release is available at:
https://dist.apache.org/repos/dist/dev/nutch/1.15/
The release candidate is a zip and tar.gz archive of the binary and sources in:
https://github.com/apache/nutch/tree/release-1.15
The SHA1 checksum of the archive
2018-07-20 09:46:20,809 INFO solr.SolrMappingReader - source: boost
>> dest:
>>> boost
>>> 2018-07-20 09:46:20,809 INFO solr.SolrMappingReader - source: digest
>> dest:
>>> digest
>>> 2018-07-20 09:46:20,809 INFO solr.SolrMappingReader - source: tsta
; org.apache.http.impl.conn.PoolingClientConnectionManager.requestConnection(PoolingClientConnectionManager.java:184)
> at
> org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:415)
> at
> org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:863)
> a
Hi,
> * Changed my regex-filter to use development domain address.
Did you also change your seeds?
The fact that deletions are sent but not additions/updates
suggests that no pages have been successfully crawled.
Could you specify the Nutch version used and also attach some
log snippets to
Dear all,
it is my pleasure to announce that Roannel Fernández Hernández
has joined us as a committer and member of the Nutch PMC.
Recently, Roannel contributed a long list of improvements related
to the indexer plugins: a new indexer for RabbitMQ, the possibility
to index into multiple
Hi Robert,
why not switch on boilerpipe for parse-tika?
tika.extractor
none
Which text extraction algorithm to use. Valid values are: boilerpipe or none.
tika.extractor.boilerpipe.algorithm
ArticleExtractor
Which Boilerpipe algorithm to use. Valid values are:
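To switch it on, override both properties in nutch-site.xml, e.g.:

  <property>
    <name>tika.extractor</name>
    <value>boilerpipe</value>
  </property>
  <property>
    <name>tika.extractor.boilerpipe.algorithm</name>
    <value>ArticleExtractor</value>
  </property>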
Dear all,
it is my pleasure to announce that Omkar Reddy has joined us
as a committer and member of the Nutch PMC. Omkar has worked
on upgrading Nutch to use the new MapReduce API as part of his
Google Summer of Code project last year.
Thanks, Omkar, and congratulations on your new role within
Hi Michael,
on the Common Crawl Nutch fork there is a plugin "urlfilter-fast" which does
this, see
https://github.com/commoncrawl/nutch/blob/cc-fast-url-filter/src/plugin/urlfilter-fast/src/java/org/apache/nutch/urlfilter/fast/FastURLFilter.java
It uses exactly this concept of "domain", i.e.,
definition in nutch-default.xml
On 06/12/2018 02:26 PM, BlackIce wrote:
> PS: Does this work when configured in site.xml like regular metatdata?
>
> On Tue, Jun 12, 2018 at 1:31 PM BlackIce wrote:
>
>> sweet thnx!
>>
>> On Tue, Jun 12, 2018 at 1:29 PM Sebastian Nage
>>> ++1!
>>>
>>> Sounds great.
>>>
>>> Cheers,
>>> Chris
>>>
>>> From: Sebastian Nagel
>>>
crawl on Hadoop mid of this week.
But any help in testing is welcome.
Note that the tutorial needs to be updated (will be done after 1.15
is finally released) to reflect the changes related to NUTCH-1480.
Thanks,
Sebastian
[1] https://github.com/sebastian-nagel/nutch-test-single-node-cluster
[2
Hi Markus,
ok, no problem. Done:
https://github.com/crawler-commons/crawler-commons/issues/213
Sebastian
On 06/07/2018 12:21 AM, Markus Jelsma wrote:
> Sebastian, I do not want to be a pain in the arse, but I do not have a
> GitHub account. If you would do the honours of opening a
> I agree that this is not the ideal error behaviour, but I guess the code
> was written from the
assumption that the document is valid and conformant.
Over time the crawler-commons sitemap parser has been extended to get as much
as possible from
non-conforming sitemaps as well. Of course,
Hi Bob,
it's impossible to make any diagnosis without the full log files,
the complete configuration, and a detailed description of what is missing.
It could be a bug, of course. But it's more likely a configuration issue,
you should check the log files. Also have a look at:
- the robots.txt of the
That's trivial. Just run ant in the plugin's source folder:
cd src/plugin/urlnormalizer-basic/
ant
or to run also the tests
cd src/plugin/urlnormalizer-basic/
ant test
Note: you have to compile the core test classes first by running
ant compile-core-test
in the Nutch "root" folder.
nore it unless it causes a problem for my other cores.
>
> Chip
>
> -Original Message-
> From: Sebastian Nagel [mailto:wastl.na...@googlemail.com]
> Sent: Monday, April 30, 2018 12:21 PM
> To: user@nutch.apache.org
> Subject: Re: Nutch fetching times out at 3 hou
Hi,
if you still see the log message
fetcher.FetchItemQueues - * queue: https://history.aip.org >> dropping!
then it can be only
- fetcher.timelimit.mins
- fetcher.max.exceptions.per.queue
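Both can be set in nutch-site.xml; -1 disables them (and is the shipped default,
so check whether bin/crawl or your own script overrides them on the command line):

  <property>
    <name>fetcher.timelimit.mins</name>
    <value>-1</value>
  </property>
  <property>
    <name>fetcher.max.exceptions.per.queue</name>
    <value>-1</value>
  </property>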
> I crawl a list of roughly 2600 URLs all on my local server
If this is the case you can crawl more
Hi Michael,
> reducer spills a lot of records
The job counter "Spilled Records" is not for the reducers alone.
> 255K input records
Does your CrawlDb only contain 250,000 entries?
Also, how many hosts (resp. domains/ips depending on partition.url.mode)
are in the CrawlDb? Note: the counts per
Hi Fred,
Nutch does nothing "proactively", the crawl jobs must be explicitly called.
But you need no special command:
- let's say you didn't change the defaults and
db.fetch.interval.default == 30 days
- if you launch bin/crawl one month later, all pages are refetched,
and optionally
Hi Eric,
the ability to add binary content was implemented in Nutch 1.11;
you need to upgrade (an upgrade to 1.14 is recommended).
The command-line help of
$NUTCH_HOME/bin/nutch index
indicates how to add a Solr field with the "binary" HTML content:
Usage: Indexer ... [-addBinaryContent]
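A full invocation might look like this (paths are placeholders; -base64 encodes
the binary content, which is usually needed when sending it to Solr):

  bin/nutch index crawl/crawldb -linkdb crawl/linkdb \
    crawl/segments/20180401000000 -addBinaryContent -base64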
Hi Michael,
when segments are merged only the most recent record of one URL is kept.
Sebastian
On 03/23/2018 09:25 PM, Michael Coffey wrote:
> Greetings Nutchlings,
>
> How can I identify segments that are no longer useful, now that I have been
> using AdaptiveFetchSchedule for several
>
>
> On Tue, Mar 20, 2018 at 3:31 AM, Sebastian Nagel <wastl.na...@googlemail.com
>> wrote:
>
>> Hi Robert,
>>
>> unfortunately, I'm not able to reproduce the problem.
>> Fetching works with the recent 1.x and Java 8, I've tried both:
>>
>>
Hi,
> more control over what is being indexed?
It's possible to enable URL filters for the indexer:
bin/nutch index ... -filter
With little extra effort you can use different URL filter rules
during the index step, e.g. in local mode by pointing NUTCH_CONF_DIR
to a different folder.
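A sketch of such an indexing run (paths are placeholders; the conf directory
would contain a stricter regex-urlfilter.txt used only at indexing time):

  export NUTCH_CONF_DIR=/path/to/index-conf
  bin/nutch index crawl/crawldb -linkdb crawl/linkdb crawl/segments/* -filter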
>> I
Hi Robert,
unfortunately, I'm not able to reproduce the problem.
Fetching works with the recent 1.x and Java 8, I've tried both:
bin/nutch parsechecker -Dplugin.includes='protocol-http|parse-html'
https://potomac.edu/
bin/nutch parsechecker
Hi John,
the recent master has seen an upgrade to the new MapReduce API (NUTCH-2375);
it was a huge change which is already known to have introduced some issues.
For production it's recommended to use 1.14 and if necessary patch it.
Could you open a new issue on
Hi Shiva,
1. you can define URL normalizer rules to rewrite the URLs
but it only works for sites where you know which URL is
the canonical form (see the sketch below, after point 2).
2. you can deduplicate (command "nutch dedup") based on the
content checksum: the duplicates are still crawled but deleted
afterwards
It's
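For point 1, normalizer rules go into conf/regex-normalize.xml (used by the
urlnormalizer-regex plugin). A hypothetical rule that strips utm_* tracking
parameters so URL variants collapse to one canonical form:

  <regex-normalize>
    <regex>
      <pattern>&amp;utm_[a-z]+=[^&amp;]*</pattern>
      <substitution></substitution>
    </regex>
  </regex-normalize>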
gt; seems like I am not able to reopen a closed/resolved issue. Sorry...
>
>> -Original Message-
>> From: Sebastian Nagel <wastl.na...@googlemail.com>
>> Sent: 12 March 2018 17:39
>> To: user@nutch.apache.org
>> Subject: Re: UrlRegexFilter is gett
> example. The only other place I can think of where this may be needed is
> after redirect.
> This is pretty much the same as what Semyon suggests, whether we push it down
> into the filterNormalize method or do it before calling it.
>
> Yossi.
>
>> -Orig
Subject: Re: UrlRegexFilter is getting destroyed for unrealistically long
>> links
>> Some regular expressions (those with backtracking) can be very expensive for
>> long strings
>>
>> https://regular-expressions.mobi/catastrophic.html?wlr=1
>>
>> Maybe that
Good catch. It should be renamed to be consistent with other properties, right?
On 03/12/2018 01:10 PM, Yossi Tamari wrote:
> Perhaps, however it starts with db, not linkdb (like the other linkdb
> properties), it is in the CrawlDB part of nutch-default.xml, and LinkDB code
> uses the property
Hi Yossi,
it's used in FetcherThread and ParseOutputFormat:
git grep -F db.max.outlinks.per.page
However, it's not there to limit the length of a single outlink in characters
but the number of outlinks followed (added to CrawlDb).
There was NUTCH-1106 to add a property to limit the outlink length.
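For reference, a sketch of the existing property in nutch-site.xml (the default
is 100; -1 means no limit):

  <property>
    <name>db.max.outlinks.per.page</name>
    <value>200</value>
  </property>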
> Another problem is that they have fetch_time well into the future,
> I guess because retry_interval is applied.
Correct. Fetch time is
- the time when to fetch next, for a CrawlDatum in the CrawlDb
- the time when the fetch happened, for entries in a segment's crawl_fetch folder
On 03/09/2018 11:04 PM,
> What is the best way to handle this, in general? I am thinking of specifying
> http.redirect.max=1
(rather than the default 0) in nutch-site.xml because I want it to fetch these
pages right away,
rather than waiting until the next cycle.
Of course, you can do this. But keep in mind: if both,