Hi Sheham,
the nutch-site.xml configures:
  mapreduce.task.timeout = 1800
1.8 seconds (1800 milliseconds) is very short. The default is 600 seconds or 10
minutes, see [1]. Since Nutch needs to finish fetching before the task timeout
applies, threads fetching not quickly enough and
https://github.com/sebastian-nagel/nutch-test-single-node-cluster/
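The timeout property above presumably appears in nutch-site.xml like this; a sketch restoring the Hadoop default (the value is in milliseconds):

```xml
<!-- Sketch: mapreduce.task.timeout is measured in milliseconds;
     600000 restores the Hadoop default of 10 minutes. -->
<property>
  <name>mapreduce.task.timeout</name>
  <value>600000</value>
</property>
```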
One note about the CHANGES.md: it's now a mixture of HTML and plain text.
It does not use Markdown's potential, e.g. sections/headlines for
the releases that would make the change log navigable via a table of contents.
The embedded
Hi Tim,
>> I'm using the okhttp protocol, because I don't think the http protocol
>> stores truncation information.
However, protocol-http could mark truncations as well. Please also open an
issue for this and the other protocol plugins.
>> Should I open a ticket to have ParseSegment also
Hi Michael,
> I wonder if there is not already a build-in option to exclude HTML
> elements (like a div with a given id or class or other elements like header).
No, there isn't one so far.
> I know https://issues.apache.org/jira/browse/NUTCH-585
> I also do not understand why this little
Hi,
yes, this is possible by pointing the environment variable
NUTCH_LOG_DIR to a different folder.
The default is: $NUTCH_HOME/logs/
See also the script bin/nutch which is called by bin/crawl:
https://github.com/apache/nutch/blob/master/src/bin/nutch#L30
(it's also possible to change the log
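Pointing the variable at another folder can be sketched as follows (the target folder below is just an example):

```shell
# NUTCH_LOG_DIR is read by bin/nutch; the folder is arbitrary.
export NUTCH_LOG_DIR=/tmp/nutch-logs
mkdir -p "$NUTCH_LOG_DIR"
echo "Nutch logs will go to: $NUTCH_LOG_DIR"
```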
, 2023 at 10:36 AM Sebastian Nagel
wrote:
Hi Steve,
>
file:/RMS/sha256/a0/ec/b0/a0/e0/ef/80/74/a0ecb0a0e0ef80747871563e2060b028c3abd330cb644ef7ee86fa9b133cbc67
what does the file contain? An .eml file (following RFC822)?
Would it be possible to share this file or at least a chunk large
enough to reproduce the issue?
The error
Dear all,
It is my pleasure to announce that Tim Allison has joined us
as a committer and member of the Nutch PMC.
You may already know Tim as a maintainer of and contributor to
Apache Tika. So, it was great to see contributions to the
Nutch source code from an experienced developer who is also
Hi Eric,
unfortunately, on Windows you also need to download and install winutils.exe and
hadoop.dll,
see
https://github.com/cdarlint/winutils and
https://stackoverflow.com/questions/41851066/exception-in-thread-main-java-lang-unsatisfiedlinkerror-org-apache-hadoop-io
The installation of
Hi Kamil,
> I was wondering if this script is advisable to use?
I haven't tried the script itself but some of the underlying commands
- mergedb, etc.
> merge command ($nutch_dir/nutch merge $index_dir $new_indexes)
Of course, some of the commands are obsolete. A long time ago, Nutch
used Lucene
Hi,
please send a mail to
user-unsubscr...@nutch.apache.org
See
https://nutch.apache.org/community/mailing-lists/
Thanks!
Best,
Sebastian
On 1/25/23 14:53, Steven Zhu wrote:
Please unsubscribe me from the users list.
Steven
On Tue, Jan 24, 2023 at 10:27 PM Ankit gupta
wrote:
owse/NUTCH-2974
Just in case you want to try it.
~Sebastian
On 11/21/22 10:36, Sebastian Nagel wrote:
Hi Kamil,
thanks for trying and finding a solution! I've opened a JIRA issue to track the
problem: https://issues.apache.org/jira/browse/NUTCH-2974
Thanks!
Sebastian
On 11/19/22 18:37, Kam
Hadoop cluster. All commands are the same as in fully
distributed mode.
If it helps, I prepared some setup scripts to run Nutch in pseudo-distributed
mode:
https://github.com/sebastian-nagel/nutch-test-single-node-cluster
Best,
Sebastian
On 1/15/23 04:26, Mike wrote:
I will now try to confi
Hi Mike,
> It can be tedious to set up for the first time, and there are many components.
In case you prefer Linux packages, I can recommend Apache Bigtop, see
https://bigtop.apache.org/
and for the list of package repositories
https://downloads.apache.org/bigtop/stable/repos/
~Sebastian
Hi Paul,
> the indexer was writing the
> documents info in the file (nutch.csv) twice,
Yes, I see. And now I know what I've overlooked:
.../bin/nutch index -Dmapreduce.job.reduces=2
You need to run the CSV indexer with only a single reducer.
In order to do so, please pass the option
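Presumably the option in question is Hadoop's reducer-count property, mirroring the command quoted above (a sketch; the remaining arguments stay unchanged):

```shell
# Assumption: a single reducer so the CSV writer produces one complete file.
bin/nutch index -Dmapreduce.job.reduces=1 ...
```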
Hi Paul,
as far as I can see, the indexer is run only once and now indexes 26 documents:
org.apache.nutch.indexer.IndexingJob 2022-11-22 06:32:57,164 INFO
o.a.n.i.IndexingJob [main] Indexer: 26 indexed (add/update)
The logs also indicate that both segments are indexed at once:
Hi Kamil,
thanks for trying and finding a solution! I've opened a JIRA issue to track the
problem: https://issues.apache.org/jira/browse/NUTCH-2974
Thanks!
Sebastian
On 11/19/22 18:37, Kamil Mroczek wrote:
I've been able to work around this issue by adding "pattern" to touch tag
on line 101
Hi everybody,
because of a growing number of spam account creations, public sign-ups to the
Apache JIRA have been disabled.
In order to allow users to report bugs, we have two options:
1. either users let us know about the issue on the mailing list and one of the
Nutch PMC members creates a user account
Hi Paul,
yes, the CSV indexer removes the CSV output before it starts a new one.
The problem here is that the indexer is run twice in a loop.
Possible work-arounds - assuming you're using the script bin/crawl:
1. after each indexing command in the loop, move the CSV output so that
it does not get
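The first work-around can be sketched as follows (the output path and segment name are hypothetical; check your index-writers.xml for the real CSV output directory):

```shell
# Stand-in for the CSV indexer output from one indexing round:
mkdir -p outDir/csvindexwriter
# Move it aside, keyed by the segment name, so the next round
# of the CSV indexer does not overwrite it:
segment=20221122063257   # hypothetical segment timestamp
mv outDir/csvindexwriter "outDir/csvindexwriter.$segment"
```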
Hi Mike, hi Markus,
there's also
https://issues.apache.org/jira/browse/NUTCH-1806
which would make it much easier to keep up-to-date with the public suffix list.
More precisely, because crawler-commons loads the public suffix list
(for historic reasons named "effective_tld_names.dat") from the class
The Apache Nutch team is pleased to announce the release of
Apache Nutch v1.19.
Nutch is a well matured, production ready Web crawler. Nutch 1.x enables
fine grained configuration, relying on Apache Hadoop™ data structures.
Source and binary distributions are available for download from the
Hi Folks,
thanks to everyone who was able to review the release candidate!
72 hours have definitely passed, please see below for vote results.
[4] +1 Release this package as Apache Nutch 1.19
Markus Jelsma *
BlackIce *
Jorge Betancourt *
Sebastian Nagel *
[0] -1 Do not release
nks
> Mike
>
> Am Fr., 2. Sept. 2022 um 13:25 Uhr schrieb Sebastian Nagel
> :
>
>> Hi Mike,
>>
>> the Nutch/Solr schema.xml will be updated with the release of 1.19
>> (expected
>> soon, a vote about RC#1 is ongoing):
>> [NUTCH-2955] - replace
cache.
>
> Since Ralf can compile it without problems, it seems to be an issue on my
> machine only. So Nutch seems fine, therefore +1.
>
> Regards,
> Markus
>
> [1]
> https://repo1.maven.org/maven2/org/apache/httpcomponents/httpasyncclient/4.1.4/
>
>
> Op zo 2
Hi Mike,
the Nutch/Solr schema.xml will be updated with the release of 1.19 (expected
soon, a vote about RC#1 is ongoing):
[NUTCH-2955] - replace deprecated/removed field type solr.LatLonType
[NUTCH-2957] - add fall-back field definitions for unknown index fields
[NUTCH-2956] - typos in field
ss]
>>>>
>>>> SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an
>>>> explanation.
>>>> SLF4J: Actual binding is of type
>>>> [org.apache.logging.slf4j.Log4jLoggerFactory]
>>>>
>>>> And t
g4j system properly.
> log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for
> more info.
>
> I am worried about the indexer-elastic plugin, maybe others have that
> problem too? Otherwise everything seems fine.
>
> Markus
>
> Op ma 22 aug. 2022 om 17:3
://github.com/sebastian-nagel/nutch-test-single-node-cluster/)
ote:
> Sounds good!
>
> I see we're still at Tika 2.3.0, i'll submit a patch to upgrade to the
> current 2.4.1.
>
> Thanks!
> Markus
>
> Op di 9 aug. 2022 om 09:11 schreef Sebastian Nagel :
>
>> Hi all,
>>
>> more than 60 issues are done for Nutch 1
Hi all,
more than 60 issues are done for Nutch 1.19
https://issues.apache.org/jira/projects/NUTCH/versions/12349580
including
- important dependency upgrades
- Hadoop 3.3.3
- Any23 2.7
- Tika 2.3.0
- plugin-specific URL stream handlers (NUTCH-2429)
- migration
- from Java/JDK 8
Fyi, the issue is tracked on
https://issues.apache.org/jira/browse/NUTCH-2955
~Sebastian
On 7/14/22 12:54, Sebastian Nagel wrote:
> Hi Mike,
>
> if you do not use the plugin index-geoip, you could simply delete the line
>
> subFieldSuffix="_coordinate&
Hi Rastko,
the description isn't really correct now as NUTCH_HOME is supposed to point to
the runtime
- if the binary package is used: this is the base folder of the package,
eg. apache-nutch-1.18/
- if Nutch is built from the source, you usually point NUTCH_HOME to
runtime/local/ - the
Hi Bob,
could you share which instructions and when the error happens - during import,
project build, running/debugging?
The usual way is
1. to write the Eclipse project configuration, run
ant eclipse
2. import the written project configuration into Eclipse
Building or running/debugging
Hi Mike,
if you do not use the plugin index-geoip, you could simply delete the line
Otherwise, after the deprecation and the removal of the LatLonType class [1],
it should be:
But I haven't verified whether indexing with index-geoip enabled and the
retrieval works.
In any case,
Hi Michael,
Nutch (1.18, and trunk/master) should work together with more recent Hadoop
versions.
At Common Crawl we use a modified Nutch version based on the recent trunk
running on Hadoop 3.2.2 (soon 3.2.3) and Java 11, even on a mixed Hadoop cluster
with x64 and arm64 AWS EC2 instances.
But
Hi Michael,
the only differences in the protocol-httpclient plugin between Nutch 1.11 and
1.13 are
- NUTCH-2280 [1] which allows configuring the cookie policy
- NUTCH-2355 [2] which allows setting an explicit cookie for a request URL
Could this be related?
Are there any useful hints what could
t; processing.
>
> Kind regards,
> Roseline
>
>
>
>
>
> -Original Message-
> From: Sebastian Nagel
> Sent: 12 January 2022 16:12
> To: user@nutch.apache.org
> Subject: Re: Nutch not crawling all URLs
>
> Hi Roseline,
>
>> the
r
> than it will be truncated; otherwise, no truncation at all. Do not
> confuse this setting with the file.content.limit setting.
>
>
> db.ignore.external.links.mode = byHost
>
> db.injector.overwrite = true
>
> http.timeout
Hi Ayhan,
you mean?
https://stackoverflow.com/questions/69352136/nutch-does-not-crawl-sites-that-allows-all-crawler-by-robots-txt
Sebastian
On 12/13/21 20:59, Ayhan Koyun wrote:
> Hi,
>
> as I wrote before, it seems that I am not the only one who can not crawl all
> the seed.txt url's. I
Hi Roseline,
> 5,36405,0,http://www.notco.com
What is the status for https://notco.com/ which is the final redirect
target?
Is the target page indexed?
~Sebastian
Antai
> Research Fellow
> Hunter Centre for Entrepreneurship
> Strathclyde Business School
> University of Strathclyde, Glasgow, UK
>
>
> The University of Strathclyde is a charitable body, registered in Scotland,
> number SC015263.
>
>
> -Original Message-
Hi Shi Wei,
fyi: a fix for NUTCH-2903 is ready
https://github.com/apache/nutch/pull/703
Sebastian
On 11/16/21 13:54, Sebastian Nagel wrote:
> Hi Shi Wei,
>
> looks like you're the first trying to connect to ES from Nutch over
> HTTPS. HTTP is used as default scheme and the
The issue is now tracked in
https://issues.apache.org/jira/browse/NUTCH-2907
On 10/28/21 15:31, Sebastian Nagel wrote:
> Hi Shi Wei,
>
> sorry, but it looks like the Selenium protocol plugin has never been
> used with a proxy over https. There are two points which need (at a
&g
gt; following in the log4j.properties but it doesn't help.
>
> log4j.logger.org.apache.nutch.indexwriter.elastic.ElasticIndexWriter=WARN,cmdstdout
> log4j.logger.org.apache.nutch.indexwriter.elastic.ElasticUtils=WARN,cmdstdout
>
>
> Best Regards,
> Shi Wei
>
> On 202
Hi Shi Wei,
looks like you're the first trying to connect to ES from Nutch over
HTTPS. HTTP is used as default scheme and there is no way to configure
the Elasticsearch index writer to use HTTPS.
Please open a Jira issue. It's a trivial fix.
For a quick fix: in the Nutch source package (or
Hi Max,
fyi, the Jira issue is created:
https://issues.apache.org/jira/browse/NUTCH-2902
(to make sure that this is not forgotten)
Thanks,
Sebastian
On 10/11/21 18:11, Sebastian Nagel wrote:
> Hi Max,
>
>> I was able to fix this by switching from JexlExpression to JexlScrip
Hi Shi Wei,
there is a way, although definitely not the recommended one.
Sorry, and it took me a little bit to prove it.
Do you know about external XML entities or XXE attacks?
1. On top of the index-writers.xml you add an entity declaration:
]>
2. it's used later in the index writer spec:
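The XML itself was stripped by the mail rendering; the usual external-entity pattern looks roughly like this (file path, entity name, and parameter name are all hypothetical):

```xml
<!-- Step 1: entity declaration at the top of index-writers.xml -->
<!DOCTYPE writers [
  <!ENTITY solrPassword SYSTEM "file:///secure/solr-password.txt">
]>
<!-- Step 2: the entity is referenced inside a writer parameter -->
<writers>
  <param name="password" value="&solrPassword;"/>
</writers>
```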
Hi Shi Wei,
sorry, but it looks like the Selenium protocol plugin has never been
used with a proxy over https. There are two points which need (at a
first glance) a rework:
1. the protocol tries to establish a TLS/SSL connection to the proxy if
the URL to be crawled is an https:// URL. There
Authentication Scheme
>
> Your sincerely,
> Shi Wei
>
> -Original Message-
> From: Sebastian Nagel
> Sent: Monday, 25 October, 2021 5:31 PM
> To: user@nutch.apache.org
> Subject: Re: Encrypt or Mask the password
>
> Hi Shi Wei,
>
> for t
Hi Shi Wei,
for the nutch-site.xml it's possible to use Java properties and/or
environment variables,
see section "Variable expansion" in
https://hadoop.apache.org/docs/r3.3.1/api/org/apache/hadoop/conf/Configuration.html
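In nutch-site.xml this can look as follows (the property name and variable name are examples; the ${env.…} syntax is Hadoop's):

```xml
<!-- Sketch: the value is expanded from the environment variable
     SOLR_PASSWORD at runtime (both names here are hypothetical). -->
<property>
  <name>solr.auth.password</name>
  <value>${env.SOLR_PASSWORD}</value>
</property>
```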
In case you're asking about index-writers.xml - variable expansion
Hi Shi Wei,
could you also share the index writer configuration (conf/index-writers.xml)?
The default is unauthenticated access to Solr, see the snippet below.
The file httpclient-auth.xml is not relevant for the Solr indexer, it's
used if a crawled web site requires authentication in order to
Hi Max,
> I was able to fix this by switching from JexlExpression to JexlScript. I
> have a small patch that I'm happy to contribute!
Yes, that would be great! Please also open a Jira issue so that the
problem shows up in the Changelog.
Thanks!
Best,
Sebastian
On 10/11/21 6:34 AM, Max
Hi Markus,
the okhttp protocol plugin should work out-of-the-box
and we use it in production (currently on Hadoop 3.2.2)
I remember that I once had an issue with the Hadoop library
having okhttp as a dependency which then caused a conflict.
It was solved by adding an exclusion rule to the
Hi Clark,
thanks for summarizing this discussion and sharing the final configuration!
Good to know that it's possible to run Nutch on Hadoop using S3A without
using HDFS (no namenode/datanodes running).
Best,
Sebastian
> The local file system? Or hdfs:// or even s3:// resp. s3a://?
Also important: the value of "mapreduce.job.dir" - it's usually
on hdfs:// and I'm not sure whether the plugin loader is able to
read from other filesystems. At least, I haven't tried.
On 6/15/21 10:53 AM, Sebastia
Hi Clark,
sorry, I should have read your mail to the end - you mentioned that
you downgraded Nutch to run with JDK 8.
Could you share to which filesystem does NUTCH_HOME point?
The local file system? Or hdfs:// or even s3:// resp. s3a://?
Best,
Sebastian
On 6/15/21 10:24 AM, Clark Benham
Hi Clark,
the class URLNormalizer is not in a plugin - it's part of Nutch core and defines the interface for URL normalizer plugins. Looks like
there's something fundamentally wrong, not only with the plugins.
> I am trying to run Nutch-1.19 on hadoop-3.2.1 with an S3
Are you aware that the
Hi Gorkem,
I haven't verified it by trying - but it may be that given your configuration
the Solr instance isn't reachable via
http://localhost:8983/solr/nutch
Inside the Docker network, host names are the same as container names, that is
http://solr:8983/solr/nutch
might work. Cf. the
Hi Lewis, hi Markus,
> snappy compression, which is a massive improvement for large data shuffling
jobs
Yes, I can confirm this. Also: it's worth considering zstd for all data kept for
longer. We use it for a 25-billion CrawlDB: it's almost as fast (both
compression
and decompression) as
Thanks! Interesting that the dublexweb bot ignores the wildcard user agent
rules by default.
On 6/3/21 11:44 PM, lewis john mcgibbney wrote:
Some interesting content for a short read :)
-data-europe/docker-hadoop
[2]
https://github.com/sebastian-nagel/docker-hadoop/tree/2.0.0-hadoop3.3.0-java11
ile.
Although looking at it now it's clear.
This makes it easier for me to access the html content within my plugin,
thanks again
On Fri, May 28, 2021 at 8:36 PM Sebastian Nagel
wrote:
Hi Kieran,
see the command-line options
-addBinaryContent
index raw/binary con
Hi Kieran,
see the command-line options
-addBinaryContent
index raw/binary content in field `binaryContent`
-base64
use Base64 encoding for binary content
of the Nutch index job [1]. Note that the content may indeed be
binary, eg. for PDF documents but also
Hi Prateek,
alternatively, you could modify the URLPartitioner [1], so that during the
"generate" step
the URLs of a specific host or domain are distributed over more partitions. One
partition
is the fetch list of one fetcher map task. At Common Crawl we partition by
domain and made
the
yfro.com/>
2021-05-05 17:35:30,786 INFO [main] org.apache.nutch.fetcher.FetcherThread:
FetcherThread 50 has no more work available/
I am not sure what I am missing.
Regards
Prateek
On Thu, May 6, 2021 at 10:21 AM Sebastian Nagel mailto:wastl.na...@googlemail.com>> wrote:
Hi Prateek,
could you share
Hi Lewis,
> 2) post-processing (Nutch) Hadoop sequence data by converting it to Parquet
format?
Yes, but not directly - it's a multi-step process. The outcome:
https://commoncrawl.org/2018/03/index-to-warc-files-and-urls-in-columnar-format/
This Parquet index is optimized by sorting the
Hi Andrew,
> if this flag is used *--sitemaps-from-hostdb always*
Do the crawled hosts announce the sitemap in their robots.txt?
If not, do the sitemap URLs follow the pattern
http://example.com/sitemap.xml ?
See https://cwiki.apache.org/confluence/display/NUTCH/SitemapFeature
If this is
Hi Prateek,
are there any URL filters which filter away image links?
You can verify this using the URL filter checker:
echo "https://example.com/image.jpg" \
| bin/nutch filterchecker -stdin
The default rules in conf/regex-urlfilter.txt exclude common
image suffixes. Note that there can
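The exclusion works as a regular expression over the URL; a minimal sketch in Python (simplified - the shipped rule in regex-urlfilter.txt covers many more suffixes):

```python
import re

# Simplified version of the image-suffix exclusion rule in
# conf/regex-urlfilter.txt; a '-' rule rejects matching URLs.
EXCLUDE = re.compile(r"\.(gif|jpg|jpeg|png|ico|bmp)$", re.IGNORECASE)

def passes_filter(url: str) -> bool:
    # URLs matching the exclusion pattern are filtered away.
    return EXCLUDE.search(url) is None

print(passes_filter("https://example.com/image.jpg"))  # False: rejected
print(passes_filter("https://example.com/page.html"))  # True: kept
```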
Hi,
no, NUTCH-2353 is still open, see
https://issues.apache.org/jira/projects/NUTCH/issues/NUTCH-2353
The implementation caused a regression, so it was reverted.
Best,
Sebastian
On 12/6/20 7:03 AM, Von Kursor wrote:
> Hello
>
> Has this API enhancement been implemented under 1.17 ?
>
> I
Hi,
> Nutch 2.4 with selenium
Nutch 2.4 does not include any plugin to use Selenium. In addition, 2.4 is for
now the last release on the 2.x branch, which is not
maintained anymore. You should use 1.x (1.17 is the
most recent release).
> standalone nutch crawling with selenium.
For 1.x there's
Hi,
this question is better asked on the Solr user mailing list
as Nutch people are not necessarily familiar with Solr on a deep level.
Please also share more details - which JavaScript client, the error message,
the log messages of the Solr server at this time. This helps to trace the
error
from Mail<https://go.microsoft.com/fwlink/?LinkId=550986> for Windows 10
>
> From: Sebastian Nagel<mailto:wastl.na...@googlemail.com.INVALID>
> Sent: Tuesday, August 11, 2020 4:56 PM
> To: user@nutch.apache.org<mailto:user@nutch.apache.org>
> Subject: Re: Regarding N
Hi Joe,
> I eliminated it when I updated the index-writers.xml for the solr_indexer_1
> to use only a single URL.
Thanks for the hint. I'm able to reproduce the error by adding an overlong URL
to
Could you open an issue to fix this on
https://issues.apache.org/jira/projects/NUTCH ?
Hi,
Nutch does not include a search component anymore. These steps are obsolete.
All you need is to set up your Hadoop cluster, then run
$NUTCH_HOME/runtime/deploy/bin/nutch ...
(instead of .../runtime/local/bin/nutch ...)
Alternatively, you could launch a Nutch tool, eg. Injector
the
Dear all,
it is my pleasure to announce that Shashanka Balakuntala Srinivasa has joined us
as a committer and member of the Nutch PMC. Shashanka Balakuntala has worked
recently
on a longer list of Nutch issues and improvements.
Thanks, Shashanka Balakuntala, and congratulations on your new role
o conversion to fetch Job directly
> so see if there are some improvements.
>
> I have also concluded this discussion here -
> https://stackoverflow.com/questions/63003881/apache-nutch-1-16-fetcher-reducers/.
> So if you want to add something here, please feel free to do so.
>
> Regard
Fetcher will be directly creating the final
>> avro format that I need. So the only question remains is that if I do
>> fetcher.parse=true, can I get rid of parse Job as a separate step
>> completely.
>>
>> Regards
>> Prateek
>>
>> On Tue, Jul 21, 2020
avro conversion step, we just convert data into avro schema
> and dump to HDFS. Do you think we still need reducers in the fetch phase?
> FYI- I tried running with 0 reducers and don't see any impact as
> such.
>
> Appreciate your help.
>
> Regards
> Prateek
>
> On Tu
Hi Prateek,
you're right, there is no specific reducer used, but without a reduce step
the segment data isn't (re)partitioned and the data isn't sorted.
This was a strong requirement once Nutch was a complete search engine
and the "content" subdir of a segment was used as page cache.
Getting the
The Apache Nutch team is pleased to announce the release of
Apache Nutch v1.17.
Nutch is a well matured, production ready Web crawler. Nutch 1.x enables
fine grained configuration, relying on Apache Hadoop™ data structures.
Source and binary distributions are available for download from the
Hi Folks,
thanks to everyone who was able to review the release candidate!
72 hours have passed, please see below for vote results.
[4] +1 Release this package as Apache Nutch 1.17
Markus Jelsma *
Furkan Kamaci *
Shashanka Balakuntala Srinivasa
Sebastian Nagel *
[0] -1 Do
Hi Craig,
in case, you're building Nutch from the git repo or from the source package
the easiest way is to put the file NewCustomHandler.java into
src/plugin/protocol-interactiveselenium/src/java/.../handlers/
and run
ant runtime
to compile and package Nutch, including packaging your custom
Hi Folks,
A first candidate for the Nutch 1.17 release is available at:
https://dist.apache.org/repos/dist/dev/nutch/1.17/
The release candidate is a zip and tar.gz archive of the binary and sources in:
https://github.com/apache/nutch/tree/release-1.17
In addition, a staged maven
Hi,
the list of open issues for 1.17 became short, and I will move some of the
remaining issues to 1.18 to clear the way and prepare the first release
candidate in the next two days.
If there are urgent fixes (including a PR / patch), let me know!
Thanks,
Sebastian
Hi Jim,
Nutch 1.17 should land soon but there are a couple of issues to be fixed before
the release.
Best,
Sebastian
On 6/8/20 12:11 AM, Lewis John McGibbney wrote:
> Hi Jim,
> Response below
>
> On 2020/06/06 14:23:24, Jim Anderson wrote:
>>
>> I cannot find a download for Nutch 1.17. Is
t;
>>
>> user Digest 23 Apr 2020 06:27:46 - Issue 3055
>>
>> Topics (messages 34517 through 34517)
>>
>> [DISCUSS] Release 1.17 ?
>> 34517 by: Sebastian Nagel
>>
>> Administrivia:
>>
>> --
Hi all,
30 issues are done now
https://issues.apache.org/jira/browse/NUTCH/fixforversion/12346090
including a number of important dependency upgrades:
- Hadoop 3.1 (NUTCH-2777)
- Elasticsearch 7.3.0 REST client (NUTCH-2739)
Thanks to Shashanka Balakuntala Srinivasa for both!
Dependency
Hi Robert,
404s are recorded in the CrawlDb after the tool "updatedb" is called.
Could you share the commands you're running? Please also have a look into the
log files (esp. the
hadoop.log) - all fetches are logged and
also whether fetches have failed. If you cannot find a log message
for the
promising. Hope you enjoy the holiday!
>
> Joe
>
> -Original Message-----
> From: Sebastian Nagel
> Sent: Thursday, January 2, 2020 7:42 AM
> To: user@nutch.apache.org
> Subject: Re: Extracting XMP metadata from PDF for indexing Nutch 1.15
>
> Hi Joseph,
>
Hi Joseph,
this could be related to
https://issues.apache.org/jira/browse/NUTCH-2525
caused by not-all-lowercase meta keys.
I'm happy to check whether the attached patch fixes your problem
when I'm back from holidays in a few days.
Best,
Sebastian
On 12/31/19 5:43 PM, Gilvary, Joseph wrote:
Hi,
the test compares names of the "host" and the registered domain:
doc.getFieldValue('host')=='urgenthomework.com'
The host name is "www.urgenthomework.com". You can test it via:
$> bin/nutch indexchecker https://www.urgenthomework.com/
fetching: https://www.urgenthomework.com/
...
robots.txt whitelist not configured.
> Fetch failed with protocol status: gone(11), lastModified=0:
> https://www.avalonpontoons.com/
>
>
> On Tue, Dec 17, 2019 at 11:53 AM Sebastian Nagel
> wrote:
>
>> Hi Bob,
>>
>> the relevant Javadoc comment stands before the decla
Hi Bob,
the relevant Javadoc comment stands before the declaration of a variable (here
a constant):
/** Resource is gone. */
public static final int GONE = 11;
In more detail, GONE results from one of the following HTTP status codes:
400 Bad request
401 Unauthorized
410 Gone (*forever*
Hi Makkara,
> but I believe that this is the fault of the reducer
> Map input records=22048
> Map output records=4
The items are skipped in the mapper.
> Is this a known problem of Nutch 2.4, or have I just misconfigured
> something?
Could be the configuration
Hi Anton,
after a short look into MetadataIndexer:
- it does not request any fields from the webpage,
see getFields() method
- this is a bug (but already was in 2.3.1)
- it could be worked around by activating another
plugin which requests the METADATA field/column,
eg.
Hi Sachin,
> What I have observed is that it usually fetches, parses and indexes
> 1800 web pages.
This means 10 pages per minute.
How are the 1800 pages distributed over hosts?
The default delay between successive fetches to the same host is
5 seconds. If all pages belong to the same host,
do we have to call the updatedb command on the merged segment to update
the crawldb so that it has all the information for the next cycle?
Thanks
Sachin
On Tue, Oct 22, 2019 at 1:32 PM Sebastian Nagel
wrote:
Hi Sachin,
> I want to know once a new segment is generated is there any use of
> p
Hi Sachin,
> I want to know once a new segment is generated is there any use of
> previous segments and can they be deleted?
As soon as a segment is indexed and the CrawlDb is updated from this
segment, you may delete it. But keeping older segments allows
- reindexing in case something went
Hi Markus,
any updates on this? Just to make sure the issue gets resolved.
Thanks,
Sebastian
On 14.10.19 17:08, Markus Jelsma wrote:
Hello,
We're upgrading our stuff to 1.16 and got a peculiar problem when we started
indexing:
2019-10-14 13:50:30,586 WARN [main]
Hi Dave,
> the crawl script without the -i parameter, does that mean the crawl will
> run and complete without updating SOLR?
Yes.
> Then I'll use solrindex to push the crawled content into
> SOLR later, when I'm ready.
Better call "index", the command "solrindex" is deprecated,
in fact, it