Re: Help posting question

2024-04-25 Thread Sebastian Nagel
Hi Sheham, the nutch-site.xml configures mapreduce.task.timeout = 1800. 1.8 seconds (1800 milliseconds) is very short. The default is 600 seconds or 10 minutes, see [1]. Since Nutch needs to finish fetching before the task timeout applies, threads fetching not quickly enough and
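To make the units concrete, here is a minimal Python sketch of the arithmetic. It assumes, as the message states, that mapreduce.task.timeout is given in milliseconds; the helper name timeout_ms_to_seconds is made up for this illustration and is not part of Nutch or Hadoop.

```python
def timeout_ms_to_seconds(timeout_ms: int) -> float:
    """Convert a mapreduce.task.timeout value (milliseconds) to seconds."""
    return timeout_ms / 1000.0


# Value found in the nutch-site.xml under discussion.
configured = timeout_ms_to_seconds(1800)
# The default mentioned in the message: 600 seconds, i.e. 600000 ms.
default = timeout_ms_to_seconds(600_000)

print(f"configured: {configured} s")
print(f"default:    {default} s = {default / 60} min")
```

Read this way, a value of 1800 is likely a seconds-versus-milliseconds mix-up: 1800 seconds (30 minutes) would have to be written as 1800000.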

Re: [VOTE] Apache Nutch 1.20 Release

2024-04-11 Thread Sebastian Nagel
https://github.com/sebastian-nagel/nutch-test-single-node-cluster/ One note about the CHANGES.md: it's now a mixture of HTML and plain text. It does not use the potential of markdown, e.g. sections / headlines for the releases to make the change log navigable via a table of contents. The embedded

Re: truncation, parsing and indexing?

2023-10-23 Thread Sebastian Nagel
Hi Tim, >> I'm using the okhttp protocol, because I don't think the http protocol >> stores truncation information. protocol-http could mark truncations as well, however. Please, also open an issue for this and other protocol plugins. >> Should I open a ticket to have ParseSegment also

Re: Exclude HTML elements from Crawl

2023-09-23 Thread Sebastian Nagel
Hi Michael, > I wonder if there is not already a build-in option to exclude HTML > elements (like a div with a given id or class or other elements like header). No, there isn't one so far. > I know https://issues.apache.org/jira/browse/NUTCH-585 > I also do not understand why this little

Re: Change log file directory

2023-08-07 Thread Sebastian Nagel
Hi, yes, this is possible by pointing the environment variable NUTCH_LOG_DIR to a different folder. The default is: $NUTCH_HOME/logs/ See also the script bin/nutch which is called by bin/crawl: https://github.com/apache/nutch/blob/master/src/bin/nutch#L30 (it's also possible to change the log
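The fallback described here can be sketched in a few lines of Python. This only mirrors the behaviour stated in the message (use NUTCH_LOG_DIR if set, otherwise $NUTCH_HOME/logs); it is not the code of bin/nutch, and the function name and the example paths are hypothetical.

```python
import posixpath


def resolve_log_dir(env: dict) -> str:
    """Return NUTCH_LOG_DIR if set and non-empty, else $NUTCH_HOME/logs."""
    log_dir = env.get("NUTCH_LOG_DIR", "")
    if log_dir:
        return log_dir
    # Default location named in the message above.
    return posixpath.join(env["NUTCH_HOME"], "logs")


# Hypothetical installation paths, for illustration only.
print(resolve_log_dir({"NUTCH_HOME": "/opt/nutch"}))
print(resolve_log_dir({"NUTCH_HOME": "/opt/nutch",
                       "NUTCH_LOG_DIR": "/var/log/nutch"}))
```

In practice the variable would be exported in the shell before calling bin/crawl, so that bin/nutch inherits it.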

Re: Maximum header limit (1000) exceeded

2023-07-26 Thread Sebastian Nagel
, 2023 at 10:36 AM Sebastian Nagel wrote: Hi Steve, > file:/RMS/sha256/a0/ec/b0/a0/e0/ef/80/74/a0ecb0a0e0ef80747871563e2060b028c3abd330cb644ef7ee86fa9b133cbc67 what does the file contain? An .eml file (following RFC822)? Would it be possible to share this file or at least a chunk large e

Re: Maximum header limit (1000) exceeded

2023-07-26 Thread Sebastian Nagel
Hi Steve, > file:/RMS/sha256/a0/ec/b0/a0/e0/ef/80/74/a0ecb0a0e0ef80747871563e2060b028c3abd330cb644ef7ee86fa9b133cbc67 what does the file contain? An .eml file (following RFC822)? Would it be possible to share this file or at least a chunk large enough to reproduce the issue? The error

[ANNOUNCE] New Nutch committer and PMC - Tim Allison

2023-07-20 Thread Sebastian Nagel
Dear all, It is my pleasure to announce that Tim Allison has joined us as a committer and member of the Nutch PMC. You may already know Tim as a maintainer of and contributor to Apache Tika. So, it was great to see contributions to the Nutch source code from an experienced developer who is also

Re: Nutch 1.19 Getting Error: 'boolean org.apache.hadoop.io.nativeio.NativeIO$Windows.access0(java.lang.String, int)'

2023-05-15 Thread Sebastian Nagel
Hi Eric, unfortunately, on Windows you also need to download and install winutils.exe and hadoop.dll, see https://github.com/cdarlint/winutils and https://stackoverflow.com/questions/41851066/exception-in-thread-main-java-lang-unsatisfiedlinkerror-org-apache-hadoop-io The installation of

Re: Merging CrawlDBs

2023-02-02 Thread Sebastian Nagel
Hi Kamil, > I was wondering if this script is advisable to use? I haven't tried the script itself but some of the underlying commands - mergedb, etc. > merge command ($nutch_dir/nutch merge $index_dir $new_indexes) Of course, some of the commands are obsolete. Long time ago, Nutch used Lucene

Re: Unsubscribe from Users list

2023-01-25 Thread Sebastian Nagel
Hi, please send a mail to user-unsubscr...@nutch.apache.org See https://nutch.apache.org/community/mailing-lists/ Thanks! Best, Sebastian On 1/25/23 14:53, Steven Zhu wrote: Please unsubscribe me from the users list. Steven On Tue, Jan 24, 2023 at 10:27 PM Ankit gupta wrote:

Re: "Unparseable date" build issue with ANT on AWS EMR

2023-01-17 Thread Sebastian Nagel
owse/NUTCH-2974 Just in case you want to try it. ~Sebastian On 11/21/22 10:36, Sebastian Nagel wrote: Hi Kamil, thanks for trying and finding a solution! I've opened a JIRA issue to track the problem: https://issues.apache.org/jira/browse/NUTCH-2974 Thanks! Sebastian On 11/19/22 18:37, Kam

Re: Configuration Nutch in cluster mode

2023-01-17 Thread Sebastian Nagel
Hadoop cluster. All commands are the same as in fully distributed mode. If it helps, I prepared some setup scripts to run Nutch in pseudo-distributed mode: https://github.com/sebastian-nagel/nutch-test-single-node-cluster Best, Sebastian On 1/15/23 04:26, Mike wrote: I will now try to confi

Re: Nutch/Hadoop Cluster

2023-01-17 Thread Sebastian Nagel
Hi Mike, > It can be tedious to set up for the first time, and there are many components. In case you prefer Linux packages, I can recommend Apache Bigtop, see https://bigtop.apache.org/ and for the list of package repositories https://downloads.apache.org/bigtop/stable/repos/ ~Sebastian

Re: CSV indexer file data overwriting

2022-11-24 Thread Sebastian Nagel
Hi Paul, > the indexer was writing the > documents info in the file (nutch.csv) twice, Yes, I see. And now I know what I've overlooked: .../bin/nutch index -Dmapreduce.job.reduces=2 You need to run the CSV indexer with only a single reducer. In order to do so, please pass the option

Re: CSV indexer file data overwriting

2022-11-23 Thread Sebastian Nagel
Hi Paul, as far as I can see the indexer is run only once and now indexes 26 documents: org.apache.nutch.indexer.IndexingJob 2022-11-22 06:32:57,164 INFO o.a.n.i.IndexingJob [main] Indexer: 26 indexed (add/update) The logs also indicate that both segments are indexed at once:

Re: "Unparseable date" build issue with ANT on AWS EMR

2022-11-21 Thread Sebastian Nagel
Hi Kamil, thanks for trying and finding a solution! I've opened a JIRA issue to track the problem: https://issues.apache.org/jira/browse/NUTCH-2974 Thanks! Sebastian On 11/19/22 18:37, Kamil Mroczek wrote: I've been able to work around this issue by adding "pattern" to touch tag on line 101

[DISCUSS] Bug reporting - enabling Github issues?

2022-11-21 Thread Sebastian Nagel
Hi everybody, because of a growing number of spam account creations, public sign-ups to the Apache JIRA have been disabled. In order to allow users to report bugs, we have two options: 1 either users let us know about the issue on the mailing list and one of the Nutch PMC creates a user account

Re: CSV indexer file data overwriting

2022-11-21 Thread Sebastian Nagel
Hi Paul, yes, the CSV indexer removes the CSV output before it starts a new one. The problem here is that the indexer is run twice in a loop. Possible work-arounds - assumed you're using the script bin/crawl: 1 after each indexing command in the loop, move the CSV output so that it gets not

Re: Incomplete TLD List

2022-11-08 Thread Sebastian Nagel
Hi Mike, hi Markus, there's also https://issues.apache.org/jira/browse/NUTCH-1806 which would make it much easier to keep up-to-date with the public suffix list. Resp., because crawler-commons loads the public suffix list (for historic reasons named "effective_tld_names.dat") from the class

[ANNOUNCE] Apache Nutch 1.19 Release

2022-09-08 Thread Sebastian Nagel
The Apache Nutch team is pleased to announce the release of Apache Nutch v1.19. Nutch is a well matured, production ready Web crawler. Nutch 1.x enables fine grained configuration, relying on Apache Hadoop™ data structures. Source and binary distributions are available for download from the

[RESULT] was [VOTE] Release Apache Nutch 1.19 RC#1

2022-09-06 Thread Sebastian Nagel
Hi Folks, thanks to everyone who was able to review the release candidate! 72 hours have definitely passed, please see below for vote results. [4] +1 Release this package as Apache Nutch 1.19 Markus Jelsma * BlackIce * Jorge Betancourt * Sebastian Nagel * [0] -1 Do not release

Re: Nutch 1.19 schema.xml

2022-09-04 Thread Sebastian Nagel
nks > Mike > > Am Fr., 2. Sept. 2022 um 13:25 Uhr schrieb Sebastian Nagel > : > >> Hi Mike, >> >> the Nutch/Solr schema.xml will be updated with the release of 1.19 >> (expected >> soon, a vote about RC#1 is ongoing): >> [NUTCH-2955] - replace

Re: [VOTE] Release Apache Nutch 1.19 RC#1

2022-09-02 Thread Sebastian Nagel
cache. > > Since Ralf can compile it without problems, it seems to be an issue on my > machine only. So Nutch seems fine, therefore +1. > > Regards, > Markus > > [1] > https://repo1.maven.org/maven2/org/apache/httpcomponents/httpasyncclient/4.1.4/ > > > Op zo 2

Re: Nutch 1.19 schema.xml

2022-09-02 Thread Sebastian Nagel
Hi Mike, the Nutch/Solr schema.xml will be updated with the release of 1.19 (expected soon, a vote about RC#1 is ongoing): [NUTCH-2955] - replace deprecated/removed field type solr.LatLonType [NUTCH-2957] - add fall-back field definitions for unknown index fields [NUTCH-2956] - typos in field

Re: [VOTE] Release Apache Nutch 1.19 RC#1

2022-08-28 Thread Sebastian Nagel
ss] >>>> >>>> SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an >>>> explanation. >>>> SLF4J: Actual binding is of type >>>> [org.apache.logging.slf4j.Log4jLoggerFactory] >>>> >>>> And t

Re: [VOTE] Release Apache Nutch 1.19 RC#1

2022-08-28 Thread Sebastian Nagel
g4j system properly. > log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for > more info. > > I am worried about the indexer-elastic plugin, maybe others have that > problem too? Otherwise everything seems fine. > > Markus > > Op ma 22 aug. 2022 om 17:3

[VOTE] Release Apache Nutch 1.19 RC#1

2022-08-22 Thread Sebastian Nagel
://github.com/sebastian-nagel/nutch-test-single-node-cluster/)

Re: [DISCUSS] Release 1.19 ?

2022-08-10 Thread Sebastian Nagel
ote: > Sounds good! > > I see we're still at Tika 2.3.0, i'll submit a patch to upgrade to the > current 2.4.1. > > Thanks! > Markus > > Op di 9 aug. 2022 om 09:11 schreef Sebastian Nagel : > >> Hi all, >> >> more than 60 issues are done for Nutch 1

[DISCUSS] Release 1.19 ?

2022-08-09 Thread Sebastian Nagel
Hi all, more than 60 issues are done for Nutch 1.19 https://issues.apache.org/jira/projects/NUTCH/versions/12349580 including - important dependency upgrades - Hadoop 3.3.3 - Any23 2.7 - Tika 2.3.0 - plugin-specific URL stream handlers (NUTCH-2429) - migration - from Java/JDK 8

Re: Unable to create core Caused by: solr.LatLonType

2022-08-06 Thread Sebastian Nagel
Fyi, the issue is tracked on https://issues.apache.org/jira/browse/NUTCH-2955 ~Sebastian On 7/14/22 12:54, Sebastian Nagel wrote: > Hi Mike, > > if you do not use the plugin index-geoip, you could simply delete the line > > subFieldSuffix="_coordinate&

Re: Question about Nutch plugins

2022-07-24 Thread Sebastian Nagel
Hi Rastko, the description isn't really correct now as NUTCH_HOME is supposed to point to the runtime - if the binary package is used: this is the base folder of the package, eg. apache-nutch-1.18/ - if Nutch is built from the source, you usually point NUTCH_HOME to runtime/local/ - the

Re: Problem with Nutch <-> Eclipse

2022-07-19 Thread Sebastian Nagel
Hi Bob, could you share which instructions you followed and when the error happens - during import, project build, running/debugging? The usual way is 1. to write the Eclipse project configuration, run ant eclipse 2. import the written project configuration into Eclipse Building or running/debugging

Re: Unable to create core Caused by: solr.LatLonType

2022-07-14 Thread Sebastian Nagel
Hi Mike, if you do not use the plugin index-geoip, you could simply delete the line Otherwise, after the deprecation and the removal of the LatLonType class [1], it should be: But I haven't verified whether indexing with index-geoip enabled and the retrieval works. In any case,

Re: Does Nutch work with Hadoop Versions greater than 3.1.3?

2022-06-13 Thread Sebastian Nagel
Hi Michael, Nutch (1.18, and trunk/master) should work together with more recent Hadoop versions. At Common Crawl we use a modified Nutch version based on the recent trunk running on Hadoop 3.2.2 (soon 3.2.3) and Java 11, even on a mixed Hadoop cluster with x64 and arm64 AWS EC2 instances. But

Re: FW: After update from 1.11 to 1.13 form login does not work

2022-05-10 Thread Sebastian Nagel
Hi Michael, the only differences in the protocol-httpclient plugin between Nutch 1.11 and 1.13 are - NUTCH-2280 [1] which allows to configure the cookie policy - NUTCH-2355 [2] which allows to set an explicit cookie for a request URL Could this be related? Are there any useful hints what could

Re: Nutch not crawling all URLs

2022-01-13 Thread Sebastian Nagel
> processing. > > Kind regards, > Roseline > > > > > > -Original Message- > From: Sebastian Nagel > Sent: 12 January 2022 16:12 > To: user@nutch.apache.org > Subject: Re: Nutch not crawling all URLs > > Hi Roseline, > >> the

Re: Nutch not crawling all URLs

2022-01-12 Thread Sebastian Nagel
r > than it will be truncated; otherwise, no truncation at all. Do not > confuse this setting with the file.content.limit setting. > > > > db.ignore.external.links.mode > byHost > > > db.injector.overwrite > true > > > http.timeout

Re: Nutch not crawling all URLs

2021-12-13 Thread Sebastian Nagel
Hi Ayhan, you mean? https://stackoverflow.com/questions/69352136/nutch-does-not-crawl-sites-that-allows-all-crawler-by-robots-txt Sebastian On 12/13/21 20:59, Ayhan Koyun wrote: > Hi, > > as I wrote before, it seems that I am not the only one who can not crawl all > the seed.txt url's. I

Re: Nutch not crawling all URLs

2021-12-13 Thread Sebastian Nagel
Hi Roseline, > 5,36405,0,http://www.notco.com What is the status for https://notco.com/ which is the final redirect target? Is the target page indexed? ~Sebastian

Re: Nutch not crawling all URLs

2021-12-13 Thread Sebastian Nagel
Antai > Research Fellow > Hunter Centre for Entrepreneurship > Strathclyde Business School > University of Strathclyde, Glasgow, UK > > > The University of Strathclyde is a charitable body, registered in Scotland, > number SC015263. > > > -Original Message-

Re: Error When Connecting Elasticsearch with HTTPS Connection

2021-11-18 Thread Sebastian Nagel
Hi Shi Wei, fyi: a fix for NUTCH-2903 is ready https://github.com/apache/nutch/pull/703 Sebastian On 11/16/21 13:54, Sebastian Nagel wrote: > Hi Shi Wei, > > looks like you're the first trying to connect to ES from Nutch over > HTTPS. HTTP is used as default scheme and the

Re: javax.net.ssl.SSLHandshakeException Error when Executing Nutch with Selenium Plugin

2021-11-18 Thread Sebastian Nagel
The issue is now tracked in https://issues.apache.org/jira/browse/NUTCH-2907 On 10/28/21 15:31, Sebastian Nagel wrote: > Hi Shi Wei, > > sorry, but it looks like the Selenium protocol plugin has never been > used with a proxy over https. There are two points which need (at a &g

Re: encrypt password of the index-writer.xml

2021-11-17 Thread Sebastian Nagel
> following in the log4j.properties but it doesn't help. >   > log4j.logger.org.apache.nutch.indexwriter.elastic.ElasticIndexWriter=WARN,cmdstdout > log4j.logger.org.apache.nutch.indexwriter.elastic.ElasticUtils=WARN,cmdstdout >   >   > Best Regards, > Shi Wei >   > On 202

Re: Error When Connecting Elasticsearch with HTTPS Connection

2021-11-16 Thread Sebastian Nagel
Hi Shi Wei, looks like you're the first trying to connect to ES from Nutch over HTTPS. HTTP is used as default scheme and there is no way to configure the Elasticsearch index writer to use HTTPS. Please open a Jira issue. It's a trivial fix. For a quick fix: in the Nutch source package (or

Re: JEXL unable to handle "if" statements?

2021-11-11 Thread Sebastian Nagel
Hi Max, fyi, the Jira issue is created: https://issues.apache.org/jira/browse/NUTCH-2902 (to make sure that this is not forgotten) Thanks, Sebastian On 10/11/21 18:11, Sebastian Nagel wrote: > Hi Max, > >> I was able to fix this by switching from JexlExpression to JexlScrip

Re: encrypt password of the index-writer.xml

2021-11-11 Thread Sebastian Nagel
Hi Shi Wei, there is a way, although definitely not the recommended one. Sorry, and it took me a little bit to prove it. Do you know about external XML entities or XXE attacks? 1. On top of the index-writers.xml you add an entity declaration: ]> 2. it's used later in the index writer spec:

Re: javax.net.ssl.SSLHandshakeException Error when Executing Nutch with Selenium Plugin

2021-10-28 Thread Sebastian Nagel
Hi Shi Wei, sorry, but it looks like the Selenium protocol plugin has never been used with a proxy over https. There are two points which need (at a first glance) a rework: 1. the protocol tries to establish a TLS/SSL connection to the proxy if the URL to be crawled is a https:// URL. There

Re: Encrypt or Mask the password

2021-10-25 Thread Sebastian Nagel
Authentication Scheme > > Your sincerely, > Shi Wei > > -Original Message- > From: Sebastian Nagel > Sent: Monday, 25 October, 2021 5:31 PM > To: user@nutch.apache.org > Subject: Re: Encrypt or Mask the password > > Hi Shi Wei, > > for t

Re: Encrypt or Mask the password

2021-10-25 Thread Sebastian Nagel
Hi Shi Wei, for the nutch-site.xml it's possible to use Java properties and/or environment variables, see section "Variable expansion" in https://hadoop.apache.org/docs/r3.3.1/api/org/apache/hadoop/conf/Configuration.html In case you're asking about index-writers.xml - variable expansion

Re: Cant integrate the kerberos enabled solr cloud with nutch

2021-10-22 Thread Sebastian Nagel
Hi Shi Wei, could you also share the index writer configuration (conf/index-writers.xml)? The default is unauthenticated access to Solr, see the snippet below. The file httpclient-auth.xml is not relevant for the Solr indexer, it's used if a crawled web site requires authentication in order to

Re: JEXL unable to handle "if" statements?

2021-10-11 Thread Sebastian Nagel
Hi Max, > I was able to fix this by switching from JexlExpression to JexlScript. I > have a small patch that I'm happy to contribute! Yes, that would be great! Please open also a Jira issue so that the problem shows up in the Changelog. Thanks! Best, Sebastian On 10/11/21 6:34 AM, Max

Re: OkHttp NoClassDefFoundError: okhttp3/Authenticator

2021-07-24 Thread Sebastian Nagel
Hi Markus, the okhttp protocol plugin should work out-of-the-box and we use it in production (currently on Hadoop 3.2.2) I remember that I had once an issue with the Hadoop library having okhttp as a dependency which then caused a conflict. It was solved by adding an exclusion rule to the

Re: Running Nutch on Hadoop with S3 filesystem; 'URLNormlizer not found'

2021-07-15 Thread Sebastian Nagel
Hi Clark, thanks for summarizing this discussion and sharing the final configuration! Good to know that it's possible to run Nutch on Hadoop using S3A without using HDFS (no namenode/datanodes running). Best, Sebastian

Re: Running Nutch on Hadoop with S3 filesystem; 'URLNormlizer not found'

2021-06-15 Thread Sebastian Nagel
> The local file system? Or hdfs:// or even s3:// resp. s3a://? Also important: the value of "mapreduce.job.dir" - it's usually on hdfs:// and I'm not sure whether the plugin loader is able to read from other filesystems. At least, I haven't tried. On 6/15/21 10:53 AM, Sebastia

Re: Running Nutch on Hadoop with S3 filesystem; 'URLNormlizer not found'

2021-06-15 Thread Sebastian Nagel
Hi Clark, sorry, I should read your mail until the end - you mentioned that you downgraded Nutch to run with JDK 8. Could you share to which filesystem does NUTCH_HOME point? The local file system? Or hdfs:// or even s3:// resp. s3a://? Best, Sebastian On 6/15/21 10:24 AM, Clark Benham

Re: Running Nutch on Hadoop with S3 filesystem; 'URLNormlizer not found'

2021-06-15 Thread Sebastian Nagel
Hi Clark, the class URLNormalizer is not in a plugin - it's part of Nutch core and defines the interface for URL normalizer plugins. Looks like there's something wrong fundamentally, not only with the plugins. > I am trying to run Nutch-1.19 on hadoop-3.2.1 with an S3 Are you aware that the

Re: Apache Nutch help request for a school project :)

2021-06-07 Thread Sebastian Nagel
Hi Gorkem, I haven't verified it by trying - but it may be that given your configuration the Solr instance isn't reachable via http://localhost:8983/solr/nutch Inside the Docker network, host names are the same as container names, that is http://solr:8983/solr/nutch might work. Cf. the

Re: Recommendation for free and production-ready Hadoop setup to run Nutch

2021-06-04 Thread Sebastian Nagel
Hi Lewis, hi Markus, > snappy compression, which is a massive improvement for large data shuffling jobs Yes, I can confirm this. Also: it's worth considering zstd for all data kept for longer. We use it for a 25-billion CrawlDB: it's almost as fast (both compression and decompression) as

Re: DuplexWeb-Google - GoogleBot Crawler For Duplex / Google Assistant

2021-06-04 Thread Sebastian Nagel
Thanks! Interesting that the DuplexWeb bot ignores the wildcard user agent rules by default. On 6/3/21 11:44 PM, lewis john mcgibbney wrote: Some interesting content for a short read :)

Recommendation for free and production-ready Hadoop setup to run Nutch

2021-06-01 Thread Sebastian Nagel
-data-europe/docker-hadoop [2] https://github.com/sebastian-nagel/docker-hadoop/tree/2.0.0-hadoop3.3.0-java11

Re: Adding html field to NutchDocument

2021-06-01 Thread Sebastian Nagel
ile. Although looking at it now it's clear. This makes it easier for me to access the html content within my plugin, thanks again On Fri, May 28, 2021 at 8:36 PM Sebastian Nagel wrote: Hi Kieran, see the command-line options -addBinaryContent index raw/binary con

Re: Adding html field to NutchDocument

2021-05-28 Thread Sebastian Nagel
Hi Kieran, see the command-line options -addBinaryContent index raw/binary content in field `binaryContent` -base64 use Base64 encoding for binary content of the Nutch index job [1]. Note that the content may indeed be binary, eg. for PDF documents but also

Re: Crawling same domain URL's

2021-05-11 Thread Sebastian Nagel
Hi Prateek, alternatively, you could modify the URLPartitioner [1], so that during the "generate" step the URLs of a specific host or domain are distributed over more partitions. One partition is the fetch list of one fetcher map task. At Common Crawl we partition by domain and made the

Re: Redirection behavior

2021-05-06 Thread Sebastian Nagel
yfro.com/> 2021-05-05 17:35:30,786 INFO [main] org.apache.nutch.fetcher.FetcherThread: FetcherThread 50 has no more work available/ I am not sure what I am missing. Regards Prateek On Thu, May 6, 2021 at 10:21 AM Sebastian Nagel mailto:wastl.na...@googlemail.com>> wrote: Hi Prateek, could you share

Re: Writing Nutch data in Parquet format

2021-05-05 Thread Sebastian Nagel
Hi Lewis, > 2) post-processing (Nutch) Hadoop sequence data by converting it to Parquet format? Yes, but not directly - it's a multi-step process. The outcome: https://commoncrawl.org/2018/03/index-to-warc-files-and-urls-in-columnar-format/ This Parquet index is optimized by sorting the

Re: googled for ever and still can't figure it out

2021-03-15 Thread Sebastian Nagel
Hi Andrew, > if this flag is used *--sitemaps-from-hostdb always* Do the crawled hosts announce the sitemap in their robots.txt? If not, do the sitemap URLs follow the pattern http://example.com/sitemap.xml ? See https://cwiki.apache.org/confluence/display/NUTCH/SitemapFeature If this is

Re: Extract all image and video links from a web page

2021-01-27 Thread Sebastian Nagel
Hi Prateek, are there any URL filters which filter away image links? You can verify this using the URL filter checker: echo "https://example.com/image.jpg" \ | bin/nutch filterchecker -stdin The default rules in conf/regex-urlfilter.txt exclude common image suffixes. Note that there can
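How a suffix rule ends up dropping image links can be illustrated with a small Python sketch. It assumes the usual reading of regex URL filter rules: a leading "-" excludes, a leading "+" includes, the first matching rule decides, and a URL matching no rule is rejected. The two rules below are illustrative stand-ins, not the contents of the shipped conf/regex-urlfilter.txt, and accepts() is a made-up name.

```python
import re

# Illustrative rules only; the real file lists more suffixes and rules.
RULES = [
    ("-", re.compile(r"\.(gif|jpg|jpeg|png|bmp)$", re.IGNORECASE)),  # skip images
    ("+", re.compile(r".")),                                          # accept the rest
]


def accepts(url: str) -> bool:
    """Return True if the first matching rule is an include ('+') rule."""
    for sign, pattern in RULES:
        if pattern.search(url):
            return sign == "+"
    return False  # assumption: no matching rule means the URL is filtered out


print(accepts("https://example.com/image.jpg"))
print(accepts("https://example.com/page.html"))
```

If image links are wanted, the corresponding suffixes would have to be removed from the exclusion rule, and the result checked again with the filterchecker command quoted above.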

Re: NUTCH-2353

2020-12-06 Thread Sebastian Nagel
Hi, no, NUTCH-2353 is still open, see https://issues.apache.org/jira/projects/NUTCH/issues/NUTCH-2353 The implementation caused a regression, so it was reverted. Best, Sebastian On 12/6/20 7:03 AM, Von Kursor wrote: > Hello > > Has this API enhancement been implemented under 1.17 ? > > I

Re: Nutch 2.4 with selenium

2020-10-10 Thread Sebastian Nagel
Hi, > Nutch 2.4 with selenium Nutch 2.4 does not include any plugin to use Selenium. In addition, 2.4 is for now the last release on the 2.x branch which is not maintained anymore. You should use 1.x (1.17 is the most recent release). > standalone nutch crawling with selenium. For 1.x there's

Re: Unable to get search result using Javascript client..

2020-10-01 Thread Sebastian Nagel
Hi, this question is better asked on the Solr user mailing list as Nutch people are not necessarily familiar with Solr on a deep level. Please also share more details - which JavaScript client, the error message, the log messages of the Solr server at this time. This helps to trace the error

Re: Regarding Nutch Hadoop Cluster Setup in Deploy Mode

2020-09-08 Thread Sebastian Nagel
from Mail<https://go.microsoft.com/fwlink/?LinkId=550986> for Windows 10 > > From: Sebastian Nagel<mailto:wastl.na...@googlemail.com.INVALID> > Sent: Tuesday, August 11, 2020 4:56 PM > To: user@nutch.apache.org<mailto:user@nutch.apache.org> > Subject: Re: Regarding N

Re: Unable to index on Hadoop 3.2.0 with 1.16

2020-08-12 Thread Sebastian Nagel
Hi Joe, > I eliminated it when I updated the index-writers.xml for the solr_indexer_1 > to use only a single URL. Thanks for the hint. I'm able to reproduce the error by adding an overlong URL to Could you open an issue to fix this on https://issues.apache.org/jira/projects/NUTCH ?

Re: Regarding Nutch Hadoop Cluster Setup in Deploy Mode

2020-08-11 Thread Sebastian Nagel
Hi, Nutch does not include a search component anymore. These steps are obsolete. All you need is to setup your Hadoop cluster, then run $NUTCH_HOME/runtime/deploy/bin/nutch ... (instead of .../runtime/local/bin/nutch ...) Alternatively, you could launch a Nutch tool, eg. Injector the

[ANNOUNCE] New Nutch committer and PMC - Shashanka Balakuntala Srinivasa

2020-07-28 Thread Sebastian Nagel
Dear all, it is my pleasure to announce that Shashanka Balakuntala Srinivasa has joined us as a committer and member of the Nutch PMC. Shashanka Balakuntala has worked recently on a longer list of Nutch issues and improvements. Thanks, Shashanka Balakuntala, and congratulations on your new role

Re: Apache Nutch 1.16 Fetcher reducers?

2020-07-27 Thread Sebastian Nagel
o conversion to fetch Job directly > so see if there are some improvements. > > I have also concluded this discussion here - > https://stackoverflow.com/questions/63003881/apache-nutch-1-16-fetcher-reducers/. > So if you want to add something here, please feel free to do so. > > Regard

Re: Apache Nutch 1.16 Fetcher reducers?

2020-07-21 Thread Sebastian Nagel
Fetcher will be directly creating the final >> avro format that I need. So the only question remains is that if I do >> fetcher.parse=true, can I get rid of parse Job as a separate step >> completely. >> >> Regards >> Prateek >> >> On Tue, Jul 21, 2020

Re: Apache Nutch 1.16 Fetcher reducers?

2020-07-21 Thread Sebastian Nagel
avro conversion step, we just convert data into avro schema > and dump to HDFS. Do you think we still need reducers in the fetch phase? > FYI- I tried running with 0 reducers and don't see any impact as > such. > > Appreciate your help. > > Regards > Prateek > > On Tu

Re: Apache Nutch 1.16 Fetcher reducers?

2020-07-21 Thread Sebastian Nagel
Hi Prateek, you're right there is no specific reducer used but without a reduce step the segment data isn't (re)partitioned and the data isn't sorted. This was a strong requirement once Nutch was a complete search engine and the "content" subdir of a segment was used as page cache. Getting the

[ANNOUNCE] Apache Nutch 1.17 Release

2020-07-02 Thread Sebastian Nagel
The Apache Nutch team is pleased to announce the release of Apache Nutch v1.17. Nutch is a well matured, production ready Web crawler. Nutch 1.x enables fine grained configuration, relying on Apache Hadoop™ data structures. Source and binary distributions are available for download from the

[RESULT] was [VOTE] Release Apache Nutch 1.17 RC#1

2020-07-01 Thread Sebastian Nagel
Hi Folks, thanks to everyone who was able to review the release candidate! 72 hours have passed, please see below for vote results. [4] +1 Release this package as Apache Nutch 1.17 Markus Jelsma * Furkan Kamaci * Shashanka Balakuntala Srinivasa Sebastian Nagel * [0] -1 Do

Re: protocol-interactiveselenium Custom Handler

2020-06-25 Thread Sebastian Nagel
Hi Craig, in case, you're building Nutch from the git repo or from the source package the easiest way is to put the file NewCustomHandler.java into src/plugin/protocol-interactiveselenium/src/java/.../handlers/ and run ant runtime to compile and package Nutch including package your custom

[VOTE] Release Apache Nutch 1.17 RC#1

2020-06-18 Thread Sebastian Nagel
Hi Folks, A first candidate for the Nutch 1.17 release is available at: https://dist.apache.org/repos/dist/dev/nutch/1.17/ The release candidate is a zip and tar.gz archive of the binary and sources in: https://github.com/apache/nutch/tree/release-1.17 In addition, a staged maven

Preparing to release 1.17

2020-06-16 Thread Sebastian Nagel
Hi, the list of open issues for 1.17 became short, and I will move some of the remaining issues to 1.18 to clear the way and prepare the first release candidate in the next two days. If there are urgent fixes (including a PR / patch), let me know! Thanks, Sebastian

Re: Nutch 1.17 download available?

2020-06-08 Thread Sebastian Nagel
Hi Jim, Nutch 1.17 should land soon but there are a couple of issues to be fixed before the release. Best, Sebastian On 6/8/20 12:11 AM, Lewis John McGibbney wrote: > Hi Jim, > Response below > > On 2020/06/06 14:23:24, Jim Anderson wrote: >> >> I cannot find a download for Nutch 1.17. Is

Re: [Non-DoD Source] Re: [DISCUSS] Release 1.17 ? (UNCLASSIFIED)

2020-04-23 Thread Sebastian Nagel
> >> >> user Digest 23 Apr 2020 06:27:46 - Issue 3055 >> >> Topics (messages 34517 through 34517) >> >> [DISCUSS] Release 1.17 ? >> 34517 by: Sebastian Nagel >> >> Administrivia: >> >> --

[DISCUSS] Release 1.17 ?

2020-04-23 Thread Sebastian Nagel
Hi all, 30 issues are done now https://issues.apache.org/jira/browse/NUTCH/fixforversion/12346090 including a number of important dependency upgrades: - Hadoop 3.1 (NUTCH-2777) - Elasticsearch 7.3.0 REST client (NUTCH-2739) Thanks to Shashanka Balakuntala Srinivasa for both! Dependency

Re: finding broken links with nutch 1.14

2020-03-03 Thread Sebastian Nagel
Hi Robert, 404s are recorded in the CrawlDb after the tool "updatedb" is called. Could you share the commands you're running? Please also have a look into the log files (esp. the hadoop.log) - all fetches are logged and also whether fetches have failed. If you cannot find a log message for the

Re: Extracting XMP metadata from PDF for indexing Nutch 1.15

2020-01-15 Thread Sebastian Nagel
promising. Hope you enjoy the holiday! > > Joe > > -Original Message----- > From: Sebastian Nagel > Sent: Thursday, January 2, 2020 7:42 AM > To: user@nutch.apache.org > Subject: Re: Extracting XMP metadata from PDF for indexing Nutch 1.15 > > Hi Joseph, >

Re: Extracting XMP metadata from PDF for indexing Nutch 1.15

2020-01-02 Thread Sebastian Nagel
Hi Joseph, this could be related to https://issues.apache.org/jira/browse/NUTCH-2525 caused by not-all-lowercase meta keys. I'm happy to check whether the attached patch fixes your problem when I'm back from holidays in a few days. Best, Sebastian On 12/31/19 5:43 PM, Gilvary, Joseph wrote:

Re: Fwd: Crawling 3 websites from one nutch

2019-12-27 Thread Sebastian Nagel
Hi, the test compares names of the "host" and the registered domain: doc.getFieldValue('host')=='urgenthomework.com' The host name is "www.urgenthomework.com". You can test it via: $> bin/nutch indexchecker https://www.urgenthomework.com/ fetching: https://www.urgenthomework.com/ ...
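The mismatch is easy to reproduce outside Nutch. The Python sketch below only mirrors the string comparison from the expression quoted above; it does not run JEXL or the Nutch indexer, and host_of() is a name made up for this illustration.

```python
from urllib.parse import urlparse


def host_of(url: str) -> str:
    """Return the host name part of a URL."""
    return urlparse(url).hostname


url = "https://www.urgenthomework.com/"
host = host_of(url)

print(host)
# The expression compares the host against the registered domain:
print(host == "urgenthomework.com")
# Comparing against the full host name matches:
print(host == "www.urgenthomework.com")
```

So an equality test on the "host" field has to use the full host name, including the "www." prefix, or the test has to be written against a field that actually holds the registered domain.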

Re: Fetch failed with protocol status: gone(11)

2019-12-17 Thread Sebastian Nagel
robots.txt whitelist not configured. > Fetch failed with protocol status: gone(11), lastModified=0: > https://www.avalonpontoons.com/ > > > On Tue, Dec 17, 2019 at 11:53 AM Sebastian Nagel > wrote: > >> Hi Bob, >> >> the relevant Javadoc comment stands before the decla

Re: Fetch failed with protocol status: gone(11)

2019-12-17 Thread Sebastian Nagel
Hi Bob, the relevant Javadoc comment stands before the declaration of a variable (here a constant): /** Resource is gone. */ public static final int GONE = 11; More detailed, GONE results from one of the following HTTP status codes: 400 Bad request 401 Unauthorized 410 Gone (*forever*

Re: Map reducer filtering too many sites during generation in Nutch 2.4

2019-11-14 Thread Sebastian Nagel
Hi Makkara, > but I believe that this is the fault of the reducer > Map input records=22048 > Map output records=4 The items are skipped in the mapper. > Is this a known problem of Nutch 2.4, or have I just misconfigured > something? Could be the configuration

Re: Metadata not indexed after migrating to Nutch 2.4

2019-11-11 Thread Sebastian Nagel
Hi Anton, after a short look into MetadataIndexer: - it does not request any fields from the webpage, see getFields() method - this is a bug (but already was in 2.3.1) - it could be worked around by activating another plugin which requests the METADATA field/column, eg.

Re: Best and economical way of setting hadoop cluster for distributed crawling

2019-11-01 Thread Sebastian Nagel
Hi Sachin, > What I have observed is that it usually fetches, parses and indexes > 1800 web pages. This means 10 pages per minute. How are the 1800 pages distributed over hosts? The default delay between successive fetches to the same host is 5 seconds. If all pages belong to the same host,
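A rough Python calculation shows why the per-host delay matters. It assumes the worst case named above (all pages on a single host, fetched one after another) and ignores download and parsing time; the function names are invented for this sketch.

```python
def max_pages_per_minute(delay_seconds: float) -> float:
    """Upper bound on the fetch rate for one host, given the politeness delay."""
    return 60.0 / delay_seconds


def min_minutes_for(pages: int, delay_seconds: float) -> float:
    """Approximate lower bound on fetch time when all pages are on one host."""
    return pages * delay_seconds / 60.0


delay = 5.0   # default delay between successive fetches to the same host
pages = 1800  # number of pages mentioned in the question

print(max_pages_per_minute(delay), "pages per minute at most")
print(min_minutes_for(pages, delay), "minutes at least")
```

Under these assumptions a single host cannot yield more than about 12 pages per minute, so an observed rate of 10 pages per minute is close to the limit set by the delay rather than by the hardware.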

Re: what happens to older segments

2019-10-22 Thread Sebastian Nagel
do we have to call the updatedb command on the merged segment to update the crawldb so that it has all the information for next cycle. Thanks Sachin On Tue, Oct 22, 2019 at 1:32 PM Sebastian Nagel wrote: Hi Sachin, > I want to know once a new segment is generated is there any use of > p

Re: what happens to older segments

2019-10-22 Thread Sebastian Nagel
Hi Sachin, > I want to know once a new segment is generated is there any use of > previous segments and can they be deleted? As soon as a segment is indexed and the CrawlDb is updated from this segment, you may delete it. But keeping older segments allows - reindexing in case something went

Re: Unable to index on Hadoop 3.2.0 with 1.16

2019-10-22 Thread Sebastian Nagel
Hi Markus, any updates on this? Just to make sure the issue gets resolved. Thanks, Sebastian On 14.10.19 17:08, Markus Jelsma wrote: Hello, We're upgrading our stuff to 1.16 and got a peculiar problem when we started indexing: 2019-10-14 13:50:30,586 WARN [main]

Re: Crawl Command Question

2019-10-19 Thread Sebastian Nagel
Hi Dave, > the crawl script without the -i parameter, does that mean the crawl will > run and complete without updating SOLR? Yes. > Then I'll use solrindex to push the crawled content into > SOLR later, when I'm ready. Better call "index", the command "solrindex" is deprecated, in fact, it
