Re: Parsed segment has outlinks filtered

2019-10-18 Thread Sebastian Nagel
Hi Sachin, practically every Nutch tool (inject, generate, fetch, parse, update, index) can filter (and normalize) URLs. Because filtering and normalizing are expensive, only the steps which add new URLs (inject and parse) do this by default (see bin/crawl). For your use case you might instead
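For example, to shift the work to the CrawlDb update step instead (a sketch only; flag names vary slightly between versions, and "$SEGMENT" stands for a segment path):

  bin/nutch parse "$SEGMENT" -noFilter -noNormalize
  bin/nutch updatedb crawl/crawldb "$SEGMENT" -filter -normalize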

Re: Unable to index on Hadoop 3.2.0 with 1.16

2019-10-14 Thread Sebastian Nagel
Hi Markus, I've tested in pseudo-distributed mode with Hadoop 3.2.1, including indexing into Solr. It worked. Could be a dependency version issue similar to the one causing NUTCH-2706. But that's only an assumption. Since IndexWriters.describe() is only used for the help output, I would just deactivate this

Re: metatags missing with parse-html

2019-10-14 Thread Sebastian Nagel
Hi Dave, could you share an example document? Which Nutch version is used? I tried to reproduce the problem without success using Nutch v1.16: - example document: Test metatags test for metatag extraction - using parse-html (works) > bin/nutch indexchecker -Dmetatags.names='*' \
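One plausible completion of the command above (the plugin list and metatag names are assumptions, not from the original mail):

  bin/nutch indexchecker -Dmetatags.names='*' \
    -Dplugin.includes='protocol-http|parse-(html|metatags)|index-(basic|metadata)' \
    -Dindex.parse.md='metatag.description,metatag.keywords' \
    https://example.com/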

[ANNOUNCE] Apache Nutch 1.16 Release

2019-10-11 Thread Sebastian Nagel
Hi folks! The Apache Nutch [0] Project Management Committee are pleased to announce the immediate release of Apache Nutch v1.16. We advise all current users and developers to upgrade to this release. Nutch is a well matured, production ready Web crawler. Nutch 1.x enables fine grained

[RESULT] was [VOTE] Release Apache Nutch 1.16 RC#1

2019-10-08 Thread Sebastian Nagel
Betancourt Gonzalez * Sebastian Nagel * [0] -1 Do not release this package because ... * Nutch PMC The VOTE passes with 6 binding votes from Nutch PMC members. I'll continue and publish the release packages. Tomorrow, after the packages have been propagated to all mirrors, I'll send

[RESULT] was [VOTE] Release Apache Nutch 2.4 RC#1

2019-10-08 Thread Sebastian Nagel
Hi Folks, thanks to everyone who was able to review the release candidate! 72 hours have passed, please see below for vote results. [4] +1 Release this package as Apache Nutch 2.4 Lewis John McGibbney * Jorge Luis Betancourt Gonzalez * Furkan Kamaci * Sebastian Nagel * [0] -1 Do

Re: [VOTE] Release Apache Nutch 1.16 RC#1

2019-10-04 Thread Sebastian Nagel
Hi Markus,

> 2019-10-03 12:48:49,696 INFO crawl.Generator - Generator: number of items rejected during selection:
> 2019-10-03 12:48:49,698 INFO crawl.Generator - Generator: 1 SCHEDULE_REJECTED

see NUTCH-2737 "Generator: count and log reason of rejections during selection" - useful

[VOTE] Release Apache Nutch 1.16 RC#1

2019-10-02 Thread Sebastian Nagel
Hi Folks, A first candidate for the Nutch 1.16 release is available at: https://dist.apache.org/repos/dist/dev/nutch/1.16/ The release candidate is a zip and tar.gz archive of the binary and sources in: https://github.com/apache/nutch/tree/release-1.16 In addition, a staged maven

Re: [VOTE] Release Apache Nutch 2.4 RC#1

2019-10-01 Thread Sebastian Nagel
is loaded including the version number. I've opened https://issues.apache.org/jira/browse/NUTCH-2741 to remove it. Best, Sebastian On 28.09.19 17:54, lewis john mcgibbney wrote: > Hi Seb, > > On Thu, Sep 26, 2019 at 4:37 AM wrote: > >> From: Sebastian Nagel >> To: user@n

[VOTE] Release Apache Nutch 2.4 RC#1

2019-09-24 Thread Sebastian Nagel
Hi Folks, A first candidate for the Nutch 2.4 release is available at: https://dist.apache.org/repos/dist/dev/nutch/2.4/ The release candidate is a zip and tar.gz archive of sources in: https://github.com/apache/nutch/tree/release-2.4 In addition, a staged maven repository is available

Re: parser.html.NodesToExclud

2019-09-12 Thread Sebastian Nagel
Hi Dave, the boilerplate removal (boilerpipe) works if parse-tika is used for parsing, but the parser.html.NodesToExclude property belongs to a feature which never made it into the code base, see https://issues.apache.org/jira/browse/NUTCH-585 Or do you work with a patched version? Best,

[DISCUSS] Release 1.16?

2019-09-02 Thread Sebastian Nagel
Hi all, more than 90 issues are fixed now: https://issues.apache.org/jira/projects/NUTCH/versions/12343430 The last release (1.15) is already more than one year ago (July 25, 2018). It's time! Of course, we'll check all remaining issues whether they should be fixed now or can be moved to be

Re: Few inner links are not opening.

2019-08-20 Thread Sebastian Nagel
log of any issues that need to be resolved for the wiki? > > Regards, > Sid > > -Original Message- > From: Sebastian Nagel > Sent: August 10, 2019 2:43 AM > To: user@nutch.apache.org > Subject: Re: Few inner links are not openi

Re: Few inner links are not opening.

2019-08-20 Thread Sebastian Nagel
, Sadiki Latty wrote: > Hey Sebastian, > > I have signed up for an account I will try to help out where/when I can. Is > there a list/backlog of any issues that need to be resolved for the wiki? > > Regards, > Sid > > -Original Message----- > From: Sebastian Nagel &

Re: Few inner links are not opening.

2019-08-10 Thread Sebastian Nagel
Thanks, it's fixed now. The wiki has been migrated recently and it looks like the inner links haven't been properly converted. If anybody is eager to help us and improve the Nutch wiki - you're welcome! Please apply for an account in the wiki. Nutch is a community project and we need your help.

Re: Nutch Wiki migrated

2019-07-26 Thread Sebastian Nagel
s, > Furkan KAMACI > > On Fri, Jul 26, 2019 at 12:39 PM Sebastian Nagel > wrote: > > Hi all, > > the Nutch wiki has been migrated from MoinMoin to Confluence. > > You'll find it now on >   https://cwiki.apache.org/confluence/display/NUTCH/Hom

Nutch Wiki migrated

2019-07-26 Thread Sebastian Nagel
Hi all, the Nutch wiki has been migrated from MoinMoin to Confluence. You'll find it now on https://cwiki.apache.org/confluence/display/NUTCH/Home Work on improving the Wiki - updating information and moving outdated stuff into "Archive and Legacy" - is ongoing. Help is welcome, if you want

Re: Need Nutch to Index to Different Folder

2019-07-23 Thread Sebastian Nagel
Hi, if server S3 has Solr running, this would be a simple change:
- (Nutch 1.14) just change the property solr.server.url
- (Nutch 1.15) see https://wiki.apache.org/nutch/IndexWriters
Best, Sebastian On 7/22/19 5:30 PM, Rushi wrote: > Hi All, > I need some help on this ,I have two different
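A minimal sketch for the Nutch 1.14 case (host and core name are assumptions):

  bin/nutch index -Dsolr.server.url=http://s3.example.org:8983/solr/nutch \
    crawl/crawldb -linkdb crawl/linkdb -dir crawl/segments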

Re: multiple values encountered for non multiValued field keywords

2019-07-17 Thread Sebastian Nagel
Hi Ryan, could be caused by the managed schema. Note for Solr 7.x updating the schema.xml alone may not be sufficient, see https://wiki.apache.org/nutch/NutchTutorial#Setup_Solr_for_search Let us know whether this works. Thanks! And we'll update the wiki page, resp. in the new wiki:
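The tutorial's Solr setup boils down to roughly these steps (a sketch; paths assume a stock Solr 7.x install and the usual SOLR_HOME/NUTCH_HOME variables):

  cp -r $SOLR_HOME/server/solr/configsets/_default $SOLR_HOME/server/solr/configsets/nutch
  cp $NUTCH_HOME/conf/schema.xml $SOLR_HOME/server/solr/configsets/nutch/conf/
  rm $SOLR_HOME/server/solr/configsets/nutch/conf/managed-schema
  $SOLR_HOME/bin/solr create -c nutch -d $SOLR_HOME/server/solr/configsets/nutch/conf/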

Re: IllegalArgumentException: No form exists: user-login-form

2019-07-10 Thread Sebastian Nagel
Let me try that. >> >> >> On Tue, Jul 9, 2019 at 10:15 AM Sebastian Nagel >> wrote: >> >>> Hi Ryan, >>> >>> there is one: >>> >>> >> action="/user/login" >>> method="post" id="

Re: IllegalArgumentException: No form exists: user-login-form

2019-07-09 Thread Sebastian Nagel
value="spid3r_us"/> >> >> >>> value="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_3) >> AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 >> Safari/537.36" >> /> &

Re: IllegalArgumentException: No form exists: user-login-form

2019-07-03 Thread Sebastian Nagel
Hi, the error message is quite clear:

> 2019-07-02 10:36:59,202 DEBUG httpclient.HttpFormAuthentication - No form element found with 'id' = user-login-form, trying 'name'.
> 2019-07-02 10:36:59,205 DEBUG httpclient.HttpFormAuthentication - No form element found with 'name' = user-login-form

Re: Scoring-similarity plugin for Nutch 2.3.1

2019-06-28 Thread Sebastian Nagel
Hi Gajanan, > Can the *scoring-similarity plugin* for Nutch 1.x be *modified* to run with > nutch 2.3.1? if yes, how? Eventually, yes. Have a look at the differences of another scoring filter plugin between 1.x and 2.x, and try to apply those to scoring-similarity. > Can somebody guide me on

Re: AW: Nutch 1.15 IndexWriter -- how to explicitly choose one?

2019-05-27 Thread Sebastian Nagel
leaving the rest in a common directory does the trick! > Being able to configure the file names would sure be nice but for now I don't > mind having separate directories. > > Felix > >> Von: Sebastian Nagel >> >> Hi Felix, >> >> assumed that every t

Re: AW: Nutch 1.15 not respecting robots=noindex?

2019-05-23 Thread Sebastian Nagel
ta) > > the parse metadata only contains "metatag.robots" while with this setup > > protocol-httpclient|parse-(html|metatags)|index-(metadata) > > the parse metadata contains both "metatag.robots" and "robots". > > Felix >

Re: Nutch 1.15 not respecting robots=noindex?

2019-05-22 Thread Sebastian Nagel
Hi Felix, I tried to reproduce the problem. The parse-metatags plugin only duplicates the "robots" metatag, adding it also as "metatag.robots" but keeping the original "robots". That is the case using the current master: - with parse-metatags and metatags.names="robots" the ParseData object

Re: Nutch 1.15 IndexWriter -- how to explicitly choose one?

2019-05-22 Thread Sebastian Nagel
Hi Felix, assuming that every test crawl runs on its own, not sharing resources with other test crawls (except the Nutch packages): you may just write a separate index-writers.xml for every test, place it in a separate directory and point NUTCH_CONF_DIR to this directory. This works only in local
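A sketch of that setup (directory names are placeholders):

  cp -r "$NUTCH_HOME/conf" /path/to/conf-test1
  # edit /path/to/conf-test1/index-writers.xml for this test
  export NUTCH_CONF_DIR=/path/to/conf-test1
  bin/crawl -i -s urls/ crawl-test1/ 1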

Re: Nutch NTLM to IIS 8.5 - issues!

2019-04-28 Thread Sebastian Nagel
Hi Michael, can you provide a patch or pull request for the upgrade? There is an issue open since long [1] but the available patches are reported to raise further issues (see issue comments). The challenge is indeed to test all the authentication options supported by protocol-httpclient

Re: Tracing crawled sites

2019-04-18 Thread Sebastian Nagel
Hi Ryan, you may have a look at the plugin scoring-depth. It tracks the depth (links away from one of the seeds) of a crawled page and could be modified to write also the parents (maybe only the first) into the CrawlDatum metadata. Best, Sebastian On 4/9/19 9:08 PM, Ryan Suarez wrote: >

Re: Nutch Rest Service Issues

2019-04-18 Thread Sebastian Nagel
Hi, in deploy mode there are usually also jars from the Hadoop installation in the classpath. That might cause the issue. Because the Hadoop job client communicates via HTTP with the other Hadoop components these conflicts are not easy to fix. You could try to build Nutch yourself adding

Re: OutOfMemoryError: GC overhead limit exceeded

2019-03-18 Thread Sebastian Nagel
7689 4698 > External: +48 123 42 0698 > Mobile: +48 723 680 278 > E-mail: hany.n...@hsbc.com  > __  > Protect our environment - please only print this if you have to! > > > -Original Message- > F

Re: how to find pages that are truly deleted/moved

2019-03-15 Thread Sebastian Nagel
e referring to?  > > Thanks > Srini > > On Thu, Mar 14, 2019 at 1:06 PM Sebastian Nagel <mailto:wastl.na...@googlemail.com>> wrote: > > > remove from index, but later we found that some valid pages (when we > curl > > them we get 200) are al

Re: OutOfMemoryError: GC overhead limit exceeded

2019-03-14 Thread Sebastian Nagel
Hi, if running in local mode, it's better passed via ENV to bin/nutch, cf.

  # Environment Variables
  #
  #   NUTCH_JAVA_HOME  The java implementation to use. Overrides JAVA_HOME.
  #
  #   NUTCH_HEAPSIZE   The maximum amount of heap to use, in MB.
  #                    Default is 1000.
  #
  #   NUTCH_OPTS
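For example (local mode only; the value is in MB and "$SEGMENT" is a placeholder):

  export NUTCH_HEAPSIZE=4000
  bin/nutch updatedb crawl/crawldb "$SEGMENT"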

Re: Nutch and HTTP headers

2019-03-11 Thread Sebastian Nagel
Hi, > Can Nutch index custom HTTP headers? Nutch stores the HTTP response headers if the property `store.http.headers` is true. The headers are saved as string concatenated by `\r\n` under the key `_response.headers_` in the content metadata. You can send the entire HTTP headers to the indexer
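A sketch of checking this with the indexchecker (routing the headers into an index field via index-metadata's index.content.md is an assumption, not from the original mail):

  bin/nutch indexchecker -Dstore.http.headers=true \
    -Dindex.content.md=_response.headers_ \
    -Dplugin.includes='protocol-http|parse-html|index-(basic|metadata)' \
    https://example.com/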

Re: JEXL and Exchanges

2019-03-06 Thread Sebastian Nagel
*Service-Disabled Veteran-Owned Small Business (SDVOSB)* > 763-323-3499 > dbeckst...@figleaf.com > > > On Tue, Mar 5, 2019 at 12:44 PM Sebastian Nagel > wrote: > >> Hi Dave, >> >> I'm by now means an expert of the JEXL syntax (cf. >> (http:

Re: JEXL and Exchanges

2019-03-05 Thread Sebastian Nagel
Hi Dave, I'm by no means an expert on the JEXL syntax (cf. http://commons.apache.org/proper/commons-jexl/reference/syntax.html) but after a few trials the expression must be doc.getFieldValue('url')=~'.*/englishnews/.*' It's easy to test using the indexchecker, e.g. % bin/nutch indexchecker

Re: Nutch "null chmod 0644" Error o Inject Attempt on Windows Through Cygwin

2019-02-21 Thread Sebastian Nagel
eleted the > lock file, and changed > the permissions to 755. Still getting on error (image attached).  > > ---- > *From:* Sebastian Nagel > *Sent:* Wednesday, February 20, 2019 3:57 PM >

Re: Nutch "null chmod 0644" Error o Inject Attempt on Windows Through Cygwin

2019-02-20 Thread Sebastian Nagel
Hi, > "chmod 655 " Shouldn't it be "755"? Otherwise the user is not allowed to list the content of the directory which will definitely cause an error. The user running Nutch is required to have "rwx" permissions in the "crawldb" folder and all its subfolders. >

Re: Nutch 1.15 runtime/local does not run in Standalone mode

2019-02-20 Thread Sebastian Nagel
o use the Nutch server and monitor the jobs > and their statuses? I will then delete the failed ones. > > > Regards > Ameer > > > > On Wed, Feb 20, 2019 at 8:58 PM Sebastian Nagel > wrote: > >> Hi Ameer, >> >> (bringing this back to user@nutch -

Re: Increasing the number of reducer in Deduplication

2019-02-20 Thread Sebastian Nagel
Hi Suraj, the correct syntax would be:

  __bin_nutch dedup -Dmapreduce.job.reduces=32 "$CRAWL_PATH"/crawldb

Hadoop configuration properties must be passed before the remaining arguments, and you need to pass them as -Dname=value. To confirm: I usually run the dedup job with 1200 reducers on a CrawlDb

Re: Nutch 1.15 runtime/local does not run in Standalone mode

2019-02-20 Thread Sebastian Nagel
s > being created in the > *tmp* directory. It also seems slow to me. > > Regards > Ameer > > > > On Wed, Feb 20, 2019 at 6:10 AM Sebastian Nagel <mailto:wastl.na...@googlemail.com>> wrote: > > Hi Ameer, > > yes, you're correct.  If lau

Re: Apache Nutch 2.3.1 not able to fetch content rendered by ajax

2018-12-21 Thread Sebastian Nagel
d. Thanks again. > > Thanks & Regards > Venkata MR > +91 98455 77125 > > -Original Message- > From: Venkata MR > Sent: 18 December 2018 16:40 > To: 'Sebastian Nagel' > Cc: user@nutch.apache.org > Subject: RE: Apache Nutch 2.3.1 not able to fetch conten

Re: nutch 1.15 index multiple cores with solr 7.5

2018-12-21 Thread Sebastian Nagel
Hi, Nutch loads all configuration files from the Java class path and picks the first file found on the class path (and ignores other files with the same name). If there are multiple crawls with different configurations, just place a crawl-specific configuration directory in front of the

Re: Unfetched URLs after TIME_LIMIT_FETCH

2018-12-18 Thread Sebastian Nagel
Yes. They don't get updated and stay in status db_unfetched and will be generated in the next cycle again. On 12/18/18 5:01 PM, Suraj Singh wrote: > Hello, > > I want to understand what happens to the URLs which remains Unfetched due to > fetch time limit. > Are they fetched in the subsequent

Re: Apache Nutch 2.3.1 not able to fetch content rendered by ajax

2018-12-17 Thread Sebastian Nagel
Hi, > protocol-httpclient (as the websites are with https). With Nutch 1.15 protocol-selenium supports https. If protocol-httpclient is also active, it may be used instead of protocol-selenium. There is no need to activate it, the description in nutch-default.xml needs to be fixed, see

Re: mapred.child.java.opts

2018-12-10 Thread Sebastian Nagel
l: +48 123 42 0698 > Mobile: +48 723 680 278 > E-mail: hany.n...@hsbc.com  > __  > Protect our environment - please only print this if you have to! > > -Original Message- > From: Sebastian Nagel [m

Re: mapred.child.java.opts

2018-12-07 Thread Sebastian Nagel
Hi, yes, of course, the comments just one line above even encourage you to do so:

  # note that some of the options listed here could be set in the
  # corresponding hadoop site xml param file

For most use cases this value is ok. Only if you're using a parsing fetcher with many threads you may

Re: URL filter rejecting the URLs

2018-12-03 Thread Sebastian Nagel
Hi, the pattern should work. Of course, you need to make sure that
- there are no other patterns coming earlier in regex-urlfilter.txt which cause the URL to be rejected
- there are no other active URL filter plugins which reject the URL
- the folder of the regex-urlfilter.txt you're editing

Re: Quality problems of crawling. Parsing(Missing attribute name), fetching(empty body) and javascript.

2018-11-19 Thread Sebastian Nagel
; > bin/nutch parsechecker -dumpText http://www.vialucy.nl/ > Parse Metadata: > > So, default one provides empty metadata and no error messages. This is a bit > confusing. > > Thanks. >

Re: unexpected Nutch crawl interruption

2018-11-19 Thread Sebastian Nagel
Hi Yossi, > I think in the case that you interrupt the fetcher, you'll have the problem > that URLs > that where scheduled to be fetched on the interrupted cycle will never be > fetched > (because of NUTCH-1842). Yes, but only if generate.update.crawldb is true which is not the case by

Re: Quality problems of crawling. Parsing(Missing attribute name), fetching(empty body) and javascript.

2018-11-15 Thread Sebastian Nagel
> Is there any reasons to keep the default HTML plugin there? only for > maintenance ? >   > Semyon.  > > Sent: Thursday, November 15, 2018 at 2:23 PM > From: "Sebastian Nagel" > To: user@nutch.apache.org > Subject: Re: Quality problems of crawling. Parsing(Mi

Re: Quality problems of crawling. Parsing(Missing attribute name), fetching(empty body) and javascript.

2018-11-15 Thread Sebastian Nagel
Hi Semyon, I've tried to reproduce your problems using the recent Nutch master (upcoming 1.16). I cannot see any issues, except that Javascript is not executed but that's clear. Of course, you are free to use parse-tika instead of parse-html which is legacy. See results below. Best, Sebastian

Re: Wordpress.com hosted sites fail org.apache.commons.httpclient.NoHttpResponseException

2018-11-14 Thread Sebastian Nagel
Hi Nicholas, looks like it's the user-agent string sent in the HTTP header which makes the server return no/empty content. bin/nutch parsechecker \ -Dhttp.agent.name="mytestbot" \ -Dhttp.agent.version=3.0 \ -Dhttp.agent.url=http://example.com/ https://whatdavidread.ca/ Obviously, the

Re: After upgrading Mac OS to Mojave 10.14, Nutch is trying to inject from the .DS_Store file inside its seed folder.

2018-10-29 Thread Sebastian Nagel
Hi, thanks for the problem report. However, I would argue not to handle such specific cases inside Nutch: it makes the Nutch code extremely complex and requires extra effort to stay portable among operating systems. Why not just make the file invisible again? Or if this isn't possible: - write

Re: Nutch 1.15: crawling single web page resulting in crawldb-DB_UNFETCHED counter decreasing until 0

2018-10-22 Thread Sebastian Nagel
Hi Marco, did you increase http.content.limit? The default is 64 kB, saturn.de pages are much larger, and it may happen that the first 64 kB always contain the same set of navigation links (linking to product categories here). Feel free to open an issue on

Re: Connect Solr and Nutch in Ubuntu 18

2018-10-05 Thread Sebastian Nagel
Hi Timeka, > because Solr is missing the > files from its packet for it to work. There are many Solr versions available and it easily may happen that the description in the Wiki is outdated or not applicable for your combination of Nutch and Solr. Please try to give as much information as

Re: Regex to block some patterns

2018-10-05 Thread Sebastian Nagel
Hi Amarnath, the only possibility is that https://www.abc.com/ is skipped - by another rule in regex-urlfilter.txt - or another URL filter plugin Please check your configuration carefully. You may also use the tool bin/nutch filterchecker to test the filters beforehand: every active filter
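For example (flag names vary slightly between Nutch versions; the URL is a placeholder):

  echo 'https://www.abc.com/' | bin/nutch filterchecker -allCombined -stdin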

Re: Nutch integration with Solr

2018-10-01 Thread Sebastian Nagel
Wiki > where is says to install Solr I don't understand the directions given that > lead up to creating a nutch core..how do I copy resources and manage > schema,etc..the breakdown confuses me.. Thank you again > > Timeka > > On Mon, Oct 1, 2018, 7:12 AM Sebastian Nagel >

Re: Nutch integration with Solr

2018-10-01 Thread Sebastian Nagel
Hi Timeka, well, the really short answer is: Nutch sends "documents" to Solr using the SolrJ client library. A "document" is a single web page fetched, parsed and split into indexable fields, e.g., "title", "keywords", "content". For further information you may look into

Re: Include parent URL in pdf data - nutch

2018-09-28 Thread Sebastian Nagel
Hi, could you explain in detail what is meant by "parent URL"?
- the page the PDF document is linked from
- a redirect pointing to the PDF doc
- the "directory" of the PDF URL (clip URL after last "/")
- ...
Nutch indexes all successfully fetched pages but not redirects, 404s, etc. Of course,

Re: crwal and index ppt,msword,excel(xls,.xlsx) in apache nutch 1.14

2018-09-10 Thread Sebastian Nagel
Hi, crawling and indexing Office documents should work out-of-the-box without any configuration changes, the plugin parse-tika is enabled by default in recent Nutch versions. The only recommended change is to increase the content limit:

  <property>
    <name>http.content.limit</name>
    <value>65536</value>
    <description>The length limit for
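To see the effect without editing nutch-site.xml, the limit can be overridden ad hoc (-1 disables truncation; the URL is a placeholder):

  bin/nutch parsechecker -Dhttp.content.limit=-1 https://example.com/report.xlsx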

Re: IndexWriter interface in 1.15

2018-09-09 Thread Sebastian Nagel
Hi Yossi, hi Lewis, actually, this is caused by a change of the IndexWriter interface as part of NUTCH-1480 (multiple index writers of same type). It's reported as a breaking change, but only in the sense that the way the index writers are configured has changed. Sorry, we missed adding a note

Re: bin/crawl not working

2018-08-15 Thread Sebastian Nagel
Hi, please also note that the way the index writer plugins are configured has changed with 1.15, see release notes and https://wiki.apache.org/nutch/bin/nutch%20index. The Solr URL cannot be passed anymore via -Dsolr.server.url=... I'll update the bin/crawl wiki page. Thanks, Sebastian On

[ANNOUNCE] Apache Nutch 1.15 Release

2018-08-10 Thread Sebastian Nagel
The Apache Nutch [0] Project Management Committee are pleased to announce the immediate release of Apache Nutch v1.15. We advise all current users and developers of the 1.X series to upgrade to this release. Nutch is a well matured, production ready Web crawler. Nutch 1.x enables fine grained

[RESULT] was [VOTE] Release Apache Nutch 1.15 RC#1

2018-08-07 Thread Sebastian Nagel
Hi Folks, thanks to everyone who was able to review the release candidate! 72 hours have passed, please see below for vote results. [4] +1 Release this package as Apache Nutch 1.15 Roannel Fernández Hernández * Govind Nitk Markus Jelsma * Sebastian Nagel * [0] -1 Do not release

Re: [VOTE] Release Apache Nutch 1.15 RC#1

2018-08-01 Thread Sebastian Nagel
Hi Markus

> 2018-08-01 11:42:10,660 INFO fetcher.FetcherThread - FetcherThread 47 fetching https://en.wikipedia.org/wiki/Special:RecentChanges (queue crawl delay=5000ms)

Ok, non-blocking because of:

  User-agent: *
  Disallow: /wiki/Special:

> 2018-08-01 11:42:10,660 INFO fetcher.FetcherThread

Re: A couple of basic questions re scheduled crawls.

2018-07-26 Thread Sebastian Nagel
Hi Fred, as soon as you generate the fetch list (if you call bin/crawl this is done) and the CrawlDb contains at this time items with a (re)fetch date in the past, you'll get a non-empty fetch list and Nutch will (re)fetch those pages. You always have to call bin/crawl explicitly. Of course,
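A typical way to schedule this is a cron entry calling bin/crawl (paths and schedule are placeholders):

  # crontab entry: one crawl round every night at 03:00
  0 3 * * * cd /opt/nutch && bin/crawl -i -s urls/ crawl/ 1 >> logs/crawl-cron.log 2>&1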

[VOTE] Release Apache Nutch 1.15 RC#1

2018-07-26 Thread Sebastian Nagel
Hi Folks, A first candidate for the Nutch 1.15 release is available at: https://dist.apache.org/repos/dist/dev/nutch/1.15/ The release candidate is a zip and tar.gz archive of the binary and sources in: https://github.com/apache/nutch/tree/release-1.15 The SHA1 checksum of the archive

Re: Crawling/Indexing Issue on Dev and staging Sever Urls

2018-07-23 Thread Sebastian Nagel
2018-07-20 09:46:20,809 INFO solr.SolrMappingReader - source: boost >> dest: >>> boost >>> 2018-07-20 09:46:20,809 INFO solr.SolrMappingReader - source: digest >> dest: >>> digest >>> 2018-07-20 09:46:20,809 INFO solr.SolrMappingReader - source: tsta

Re: Crawling/Indexing Issue on Dev and staging Sever Urls

2018-07-23 Thread Sebastian Nagel
; org.apache.http.impl.conn.PoolingClientConnectionManager.requestConnection(PoolingClientConnectionManager.java:184) > at > org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:415) > at > org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:863) > a

Re: Crawling/Indexing Issue on Dev and staging Sever Urls

2018-07-20 Thread Sebastian Nagel
Hi, > * Changed my regex-filter to use development domain address. Did you also change your seeds? The fact that deletions are sent but not additions/updates suggests that no pages have been successfully crawled. Could you specify the Nutch version used and also attach some log snippets to

[ANNOUNCE] New Nutch committer and PMC - Roannel Fernández Hernández

2018-06-26 Thread Sebastian Nagel
Dear all, it is my pleasure to announce that Roannel Fernández Hernández has joined us as a committer and member of the Nutch PMC. Recently, Roannel contributed a long list of improvements related to the indexer plugins: a new indexer for RabbitMQ, the possibility to index into multiple

Re: NoClassDefFoundError

2018-06-25 Thread Sebastian Nagel
Hi Robert, why not switch on boilerpipe for parse-tika?

  <property>
    <name>tika.extractor</name>
    <value>none</value>
    <description>Which text extraction algorithm to use. Valid values are: boilerpipe or none.</description>
  </property>

  <property>
    <name>tika.extractor.boilerpipe.algorithm</name>
    <value>ArticleExtractor</value>
    <description>Which Boilerpipe algorithm to use. Valid values are:
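A quick way to try it without editing any config file (a sketch; the URL is a placeholder):

  bin/nutch parsechecker -Dtika.extractor=boilerpipe \
    -Dtika.extractor.boilerpipe.algorithm=ArticleExtractor \
    -Dplugin.includes='protocol-http|parse-tika' https://example.com/article.html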

[ANNOUNCE] New Nutch committer and PMC - Omkar Reddy

2018-06-21 Thread Sebastian Nagel
Dear all, it is my pleasure to announce that Omkar Reddy has joined us as a committer and member of the Nutch PMC. Omkar has worked on upgrading Nutch to use the new MapReduce API as part of his Google Summer of Code project last year. Thanks, Omkar, and congratulations on your new role within

Re: Blacklisting TLDs

2018-06-17 Thread Sebastian Nagel
Hi Michael, on the Common Crawl Nutch fork there is a plugin "fast-urlfilter" which does this, see https://github.com/commoncrawl/nutch/blob/cc-fast-url-filter/src/plugin/urlfilter-fast/src/java/org/apache/nutch/urlfilter/fast/FastURLFilter.java It uses exactly this concept of "domain", i.e.,

Re: Opengraph metadata was [MASSMAIL]Re: Preparing to release Nutch 1.15 ?

2018-06-12 Thread Sebastian Nagel
definition in nutch-default.xml On 06/12/2018 02:26 PM, BlackIce wrote: > PS: Does this work when configured in site.xml like regular metatdata? > > On Tue, Jun 12, 2018 at 1:31 PM BlackIce wrote: > >> sweet thnx! >> >> On Tue, Jun 12, 2018 at 1:29 PM Sebastian Nage

Re: Opengraph metadata was [MASSMAIL]Re: Preparing to release Nutch 1.15 ?

2018-06-12 Thread Sebastian Nagel
t; ++1! >>> >>> >>> >>> Sounds great. >>> >>> >>> >>> Cheers, >>> >>> Chris >>> >>> >>> >>> >>> >>> >>> >>> >>> >>> From: Sebastian Nagel >>>

Preparing to release Nutch 1.15 ?

2018-06-11 Thread Sebastian Nagel
crawl on Hadoop mid of this week. But any help in testing is welcome. Note that the tutorial needs to be updated (will be done after 1.15 is finally released) to reflect the changes related to NUTCH-1480. Thanks, Sebastian [1] https://github.com/sebastian-nagel/nutch-test-single-node-cluster [2

Re: Sitemap URL's concatenated, causing status 14 not found

2018-06-07 Thread Sebastian Nagel
Hi Markus, ok, no problem. Done: https://github.com/crawler-commons/crawler-commons/issues/213 Sebastian On 06/07/2018 12:21 AM, Markus Jelsma wrote: > Sebastian, I do not want to be a pain in the arse, but I don't have a > Github account. If you would do the honours of opening a

Re: Sitemap URL's concatenated, causing status 14 not found

2018-05-29 Thread Sebastian Nagel
> I agree that this is not the ideal error behaviour, but I guess the code > was written from the assumption that the document is valid and conformant. Over time the crawler-commons sitemap parser has been extended to get as much as possible from non-conforming sitemaps as well. Of course,

Re: Nutch 1.14 not crawling all links?

2018-05-10 Thread Sebastian Nagel
Hi Bob, it's impossible to make any diagnostics without the full log files, the complete configuration, and a detailed description of what is missing. It could be a bug, of course. But it's more likely a configuration issue, you should check the log files. Also have a look at: - the robots.txt of the

Re: Having plugin as a separate project

2018-05-04 Thread Sebastian Nagel
That's trivial. Just run ant in the plugin's source folder:

  cd src/plugin/urlnormalizer-basic/
  ant

or to run also the tests:

  cd src/plugin/urlnormalizer-basic/
  ant test

Note: you have to compile the core test classes first by running

  ant compile-core-test

in the Nutch "root" folder.

Re: Nutch fetching times out at 3 hours, not sure why.

2018-04-30 Thread Sebastian Nagel
nore it unless it causes a problem for my other cores. > > Chip > > -Original Message- > From: Sebastian Nagel [mailto:wastl.na...@googlemail.com] > Sent: Monday, April 30, 2018 12:21 PM > To: user@nutch.apache.org > Subject: Re: Nutch fetching times out at 3 hou

Re: Nutch fetching times out at 3 hours, not sure why.

2018-04-30 Thread Sebastian Nagel
Hi, if you still see the log message

  fetcher.FetchItemQueues - * queue: https://history.aip.org >> dropping!

then it can be only
- fetcher.timelimit.mins
- fetcher.max.exceptions.per.queue

> I crawl a list of roughly 2600 URLs all on my local server

If this is the case you can crawl more
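Both can be disabled by setting them to -1, e.g. (a sketch; "$SEGMENT" is a placeholder):

  bin/nutch fetch -Dfetcher.timelimit.mins=-1 \
    -Dfetcher.max.exceptions.per.queue=-1 "$SEGMENT" -threads 10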

Re: spilled records from reducer

2018-04-13 Thread Sebastian Nagel
Hi Michael, > reducer spills a lot of records The job counter "Spilled Records" is not for the reducers alone. > 255K input records Does your CrawlDb only contain 250,000 entries? Also, how many hosts (resp. domains/ips depending on partition.url.mode) are in the CrawlDb? Note: the counts per

Re: how do fetch wait times work?

2018-04-11 Thread Sebastian Nagel
Hi Fred, Nutch does nothing "proactively", the crawl jobs must be explicitly called. But you need no special command: - let's say you didn't change the defaults and db.fetch.interval.default == 30 days - if you launch bin/crawl one month later, all pages are refetched, and optionally
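For example, to refetch weekly instead (604800 seconds = 7 days; an illustrative value, not from the original mail):

  bin/crawl -i -D db.fetch.interval.default=604800 -s urls/ crawl/ 1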

Re: BinaryContent or Base64 Options

2018-03-27 Thread Sebastian Nagel
Hi Eric, the ability to add binary content was implemented in Nutch 1.11, you need to upgrade (an upgrade to 1.14 is recommended). The command-line help of $NUTCH_HOME/bin/nutch index indicates how to add a Solr field with the "binary" HTML content: Usage: Indexer ... [-addBinaryContent]
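A sketch of the resulting call (1.14-style; the Solr URL and paths are placeholders):

  bin/nutch index -Dsolr.server.url=http://localhost:8983/solr/nutch \
    crawl/crawldb -linkdb crawl/linkdb "$SEGMENT" -addBinaryContent -base64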

Re: how could I identify obsolete segments?

2018-03-23 Thread Sebastian Nagel
Hi Michael, when segments are merged only the most recent record of one URL is kept. Sebastian On 03/23/2018 09:25 PM, Michael Coffey wrote: > Greetings Nutchlings, > > How can I identify segments that are no longer useful, now that I have been > using AdaptiveFetchSchedule for several

Re: Nutch 1.11 SSLHandshakeException

2018-03-20 Thread Sebastian Nagel
> > > On Tue, Mar 20, 2018 at 3:31 AM, Sebastian Nagel <wastl.na...@googlemail.com >> wrote: > >> Hi Robert, >> >> unfortunately, I'm not able to reproduce the problem. >> Fetching works with the recent 1.x and Java 8, I've tried both: >> >>

Re: Is there any way to block the hubpages while crawling

2018-03-20 Thread Sebastian Nagel
Hi, > more control over what is being indexed? It's possible to enable URL filters for the indexer: bin/nutch index ... -filter With little extra effort you can use different URL filter rules during the index step, e.g. in local mode by pointing NUTCH_CONF_DIR to a different folder. >> I
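A sketch of the local-mode variant (paths are placeholders; with 1.14 you would also pass -Dsolr.server.url=...):

  export NUTCH_CONF_DIR=/path/to/index-conf   # holds a stricter regex-urlfilter.txt
  bin/nutch index crawl/crawldb -dir crawl/segments -filter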

Re: Nutch 1.11 SSLHandshakeException

2018-03-20 Thread Sebastian Nagel
Hi Robert, unfortunately, I'm not able to reproduce the problem. Fetching works with the recent 1.x and Java 8, I've tried both: bin/nutch parsechecker -Dplugin.includes='protocol-http|parse-html' https://potomac.edu/ bin/nutch parsechecker

Re: Fetcher error when running on Amazon EMR with S3

2018-03-16 Thread Sebastian Nagel
Hi John, the recent master has seen an upgrade to the new MapReduce API (NUTCH-2375), it was a huge change which is already known to have introduced some issues. For production it's recommended to use 1.14 and if necessary patch it. Could you open a new issue on

Re: Reg: URL Near Duplicate Issues with same content

2018-03-15 Thread Sebastian Nagel
Hi Shiva,
1. you can define URL normalizer rules to rewrite the URLs, but this only works for sites where you know which URL is the canonical form.
2. you can deduplicate (command "nutch dedup") based on the content checksum: the duplicates are still crawled but deleted afterwards. It's

Re: UrlRegexFilter is getting destroyed for unrealistically long links

2018-03-12 Thread Sebastian Nagel
gt; seems like I am not able to reopen a closed/resolved issue. Sorry... > >> -Original Message- >> From: Sebastian Nagel <wastl.na...@googlemail.com> >> Sent: 12 March 2018 17:39 >> To: user@nutch.apache.org >> Subject: Re: UrlRegexFilter is gett

Re: UrlRegexFilter is getting destroyed for unrealistically long links

2018-03-12 Thread Sebastian Nagel
> example. The only other place I can think of where this may be needed is > after redirect. > This is pretty much the same as what Semyon suggests, whether we push it down > into the filterNormalize method or do it before calling it. > > Yossi. > >> -Orig

Re: UrlRegexFilter is getting destroyed for unrealistically long links

2018-03-12 Thread Sebastian Nagel
Subject: Re: UrlRegexFilter is getting destroyed for unrealistically long >> links >> Some regular expressions (those with backtracking) can be very expensive for >> long strings >> >> https://regular-expressions.mobi/catastrophic.html?wlr=1 >> >> Maybe that

Re: UrlRegexFilter is getting destroyed for unrealistically long links

2018-03-12 Thread Sebastian Nagel
Good catch. It should be renamed to be consistent with other properties, right? On 03/12/2018 01:10 PM, Yossi Tamari wrote: > Perhaps, however it starts with db, not linkdb (like the other linkdb > properties), it is in the CrawlDB part of nutch-default.xml, and LinkDB code > uses the property

Re: UrlRegexFilter is getting destroyed for unrealistically long links

2018-03-12 Thread Sebastian Nagel
Hi Yossi, it's used in FetcherThread and ParseOutputFormat: git grep -F db.max.outlinks.per.page However, it does not limit the length of a single outlink in characters but the number of outlinks followed (added to CrawlDb). There was NUTCH-1106 to add a property to limit the outlink length.
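For example (an illustrative override; the default is 100 and -1 means unlimited; "$SEGMENT" is a placeholder):

  bin/nutch parse -Ddb.max.outlinks.per.page=200 "$SEGMENT"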

Re: dealing with redirects from http to https

2018-03-09 Thread Sebastian Nagel
> Another problem is that they have fetch_time well into the future,
> I guess because retry_interval is applied.

Correct. Fetch time is
- the time when to fetch next for a CrawlDatum in the CrawlDb
- the time when the fetch happened for those in the segment's crawl_fetch folder

On 03/09/2018 11:04 PM,

Re: dealing with redirects from http to https

2018-03-09 Thread Sebastian Nagel
> What is the best way to handle this, in general? I am thinking of specifying > http.redirect.max=1 (rather than the default 0) in nutch-site.xml because I want it to fetch these pages right away, rather than waiting until the next cycle. Of course, you can do this. But keep in mind: if both,
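For example (a one-off override; "$SEGMENT" is a placeholder):

  bin/nutch fetch -Dhttp.redirect.max=1 "$SEGMENT"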
