The Apache Nutch Project https://nutch.apache.org/download/
Please verify signatures using the KEYS file
https://raw.githubusercontent.com/apache/nutch/master/KEYS when downloading
the release.
This release includes more than 60 bug fixes and improvements; the full
list of changes can be seen in
ttee-binding
The Nutch 1.20 release candidate has passed the community VOTE. I will
therefore promote this release candidate.
Thanks for VOTE’ing and to everyone who contributed to the Apache Nutch
1.20 release.
lewismc
On Tue, Apr 9, 2024 at 2:28 PM lewis john mcgibbney
wrote:
> H
Hi Sheham,
On 2024/04/20 08:47:41 Sheham Izat wrote:
> The Fetcher job was aborted, does that still mean that it went through the
> entire list of seed urls?
Yes it processed the entire generated segment but the fetcher…
* hung on https://disneyland.disney.go.com/, https://api.onlyoffice.com/,
Hi Sheham,
On 2024/04/19 15:18:01 Sheham Izat wrote:
>
> My questions are:
>
> 1) What do I need to do to get Nutch to continue working even if there are
> hung threads?
From what I can see in the log you provided, nothing is preventing Nutch from
continuing to work. The Fetcher job
Hi user@, dev@,
Please consider reviewing the Nutch 1.20 release candidate. This is a
critical prerequisite for us making releases of software at TheASF.
Thank you
lewismc
On Tue, Apr 9, 2024 at 2:28 PM lewis john mcgibbney
wrote:
> Hi Folks,
>
> A first candidate for the Nutch 1.2
Hi Folks,
A first candidate for the Nutch 1.20 release is available at [0] where
accompanying SHA512 and ASC signatures can also be found.
Information on verifying releases can be found at [1].
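
As an illustration of the SHA512 part of that check, here is a minimal Python sketch. It assumes the published checksum file uses the `sha512sum` layout (hex digest, then the file name); the function names are hypothetical, and the ASC signature still has to be verified separately with GnuPG against the KEYS file.

```python
import hashlib


def sha512_of(path, chunk_size=1 << 20):
    """Return the hex SHA-512 digest of the file at `path`, read in chunks."""
    digest = hashlib.sha512()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_digest(archive_path, published_text):
    """Compare a local archive with the text of its published checksum file.

    Assumption: the checksum file follows the `sha512sum` layout, i.e. the
    first whitespace-separated token is the hex digest.
    """
    tokens = published_text.split()
    if not tokens:
        return False
    return tokens[0].lower() == sha512_of(archive_path)
```

A mismatch means the download should not be trusted; a match only shows the archive is intact, not who produced it, which is what the ASC signature is for.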
The release candidate comprises a .zip and tar.gz archive of the sources at
[2] and complementary
Hi user@ & dev@,
I decided to write up a GSoC’24 proposal and encourage interested
applicants to register your interest in the JIRA issue or else reach
out to the Nutch PMC over on d...@nutch.apache.org (please CC
lewi...@apache.org).
Title: Overhaul the legacy Nutch plugin framework and replace
+1 Tim.
On Wed, Sep 13, 2023 at 16:50
>
>
>
> -- Forwarded message --
> From: Tim Allison
> To: user@nutch.apache.org, d...@nutch.apache.org
> Cc:
> Bcc:
> Date: Wed, 13 Sep 2023 10:50:08 -0400
> Subject: [DISCUSS] Removing Any23 from Nutch?
> All,
> I opened
Hi Mike,
Yes it is possible to extend the TLD list. In fact, when the TLD list was
compiled the author left a note explicitly stating that it may not be
complete.
https://github.com/apache/nutch/blob/master/conf/domain-suffixes.xml.template
Please submit a PR if you wish to make any changes or
I created https://issues.apache.org/jira/browse/NUTCH-2931 to track all of
this work.
If you are interested in working on any of this it would be great to
collaborate.
There is much more we can do over and above the few tickets I created.
lewismc
On 2021/12/24 10:07:20 sw.l...@quandatics.com
Hi Shi Wei,
I missed this thread over the holidays!
Which version of Nutch are you using?
The REST API needs quite a bit of attention. It is not a particularly mature
aspect of the Nutch codebase and there is a catalog of issues which need to
be addressed.
If you are interested in learning
Hi user@, dev@,
I took the liberty of setting up a #nutch channel for our community to
communicate in a lower latency manner.
First join the-asf.slack.com Slack workspace
https://infra.apache.org/slack.html
Then simply join the #nutch channel.
See you there :)
Thanks
lewismc
--
Hi Roseline,
Looks like you are ignoring external URLs… that could be the problem right
there.
I encourage you to track counters on inject, generate and fetch phases to
understand where records may be being dropped.
Are the seeds you are using public? If so please post your seed file so we
can
Hi Clark,
This is a lot of information... thank you for compiling it all.
Ideally the version of Hadoop being used with Nutch should ALWAYS match the
hadoop binaries referenced in
https://github.com/apache/nutch/blob/master/ivy/ivy.xml. This way you won't run
into the classpath issues.
I would
nUpCycle: 5000
> gridTimeout: 360
> gridBrowserTimeout: 120
> gridMaxSession: 5
> gridUnregisterIfStillDownAfter: 60
> chrome:
> enabled: true
> image: "selenium/node-chrome"
> tag: "3.141.59"
> replicas: 60
> nodeMaxSession: 5
> nodeRegistr
Hi user@,
Are you interested in the Nutch Dockerfile? If so, keep reading.
We are looking for some assistance to test proposed additions to the Nutch
Dockerfile.
Essentially the changes would facilitate installing and running the Nutch REST
server and/or the Nutch WebApp in addition to the Nutch
Hi Abhay,
On 2021/06/10 22:27:42, Abhay Ratnaparkhi wrote:
>
> Based on selenium I created a microservice (which handles all required SSO
> redirections/ OTP handlings etc) and hosted that with a selenium grid in
> the kubernetes cluster for scaling.
> I found that we couldn't scale this
, 2021 at 17:36 wrote:
>
> user Digest 13 Jun 2021 00:36:36 - Issue 3108
>
> Topics (messages 34633 through 34634)
>
> Re: Apache Nutch help request for a school project :)
> 34633 by: lewis john mcgibbney
>
> Re: Crawling pages behind SSO authentication
> On 2021-06-07 20:08, lewis john mcgibbney wrote:
> > Yep Sebastian is absolutely correct. I sent you a pull request.
> >
> > https://github.com/gorkemyontem/nutch/pull/1
> > HTH
> > lewismc
> >
> > On Mon, Jun 7, 2021 at 6:18 AM lewis john mcgibbney
>
Hi Abhay,
This is a problem space we looked at a while ago and made quite a bit of
progress on.
Firstly, the protocol-httpclient plugin has been considered in a deprecated
state for a while.
https://github.com/apache/nutch/tree/master/src/plugin/protocol-httpclient
I'm pretty sure that it will
Yep Sebastian is absolutely correct. I sent you a pull request.
https://github.com/gorkemyontem/nutch/pull/1
HTH
lewismc
On Mon, Jun 7, 2021 at 6:18 AM lewis john mcgibbney
wrote:
> I’ll have a look today. You can always use the mailing list as well. Feel
> free to post your que
I’ll have a look today. You can always use the mailing list as well. Feel
free to post your questions there and we will help you out :)
On Sun, Jun 6, 2021 at 12:43 gokmen.yontem
wrote:
> Hi Lewis,
> Sorry to bother you. I've been trying to configure Apache Nutch for
> almost 10 days now and
Some interesting content for a short read :)
https://www.seroundtable.com/duplexweb-google-bot-31522.html?utm_source=search_engine_roundtable_campaign=ser_newsletter_2021-06-03_medium=email
--
http://home.apache.org/~lewismc/
http://people.apache.org/keys/committer/lewismc
Hi Sebastian,
If we did not know how long our crawl infrastructure was required for (i.e.
the customer may revoke or extend the contract with very little notice) we
always chose AWS EMR. Specifically to reduce costs we made sure that all
worker/task nodes were run on spot instances
Hi Prateek,
mapred.map.tasks -->mapreduce.job.maps
mapred.reduce.tasks -->mapreduce.job.reduces
You should be able to override these in nutch-site.xml then publish to your
Hadoop cluster.
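
For illustration only, a minimal Python sketch that renders such an override in the usual Hadoop `<configuration>`/`<property>` layout used by nutch-site.xml. The helper name and the task counts are hypothetical placeholders, not recommended values.

```python
import xml.etree.ElementTree as ET

# Deprecated Hadoop property names and the replacements named above.
RENAMED = {
    "mapred.map.tasks": "mapreduce.job.maps",
    "mapred.reduce.tasks": "mapreduce.job.reduces",
}


def site_overrides(settings):
    """Build a <configuration> element, translating deprecated names."""
    root = ET.Element("configuration")
    for name, value in settings.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = RENAMED.get(name, name)
        ET.SubElement(prop, "value").text = str(value)
    return root


# Placeholder task counts, chosen only to make the example concrete.
overrides = site_overrides({"mapred.map.tasks": 8, "mapred.reduce.tasks": 4})
ET.indent(overrides)  # pretty-printing helper, requires Python 3.9+
print(ET.tostring(overrides, encoding="unicode"))
```

The printed `<property>` elements can then be merged into the existing nutch-site.xml by hand.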
lewismc
On 2021/05/07 15:18:38, prateek wrote:
> Hi,
>
> I am trying to crawl URLs
Hi Seb,
Really interesting. Thanks for the response. Below
On 2021/05/05 11:42:04, Sebastian Nagel
wrote:
>
> Yes, but not directly - it's a multi-step process.
As I expected ;)
>
> This Parquet index is optimized by sorting the rows by a special form of the
> URL [1] which
> - drops
Hi user@,
Has anyone experimented/accomplished either
1) writing Nutch data directly as Parquet format, or
2) post-processing (Nutch) Hadoop sequence data by converting it to Parquet
format?
Thank you
lewismc
*What?*
The Apache Nutch team is pleased to announce the release of Apache Nutch
v1.18.
Nutch is a well-matured, production-ready Web crawler. Nutch 1.x enables
fine-grained configuration, relying on Apache Hadoop™ data structures.
*Where?*
Source and binary distributions are available for
user@, dev@,
The 72hr VOTE'ing period has elapsed. The RESULT's are as follows
[5] +1 Release this package as Apache Nutch 1.18.
Lewis John McGibbney*
Ralf Kotowski*
Jorge Luis Betancourt Gonzalez*
Sebastian Nagel*
Shashanka Balakuntala Srinivasa*
[0] -1 Do not release this package because
Hi Folks,
A first candidate for the Nutch 1.18 release is available at [0] where
accompanying SHA512, ASC and MD5 signatures can also be found.
Information on verifying releases can be found at [1].
The release candidate is a .zip and tar.gz archive of the sources in [2]
In addition, a staged
Hi Prateek,
On 2021/01/19 15:58:29, prateek wrote:
> Is the only other option is to
> override HtmlParseFilter and add a new plugin?
Yes I think it is.
>
> Also regarding separate objects, what i meant is if i store the image links
> in Outlink, then those links will also be stored in DB
Hi prateek,
Please see my comment inline below
On Thu, Jan 14, 2021 at 6:39 AM wrote:
>
> One of the requirements I have is to extract all
> the image and video links from the html in a separate object. Since I have
> the html content, I can use a library like jsoup to parse the content and
>
Hi Moldable,
Thanks for taking the time to write. Please see my responses inline below
On Wed, Sep 23, 2020 at 10:38 PM Moldable wrote:
> Hi,
>
> Sorry for the email sent randomly to you. I am trying to report a tiny
> issue in the nutch documentation and I can't figure out how to do it
>
Hi Gajalakshmi.G,
Firstly, it's important for me to state that Nutch 2.X is deprecated. No
more development is happening on the 2.X branch.
That being said, please see my comments inline below
On Thu, Sep 17, 2020 at 7:45 AM wrote:
>
> I am using Nutch 2.4 with Hadoop 3.1.1
To the best of my
Hi Jim,
Response below
On 2020/06/06 14:23:24, Jim Anderson wrote:
>
> I cannot find a download for Nutch 1.17. Is Nutch 1.17 available for
> download? If so, can someone please give me a pointer.
>
Nutch 1.17 is the current master branch, i.e. in development, meaning that there is
no official
Hi Seb,
Go for it. I’ll happily review.
Excellent work folks... really excellent work.
lewismc
On Wed, Apr 22, 2020 at 23:27 wrote:
>
> user Digest 23 Apr 2020 06:27:46 - Issue 3055
>
> Topics (messages 34517 through 34517)
>
> [DISCUSS] Release 1.17 ?
> 34517 by: Sebastian Nagel
>
Title: Nutch 2.3.1 affected by downstream dependency CVE-2016-6809
Vulnerable Versions: 2.3.1 (1.16 is not vulnerable)
Disclosure date: 2018-10-22
Credit: Pierre Ernst, Salesforce
Summary: Remote Code Execution in Apache Nutch 2.3.1 when crawling web site
containing malicious content
I've opened
>https://issues.apache.org/jira/browse/NUTCH-2741
> to remove it.
>
> Best,
> Sebastian
>
> On 28.09.19 17:54, lewis john mcgibbney wrote:
> > Hi Seb,
> >
> > On Thu, Sep 26, 2019 at 4:37 AM
> wrote:
> >
> >> From: Sebastian Na
Hi Seb,
On Thu, Sep 26, 2019 at 4:37 AM wrote:
> From: Sebastian Nagel
> To: user@nutch.apache.org
> Cc: d...@nutch.apache.org
> Bcc:
> Date: Tue, 24 Sep 2019 11:54:48 +0200
> Subject: [VOTE] Release Apache Nutch 2.4 RC#1
> Hi Folks,
>
> A first candidate for the Nutch 2.4 release is available
Hi Folks,
I've implemented what Dave suggested... it is clean and easy but it may
not be quite as ad-hoc-capable as one would always want. For my use cases it
was acceptable.
More responses inline
On Thu, Sep 19, 2019 at 2:47 PM wrote:
> From: Jorge Betancourt
> To: user@nutch.apache.org
> Cc:
Hi user@ and dev@,
If you are a student and would like to tackle the task of Mavenizing the
Nutch master build please get in touch with me here, directly or comment on
the following issue
https://issues.apache.org/jira/plugins/servlet/mobile#issue/NUTCH-2292
Thank you
Lewis
--
Hi Hany,
Yes the parameter is set to 1GB by default but it should also be noted that
this configuration key is actually deprecated as of some time ago. Seeing as we
are using the 'new' MapReduce API, I suspect we should use
`mapreduce.map.java.opts` and `mapreduce.reduce.java.opts` instead so
Hi Tukang,
In short yes. It would help if you could provide an example of what you've
tried and what you encountered/what your results were.
Lewis
On Mon, Dec 3, 2018 at 6:42 PM wrote:
>
> From: tkg_cangkul
> To: user@nutch.apache.org
> Cc:
> Bcc:
> Date: Tue, 04 Dec 2018 09:40:47 +0700
>
Hi Marcello,
I don't think this is correct no!
First, however, I really suggest that we upgrade the Jest client in this
plugin. The most recent one is 6.3.1 and we are using 2.0.3.
Please see https://issues.apache.org/jira/browse/NUTCH-2677, if you are
able to provide a patch and test it out it
Hi Gajanan,
Response inline
On 2018/10/12 07:40:50, Gajanan Watkar wrote:
> Hi all,
> I am using Nutch 2.3.1 with Hbase-1.2.3 as storage backend on top of
> Hadoop-2.5.2 cluster in *deploy mode* with crawled data being indexed to
> solr-6.5.1.
> I want to use *webapp* for creating, controlling
Hi Gajanan,
Seeing as you are using 2.x, are you making sure that the project has been
built with the correct regex-urlfilter.txt being present on ClassPath and
included in the job jar you are using?
On Thu, Oct 11, 2018 at 12:19 AM wrote:
>
>
> From: Gajanan Watkar
> To:
Hi Gajanan,
CC dev@gora, this is something we may wish to implement within HBase.
If anything I've provided below is incorrect, then please correct the
record.
BTW, I found the following article written by Elis, to be extremely useful
Hi Gajanan,
Which OS are you running this on?
I would also suggest that if you want to use the 2.x codebase, you should
use the most recent from SCM e.g. check out master and change to 2.x branch.
Finally, for now at least, you didn't mention the phase at which the crawl
is failing. Can you
Hi Amarnatha,
There are a couple of options which I can think of.
1. Why don't you just set up a simple daemon to watch hadoop.log and
generate a subsequent stream writing it to /tmp/myurls.log e.g. tail -f
hadoop.log > /tmp/myurls.log
2. Check out conf/log4j.properties, you will see the
Hi Yossi,
REASON: Upgrade of MapReduce API from legacy to 'new'. This was a breaking
change for sure and a HUGE patch. We did not however factor in the
non-breaking aspects of the upgrade... so it has not all been plain sailing.
PROPOSED SOLUTION: I tend to agree with you that this should be addd
Hi Rustam,
There have been some efforts to Mavenize the entire build system. These all
died. If you look on JIRA you will see the relevant tickets for the most
recent implementation
https://issues.apache.org/jira/browse/NUTCH-2292
Our current build does not publish the Nutch plugins as Maven
Hi Puneet
Responses inline
On Wed, Aug 15, 2018 at 7:20 AM wrote:
>
> From: Puneet Dhanda
> To: user@nutch.apache.org
> Cc:
> Bcc:
> Date: Wed, 15 Aug 2018 10:02:12 -0400
> Subject: Nutch 2.3.1 with Mongo datastore - No Document is getting indexed.
> Hi,
>
> I am using the Nutch- 2.3.1 with
Excellent. Thanks for taking on release manager Seb, it’s making a huge
impact. Nice work folks.
On Tue, Aug 7, 2018 at 05:37 wrote:
>
> user Digest 7 Aug 2018 12:37:25 - Issue 2921
>
> Topics (messages 34124 through 34124)
>
> [RESULT] was [VOTE] Release Apache Nutch 1.15 RC#1
>
Excellent. Good job Omkar.
On Thu, Jun 21, 2018 at 1:18 AM, wrote:
>
>
> From: Sebastian Nagel
> To: d...@nutch.apache.org
> Cc: user@nutch.apache.org
> Bcc:
> Date: Thu, 21 Jun 2018 10:18:01 +0200
> Subject: [ANNOUNCE] New Nutch committer and PMC - Omkar Reddy
> Dear all,
>
> it is my
Hi Patricia,
I've never used a proxy auto-config (PAC) method for proxying anything
before. The PAC is defined as "...Proxy auto-configuration (PAC): Specify
the URL for a PAC file with a JavaScript function that determines the
appropriate proxy for each URL. This method is more suitable for
Hi Chip,
Which version of Nutch are you using?
On Tue, Apr 17, 2018 at 7:45 AM, wrote:
> From: Chip Calhoun
> To: "user@nutch.apache.org"
> Cc:
> Bcc:
> Date: Tue, 17 Apr 2018 14:45:01 +
> Subject: Nutch fetching
Hi Govind,
Please scope out https://github.com/apache/nutch/pull/306
Let me know how things go.
Lewis
On Mon, Apr 2, 2018 at 4:45 AM, wrote:
>
>
> From: govind nitk
> To: user@nutch.apache.org
> Cc:
> Bcc:
> Date: Mon, 2 Apr 2018
Patch it Markus.
On Wed, Mar 7, 2018 at 1:58 PM, wrote:
>
> From: Markus Jelsma
> To: User
> Cc:
> Bcc:
> Date: Wed, 7 Mar 2018 11:24:09 +
> Subject: index-metadata, lowercasing field names?
> Hi,
>
>
utch from master (rather than just copy a
> few jar files) and ensure that any23’s jar dependencies as also references..
>
> > On Feb 9, 2018, at 1:45 PM, Lewis John McGibbney <lewi...@apache.org>
> wrote:
> >
> > Hi David,
> > We are in the process of releasin
Comma-separated list of Any23 extractors (a list of
> extractors is available here:
> http://any23.apache.org/getting-started.html)
>
>
> I expected to see additional information from nutch parsechecker after adding
> the jsonld extractors, however I see NO changes t
Hi David,
Answers inline
On Thu, Feb 8, 2018 at 9:19 AM, wrote:
>
> From: David Ferrero
> To: user@nutch.apache.org
> Cc:
> Bcc:
> Date: Thu, 8 Feb 2018 10:19:52 -0700
> Subject: NUTCH-1129, Any23, microdata parsing, indexing, and
Hi Sheon,
It looks like HTTPS is not currently supported
https://github.com/apache/nutch/blob/master/src/plugin/protocol-selenium/src/java/org/apache/nutch/protocol/selenium/HttpResponse.java#L63
I can't recall if we were ever successful in adapting the plugin for HTTPS
so I can't advise further.
k <govind.n...@gmail.com>
> wrote:
>
> >
> > hi Lewis,
> >
> > uname -a: Linux data 4.4.0-108-generic #131~14.04.1-Ubuntu SMP Sun Jan 7
> > 15:54:10 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
> >
> > On Tue, Jan 9, 2018 at 7:56 PM, lewis john mcgibbney
Hi govind,
Very strange. Which operating system are you using?
Lewis
On Tue, Jan 9, 2018 at 5:15 AM, wrote:
> From: govind nitk
> To: user@nutch.apache.org
> Cc:
> Bcc:
> Date: Tue, 9 Jan 2018 15:45:08 +0530
> Subject: Getting Error
>
Hi Sheon,
Assuming that you are using Nutch master branch, please read
https://github.com/apache/nutch/blob/master/src/plugin/lib-selenium/howto_upgrade_selenium.txt
https://github.com/apache/nutch/tree/master/src/plugin/protocol-selenium
Make the relevant dependency updates and repackage the
Hi Devil,
Do your logs indicate any issues?
Lewis
On Mon, Dec 25, 2017 at 5:41 PM, wrote:
>
> -- Forwarded message --
> From: devil devil
> To: user@nutch.apache.org
> Cc:
> Bcc:
> Date: Fri, 22 Dec 2017 21:24:51 +0100
>
Hi Folks,
The Apache Gora team are pleased to announce the immediate availability of
Apache Gora 0.8.
The Apache Gora open source framework provides an in-memory data model and
persistence for big data. Gora supports persisting to
- column stores,
- key value stores,
- document stores,
Hi user@ and dev@,
As part of the Nutch Google Summer of Code effort this year, Omkar Reddy
and I have been working persistently throughout the summer months on the
Hadoop MapReduce API upgrade e.g. NUTCH-2375 Upgrade the code base from
org.apache.hadoop.mapred to org.apache.hadoop.mapreduce [0].
Hi Raziyeh,
Please see
https://wiki.apache.org/nutch/NutchRESTAPI#Configuration
Once you've created your new config, you can use it as follows
https://wiki.apache.org/nutch/NutchRESTAPI#Create_job
Lewis
On Fri, Aug 11, 2017 at 12:23 AM, wrote:
>
> From:
Hi Ray,
Apart from not being able to find a tutorial, what is wrong exactly?
New users of Nutch are advised to use the Nutch 1.X series.
The Nutch 2.X tutorial introduces more moving parts. This is well
documented on this mailing list for a number of years now.
If you can enumerate what is wrong,
Hi Folks,
I just updated the tutorial below, if you find any discrepancies please let
me know.
https://wiki.apache.org/nutch/NutchTutorial
Also, I have made available a new schema.xml which is compatible with Solr
6.6.0 at
https://issues.apache.org/jira/browse/NUTCH-2400
Please scope it out
Hi Pau,
On Sat, Jul 8, 2017 at 6:52 AM, wrote:
> From: Pau Paches
> To: user@nutch.apache.org
> Cc:
> Bcc:
> Date: Sat, 8 Jul 2017 15:52:46 +0200
> Subject: nutch 1.x tutorial with solr 6.6.0
> Hi,
> I have run the Nutch 1.x
Hi Dave,
Does this need to be done in parsing phase? Parsing is already an IO
intensive process... could you possibly do it at another phase?
Right now, the only plugin I can think of which ships with Nutch source,
and which consults an external resource (not packaged with Nutch) is the
Hi Vyacheslav,
Thanks for the update, can you please open a ticket at
https://issues.apache.org/jira/projects/NUTCH
If you are able to submit a pull request at https://github.com/apache/nutch/,
it would be appreciated.
Lewis
On Sat, Jun 24, 2017 at 9:36 AM,
Hi Vyacheslav,
Can you provide me an example page with http refresh tag included? I'll
try comparing behaviour between 2.X and master.
Thank you
Lewis
On Sat, Jun 17, 2017 at 9:25 AM, wrote:
> From: Vyacheslav Pascarel
> To:
Hi Vyacheslav,
Which version of Nutch are you using? 2.x?
lewis
On Wed, Jun 21, 2017 at 10:32 AM, wrote:
>
>
> From: Vyacheslav Pascarel
> To: "user@nutch.apache.org"
> Cc:
> Bcc:
> Date: Wed, 21 Jun 2017
Hi Vyacheslav,
On Thu, Jun 15, 2017 at 1:41 AM, wrote:
>
> From: Vyacheslav Pascarel
> To: "user@nutch.apache.org"
> Cc:
> Bcc:
> Date: Wed, 14 Jun 2017 22:15:49 +
> Subject: Outlinks field is not populated
Hi Dennis,
On Thu, Jun 15, 2017 at 1:41 AM, wrote:
>
> From: Dennis A
> To: user@nutch.apache.org
> Cc:
> Bcc:
> Date: Wed, 14 Jun 2017 20:45:35 +0200
> Subject: Re: Optimize Nutch Indexing Speed
> Hi Lewis,
> thank you for your
Hi Dennis,
On Sun, Jun 11, 2017 at 2:45 AM, wrote:
>
> From: Dennis A
> To: user@nutch.apache.org
> Cc:
> Bcc:
> Date: Fri, 9 Jun 2017 09:59:05 +0200
> Subject: Optimize Nutch Indexing Speed
> Hello,
> I have recently configured my
Hi Roannel,
Markus worked on this quite a bit a while back. Please see
https://issues.apache.org/jira/browse/NUTCH-1480.
If you were able to pick this back up and update the patch with a pull
request I would happily review it, test it and provide feedback.
On Wed, Jun 14, 2017 at 7:42 AM,
Hi Folks,
The Nutch PMC recently VOTE'd in Blackice to formally join our Nutch
Project Management Committee and as a Project Committer.
Please join me in offering a friendly welcome... not that he needs it. He's
been here for quite a while :)
@Blackice, feel free to say a bit about yourself if you
Forwarding with correct thread name.
-- Forwarded message --
From: lewis john mcgibbney <lewi...@apache.org>
Date: Mon, Jun 5, 2017 at 2:50 PM
Subject: Re: user Digest 3 Jun 2017 19:27:20 - Issue 2758
To: "user@nutch.apache.org" <user@nutch.apache.org>
Hi Ed,
Disappointing to hear that this really got under your skin... never nice to
hear that frustration becomes the outcome rather than successfully running
the software. I've provided comments below
On Sat, Jun 3, 2017 at 12:27 PM, wrote:
>
> From: Edward
Hi Yongyao,
The code in question is found below
https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/crawl/Generator.java#L230-L232
A few things come to mind here...
* are you sure the entries with a lower score than the minimum threshold
were not present before you established
Hi Folks,
Maven build is actually pretty close now. We need to bring the following
branch up-to-date with 1.14 then stabilize tests... then it is good to
propose as a PR for 1.14-SNAPSHOT.
Transferring this work over to 2.x will be much easier than the work done
for master branch.
I'm over on
Hello Folks,
The Apache Nutch [0] Project Management Committee are pleased to announce
the immediate release of Apache Nutch v1.13, we advise all current users
and developers of the 1.X series to upgrade to this release.
Nutch is a well-matured, production-ready Web crawler. Nutch 1.x enables
Hi Folks,
Thank you to everyone who was able to review the RC and VOTE, greatly
appreciated.
72 hours have come and gone, please see below for RESULT's.
[9] +1 Release this package as Apache Nutch 1.13.
Lewis John McGibbney *
Julien Nioche *
Kevin Ratnasekera
Chris A. Mattmann *
Furkan KAMACI *
Matei
Hi Yongyao,
In addition to Seb's response, please also check out the
'scoring.filter.order' property in nutch-site.xml
https://github.com/apache/nutch/blob/master/conf/nutch-default.xml#L1429-L1437
This will determine the order and provide you with more control over
complex scoring logic.
Lewis
Hi Folks,
A first candidate for the Nutch 1.13 release is available at:
https://dist.apache.org/repos/dist/dev/nutch/1.13/
The release candidate is a zip and tar.gz archive of the binary and sources
in:
https://github.com/apache/nutch/tree/release-1.13
The SHA1 checksum of the archive is
Hi suyash,
This issue can be addressed by essentially, commenting OUT all of the
instances where the WebPage [0] object is augmented within each job (and
possibly plugin).
An example would be as follows
https://github.com/apache/nutch/blob/2.x/src/java/org/apache/nutch/parse/ParseUtil.java#L358
Hi Daniele,
In short, if I were you I would look into using the readdb resource
https://wiki.apache.org/nutch/bin/nutch%20readdb
This will enable you to take a peek into your MongoDB table and find out
which documents are present. By the looks of it from your Gist nothing is
being fetched and
Hi Michael,
On Sat, Nov 19, 2016 at 8:09 AM, wrote:
> From: Michael Coffey
> To: "user@nutch.apache.org"
> Cc:
> Date: Fri, 18 Nov 2016 21:15:14 + (UTC)
> Subject: indexing to Solr
> Where can I find
Hi Eyeris,
Replies inline
On Fri, Oct 28, 2016 at 8:51 PM, wrote:
> From: Eyeris Rodriguez Rueda
> To: user@nutch.apache.org
> Cc:
> Date: Fri, 28 Oct 2016 09:43:59 -0400 (CDT)
> Subject: how to insert nutch into ambari ecosystem ?
> Hi all.
>
Hi Eyeris,
I've just tried Nutch master branch to parse outlinks from a number of RSS
Feeds, an example being 'http://www.jpl.nasa.gov/blog/feed/'. This works
perfectly with both the feed and parse-tika plugins. Outlinks are extracted
accordingly.
Can you provide an example of the RSS Feeds you
Hi Vladimir,
Responses inline
On Thu, Nov 10, 2016 at 1:05 AM, wrote:
> From: Vladimir Loubenski
> To: "user@nutch.apache.org"
> Cc:
> Date: Tue, 8 Nov 2016 17:53:59 +
> Subject: Nutch 2.3.1 REST calls to DB
Hi Michael,
Replies inline
On Sat, Nov 12, 2016 at 7:10 PM, wrote:
> From: Michael Coffey
> To: "user@nutch.apache.org"
> Cc:
> Date: Sun, 13 Nov 2016 03:07:16 + (UTC)
> Subject: How can I Score?
> When
Hi Joe,
On Fri, Oct 21, 2016 at 7:34 AM, wrote:
> From: Joe Adams
> To: user@nutch.apache.org
> Cc:
> Date: Fri, 21 Oct 2016 10:34:15 -0400
> Subject: Nutch 2.3.1 elasticsearch tstamp
> I'm working on setting up nutch with elasticsearch
Hi Tom,
Please post your entire Nutch log for the inject and generate phase if
possible. It is near impossible to debug given the information you've
provided.
Thanks
On Fri, Oct 21, 2016 at 7:34 AM, wrote:
> From: Tom Chiverton
> To:
Hi Tom,
This looks like it has been frustrating for you so I've provided a walk
through of how I can set up a core using current Nutch 2.X schema.xml
On Mon, Oct 17, 2016 at 9:27 AM, wrote:
>
> From: Tom Chiverton
> To:
Hi Sachin,
Answering both of your questions here as I am catching up with some mail.
On Fri, Sep 30, 2016 at 5:04 AM, wrote:
>
> From: Sachin Shaju
> To: user@nutch.apache.org
> Cc:
> Date: Fri, 30 Sep 2016 10:00:04 +0530
> Subject: Re: