[ANNOUNCE] Apache Nutch 1.20 Release

2024-04-28 Thread lewis john mcgibbney
The Apache Nutch Project https://nutch.apache.org/download/ Please verify signatures using the KEYS file https://raw.githubusercontent.com/apache/nutch/master/KEYS when downloading the release. This release includes more than 60 bug fixes and improvements, the full list of changes can be seen in

[RESULT] WAS Re: [VOTE] Apache Nutch 1.20 Release

2024-04-24 Thread lewis john mcgibbney
ttee-binding The Nutch 1.20 release candidate has passed the community VOTE. I will therefore promote this release candidate. Thanks for VOTE’ing and for everyone who contributed to the Apache Nutch 1.20 release. lewismc On Tue, Apr 9, 2024 at 2:28 PM lewis john mcgibbney wrote: > H

Re: Help posting question

2024-04-24 Thread Lewis John McGibbney
Hi Sheham, On 2024/04/20 08:47:41 Sheham Izat wrote: > The Fetcher job was aborted, does that still mean that it went through the > entire list of seed urls? Yes it processed the entire generated segment but the fetcher… * hung on https://disneyland.disney.go.com/, https://api.onlyoffice.com/,

Re: Help posting question

2024-04-19 Thread Lewis John McGibbney
Hi Sheham, On 2024/04/19 15:18:01 Sheham Izat wrote: > > My questions are: > > 1) What do I need to do to get Nutch to continue working even if there are > hung threads? From what I can see in the log you provided, nothing is preventing Nutch from continuing to work. The Fetcher job

Re: [VOTE] Apache Nutch 1.20 Release

2024-04-16 Thread lewis john mcgibbney
Hi user@, dev@, Please consider reviewing the Nutch 1.20 release candidate. This is a critical prerequisite for us making releases of software at TheASF. Thank you lewismc On Tue, Apr 9, 2024 at 2:28 PM lewis john mcgibbney wrote: > Hi Folks, > > A first candidate for the Nutch 1.2

[VOTE] Apache Nutch 1.20 Release

2024-04-09 Thread lewis john mcgibbney
Hi Folks, A first candidate for the Nutch 1.20 release is available at [0] where accompanying SHA512 and ASC signatures can also be found. Information on verifying releases can be found at [1]. The release candidate comprises a .zip and tar.gz archive of the sources at [2] and complementary
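
The SHA512 check mentioned in the message above could be sketched roughly as follows. This is a minimal sketch, assuming the release archive has already been downloaded and the published digest has been copied out of its .sha512 file as a plain hex string; the helper names are hypothetical and not part of any Nutch or ASF tooling. It covers only the checksum, so the ASC signature still needs to be verified separately with GPG.

```python
import hashlib


def sha512_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-512 digest of the file at `path`, read in chunks."""
    digest = hashlib.sha512()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large archives are not loaded into memory at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_matches(path, published_digest):
    """Compare the computed digest with the published hex digest.

    Whitespace and letter case in `published_digest` are ignored, since
    published digests are sometimes grouped or upper-cased.
    """
    expected = "".join(published_digest.split()).lower()
    return sha512_of_file(path) == expected
```

A caller would pass the local archive path and the hex digest; a False result means the download should not be trusted.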

[GSoC 2024 PROPOSAL] Overhaul the legacy Nutch plugin framework and replace it with PF4J

2024-03-12 Thread lewis john mcgibbney
Hi user@ & dev@, I decided to write up a GSoC’24 proposal and encourage interested applicants to register your interest in the JIRA issue or else reach out to the Nutch PMC over on d...@nutch.apache.org (please CC lewi...@apache.org). Title: Overhaul the legacy Nutch plugin framework and replace

Re: [DISCUSS] Removing Any23 from Nutch?

2023-09-14 Thread lewis john mcgibbney
+1 Tim. On Wed, Sep 13, 2023 at 16:50 > > > > -- Forwarded message -- > From: Tim Allison > To: user@nutch.apache.org, d...@nutch.apache.org > Cc: > Bcc: > Date: Wed, 13 Sep 2023 10:50:08 -0400 > Subject: [DISCUSS] Removing Any23 from Nutch? > All, > I opened

Re: user Digest 8 Nov 2022 10:16:05 -0000 Issue 3169

2022-11-08 Thread lewis john mcgibbney
Hi Mike, Yes it is possible to extend the TLD list. In fact, when the TLD list was compiled the author left a note explicitly stating that it may not be complete. https://github.com/apache/nutch/blob/master/conf/domain-suffixes.xml.template Please submit a PR if you wish to make any changes or

Re: Unable to fetch data from segment folder

2022-01-11 Thread Lewis John McGibbney
I created https://issues.apache.org/jira/browse/NUTCH-2931 to track all of this work. If you are interested in working on any of this it would be great to collaborate. There is much more we can do over and above the few tickets I created. lewismc On 2021/12/24 10:07:20 sw.l...@quandatics.com

Re: Unable to fetch data from segment folder

2022-01-11 Thread Lewis John McGibbney
Hi Shi Wei, I missed this thread over the holidays! Which version of Nutch are you using? The REST API needs quite a bit of attention. It is not a particularly mature aspect of the Nutch codebase and there is a catalog of issues which need to be addressed. If you are interested in learning

!! Join the #nutch Slack channel !!

2021-12-29 Thread lewis john mcgibbney
Hi user@, dev@, I took the liberty of setting up a #nutch channel for our community to communicate in a lower latency manner. First join the-asf.slack.com Slack workspace https://infra.apache.org/slack.html Then simply join the #nutch channel. See you there :) Thanks lewismc --

Re: Nutch not crawling all URLs

2021-12-13 Thread lewis john mcgibbney
Hi Roseline, Looks like you are ignoring external URLs… that could be the problem right there. I encourage you to track counters on inject, generate and fetch phases to understand where records may be being dropped. Are the seeds you are using public? If so please post your seed file so we can

Re: Running Nutch on Hadoop with S3 filesystem; 'URLNormlizer not found'

2021-07-16 Thread Lewis John McGibbney
Hi Clark, This is a lot of information... thank you for compiling it all. Ideally the version of Hadoop being used with Nutch should ALWAYS match the Hadoop binaries referenced in https://github.com/apache/nutch/blob/master/ivy/ivy.xml. This way you won't run into the classpath issues. I would

Re: Crawling pages behind SSO authentication (SAML/OIDC)

2021-07-08 Thread Lewis John McGibbney
nUpCycle: 5000 > gridTimeout: 360 > gridBrowserTimeout: 120 > gridMaxSession: 5 > gridUnregisterIfStillDownAfter: 60 > chrome: > enabled: true > image: "selenium/node-chrome" > tag: "3.141.59" > replicas: 60 > nodeMaxSession: 5 > nodeRegistr

Looking for testers - Nutch Dockerfile

2021-07-01 Thread Lewis John McGibbney
Hi user@, Are you interested in the Nutch Dockerfile? If so, keep reading. We are looking for some assistance to test proposed additions to the Nutch Dockerfile. Essentially the changes would facilitate installing and running the Nutch REST server and/or the Nutch WebApp in addition to the Nutch

Re: Crawling pages behind SSO authentication (SAML/OIDC)

2021-07-01 Thread Lewis John McGibbney
Hi Abhay, On 2021/06/10 22:27:42, Abhay Ratnaparkhi wrote: > > Based on selenium I created a microservice (which handles all required SSO > redirections/ OTP handlings etc) and hosted that with a selenium grid in > the kubernetes cluster for scaling. > I found that we couldn't scale this

Re: Crawling pages behind SSO authentication (SAML/OIDC)

2021-06-12 Thread lewis john mcgibbney
, 2021 at 17:36 wrote: > > user Digest 13 Jun 2021 00:36:36 - Issue 3108 > > Topics (messages 34633 through 34634) > > Re: Apache Nutch help request for a school project :) > 34633 by: lewis john mcgibbney > > Re: Crawling pages behind SSO authentication

Re: Apache Nutch help request for a school project :)

2021-06-10 Thread lewis john mcgibbney
; On 2021-06-07 20:08, lewis john mcgibbney wrote: > > Yep Sebastian is absolutely correct. I sent you a pull request. > > > > https://github.com/gorkemyontem/nutch/pull/1 > > HTH > > lewismc > > > > On Mon, Jun 7, 2021 at 6:18 AM lewis john mcgibbney >

Re: Crawling pages behind SSO authentication (SAML/OIDC)

2021-06-09 Thread Lewis John McGibbney
Hi Abhay, This is a problem space we looked at a while ago and made quite a bit of progress on. Firstly, the protocol-httpclient plugin has been considered in a deprecated state for a while. https://github.com/apache/nutch/tree/master/src/plugin/protocol-httpclient I'm pretty sure that it will

Re: Apache Nutch help request for a school project :)

2021-06-07 Thread lewis john mcgibbney
Yep Sebastian is absolutely correct. I sent you a pull request. https://github.com/gorkemyontem/nutch/pull/1 HTH lewismc On Mon, Jun 7, 2021 at 6:18 AM lewis john mcgibbney wrote: > I’ll have a look today. You can always use the mailing list as well. Feel > free to post your que

Re: Apache Nutch help request for a school project :)

2021-06-07 Thread lewis john mcgibbney
I’ll have a look today. You can always use the mailing list as well. Feel free to post your questions there and we will help you out :) On Sun, Jun 6, 2021 at 12:43 gokmen.yontem wrote: > Hi Lewis, > Sorry to bother you. I've been trying to configure Apache Nutch for > almost 10 days now and

DuplexWeb-Google - GoogleBot Crawler For Duplex / Google Assistant

2021-06-03 Thread lewis john mcgibbney
Some interesting content for a short read :) https://www.seroundtable.com/duplexweb-google-bot-31522.html?utm_source=search_engine_roundtable_campaign=ser_newsletter_2021-06-03_medium=email -- http://home.apache.org/~lewismc/ http://people.apache.org/keys/committer/lewismc

Re: Recommendation for free and production-ready Hadoop setup to run Nutch

2021-06-03 Thread lewis john mcgibbney
Hi Sebastian, If we did not know how long our crawl infrastructure was required for (i.e. the customer may revoke or extend the contract with very little notice) we always chose AWS EMR. Specifically to reduce costs we made sure that all worker/task nodes were run on spot instances

Re: Crawling same domain URL's

2021-05-09 Thread Lewis John McGibbney
Hi Prateek, mapred.map.tasks --> mapreduce.job.maps mapred.reduce.tasks --> mapreduce.job.reduces You should be able to override these in nutch-site.xml then publish to your Hadoop cluster. lewismc On 2021/05/07 15:18:38, prateek wrote: > Hi, > > I am trying to crawl URLs
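
The property renaming in the message above can be illustrated with a small helper that emits override entries in the usual Hadoop-style `<property>` form used by nutch-site.xml. This is only a sketch: the two key names come from the message, while the helper name and the task counts are made-up examples.

```python
from xml.sax.saxutils import escape

# Deprecated Hadoop keys and their replacements, as listed in the message above.
RENAMED_KEYS = {
    "mapred.map.tasks": "mapreduce.job.maps",
    "mapred.reduce.tasks": "mapreduce.job.reduces",
}


def property_xml(name, value):
    """Render one <property> element, translating a deprecated key if needed."""
    key = RENAMED_KEYS.get(name, name)
    return (
        "<property>\n"
        f"  <name>{escape(key)}</name>\n"
        f"  <value>{escape(str(value))}</value>\n"
        "</property>"
    )


# Example values only; choose task counts that suit the cluster.
print(property_xml("mapred.map.tasks", 8))
print(property_xml("mapreduce.job.reduces", 4))
```

The printed elements would go inside the `<configuration>` element of nutch-site.xml.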

Re: Writing Nutch data in Parquet format

2021-05-06 Thread Lewis John McGibbney
Hi Seb, Really interesting. Thanks for the response. Below On 2021/05/05 11:42:04, Sebastian Nagel wrote: > > Yes, but not directly - it's a multi-step process. As I expected ;) > > This Parquet index is optimized by sorting the rows by a special form of the > URL [1] which > - drops

Writing Nutch data in Parquet format

2021-05-04 Thread Lewis John McGibbney
Hi user@, Has anyone experimented/accomplished either 1) writing Nutch data directly as Parquet format, or 2) post-processing (Nutch) Hadoop sequence data by converting it to Parquet format? Thank you lewismc

[ANNOUNCE] Apache Nutch 1.18 Release

2021-01-24 Thread lewis john mcgibbney
*What?* The Apache Nutch team is pleased to announce the release of Apache Nutch v1.18. Nutch is a well matured, production ready Web crawler. Nutch 1.x enables fine grained configuration, relying on Apache Hadoop™ data structures. *Where?* Source and binary distributions are available for

[RESULT] WAS Re: [VOTE] Release Apache Nutch 1.18 RC1

2021-01-24 Thread lewis john mcgibbney
user@, dev@, The 72hr VOTE'ing period has elapsed. The RESULT's are as follows [5] +1 Release this package as Apache Nutch 1.18. Lewis John McGibbney* Ralf Kotowski* Jorge Luis Betancourt Gonzalez* Sebastian Nagel* Shashanka Balakuntala Srinivasa* [0] -1 Do not release this package because

[VOTE] Release Apache Nutch 1.18 RC1

2021-01-20 Thread lewis john mcgibbney
Hi Folks, A first candidate for the Nutch 1.18 release is available at [0] where accompanying SHA512, ASC and MD5 signatures can also be found. Information on verifying releases can be found at [1]. The release candidate is a .zip and tar.gz archive of the sources in [2] In addition, a staged

Re: Extract all image and video links from a web page

2021-01-20 Thread Lewis John McGibbney
Hi Prateek, On 2021/01/19 15:58:29, prateek wrote: > Is the only other option is to > override HtmlParseFilter and add a new plugin? Yes I think it is. > > Also regarding separate objects, what i meant is if i store the image links > in Outlink, then those links will also be stored in DB

Re: Extract all image and video links from a web page

2021-01-14 Thread lewis john mcgibbney
Hi prateek, Please see my comment inline below On Thu, Jan 14, 2021 at 6:39 AM wrote: > > One of the requirements I have is to extract all > the image and video links from the html in a separate object. Since I have > the html content, I can use a library like jsoup to parse the content and >

Re: NutchTutorial error

2020-09-24 Thread lewis john mcgibbney
Hi Moldable, Thanks for taking the time to write. Please see my responses inline below On Wed, Sep 23, 2020 at 10:38 PM Moldable wrote: > Hi, > > Sorry for the email sent randomly to you. I am trying to report a tiny > issue in the nutch documentation and I can't figure out how to do it >

Re: Facing Gora exception in Nutch 2.4

2020-09-20 Thread lewis john mcgibbney
Hi Gajalakshmi.G, Firstly, it's important for me to state that Nutch 2.X is deprecated. No more development is happening on the 2.X branch. That being said, please see my comments inline below On Thu, Sep 17, 2020 at 7:45 AM wrote: > > I am using Nutch 2.4 with Hadoop 3.1.1 To the best of my

Re: Nutch 1.17 download available?

2020-06-07 Thread Lewis John McGibbney
Hi Jim, Response below On 2020/06/06 14:23:24, Jim Anderson wrote: > > I cannot find a download for Nutch 1.17. Is Nutch 1.17 available for > download? If so, can someone please give me a pointer. > Nutch 1.17 is current master branch e.g. in development, meaning that there is no official

Re: [DISCUSS] Release 1.17 ?

2020-04-23 Thread lewis john mcgibbney
Hi Seb, Go for it. I’ll happily review. Excellent work folks... really excellent work. lewismc On Wed, Apr 22, 2020 at 23:27 wrote: > > user Digest 23 Apr 2020 06:27:46 - Issue 3055 > > Topics (messages 34517 through 34517) > > [DISCUSS] Release 1.17 ? > 34517 by: Sebastian Nagel >

[SECURITY] Nutch 2.3.1 affected by downstream dependency CVE-2016-6809

2019-10-14 Thread lewis john mcgibbney
Title: Nutch 2.3.1 affected by downstream dependency CVE-2016-6809 Vulnerable Versions: 2.3.1 (1.16 is not vulnerable) Disclosure date: 2018-10-22 Credit: Pierre Ernst, Salesforce Summary: Remote Code Execution in Apache Nutch 2.3.1 when crawling web site containing malicious content

Re: [VOTE] Release Apache Nutch 2.4 RC#1

2019-10-01 Thread lewis john mcgibbney
I've opened >https://issues.apache.org/jira/browse/NUTCH-2741 > to remove it. > > Best, > Sebastian > > On 28.09.19 17:54, lewis john mcgibbney wrote: > > Hi Seb, > > > > On Thu, Sep 26, 2019 at 4:37 AM > wrote: > > > >> From: Sebastian Na

Re: [VOTE] Release Apache Nutch 2.4 RC#1

2019-09-28 Thread lewis john mcgibbney
Hi Seb, On Thu, Sep 26, 2019 at 4:37 AM wrote: > From: Sebastian Nagel > To: user@nutch.apache.org > Cc: d...@nutch.apache.org > Bcc: > Date: Tue, 24 Sep 2019 11:54:48 +0200 > Subject: [VOTE] Release Apache Nutch 2.4 RC#1 > Hi Folks, > > A first candidate for the Nutch 2.4 release is available

Re: Injection from webservice

2019-09-19 Thread lewis john mcgibbney
Hi Folks, I've implemented what Dave suggested... it is clean and easy but it is maybe not quite as ad-hoc-capable as one would always want. For my use cases it was acceptable. More responses inline On Thu, Sep 19, 2019 at 2:47 PM wrote: > From: Jorge Betancourt > To: user@nutch.apache.org > Cc:

Mavenize Nutch Build as Google Summer of Code

2019-03-09 Thread lewis john mcgibbney
Hi user@ and dev@, If you are a student and would like to tackle the task of Mavenizing the Nutch master build please get in touch with me here, directly or comment on the following issue https://issues.apache.org/jira/plugins/servlet/mobile#issue/NUTCH-2292 Thank you Lewis --

Re: mapred.child.java.opts

2018-12-08 Thread Lewis John McGibbney
Hi Hany, Yes the parameter is set to 1GB by default but it should also be noted that this configuration key is actually deprecated as of some time ago. Seeing as we are using the 'new' MapReduce API, I suspect we should use 'mapreduce.map.java.opts' and 'mapreduce.reduce.java.opts' instead so

Re: [ask] Crawl Forum Site

2018-12-03 Thread lewis john mcgibbney
Hi Tukang, In short yes. It would help if you could provide an example of what you've tried and what you encountered/what your results were. Lewis On Mon, Dec 3, 2018 at 6:42 PM wrote: > > From: tkg_cangkul > To: user@nutch.apache.org > Cc: > Bcc: > Date: Tue, 04 Dec 2018 09:40:47 +0700 >

Re: Apache Nutch vs Multiple elasticsearch nodes

2018-11-28 Thread lewis john mcgibbney
Hi Marcello, I don't think this is correct, no! First, however, I really suggest that we upgrade the Jest client in this plugin. The most recent one is 6.3.1 and we are using 2.0.3. Please see https://issues.apache.org/jira/browse/NUTCH-2677, if you are able to provide a patch and test it out it

Re: webapp for Nutch deploy mode

2018-10-18 Thread Lewis John McGibbney
Hi Gahanna, Response inline On 2018/10/12 07:40:50, Gajanan Watkar wrote: > Hi all, > I am using Nutch 2.3.1 with Hbase-1.2.3 as storage backend on top of > Hadoop-2.5.2 cluster in *deploy mode* with crawled data being indexed to > solr-6.5.1. > I want to use *webapp* for creating, controlling

Re: Unable to get regex-urlfilter working

2018-10-11 Thread lewis john mcgibbney
Hi Gajanan, Seeing as you are using 2.x, are you making sure that the project has been built with the correct regex-urlfilter.txt being present on ClassPath and included in the job jar you are using? On Thu, Oct 11, 2018 at 12:19 AM wrote: > > > From: Gajanan Watkar > To:

Uneven HBase region sizes WAS Re: Nodemanager crashing repeatedly

2018-09-19 Thread lewis john mcgibbney
Hi Gajanan, CC dev@gora, this is something we may wish to implement within HBase. If anything I've provided below is incorrect, then please correct the record. BTW, I found the following article written by Elis, to be extremely useful

Re: Nodemanager crashing repeatedly

2018-09-06 Thread lewis john mcgibbney
Hi Gajanan, Which OS are you running this on? I would also suggest that if you want to use the 2.x codebase, you should use the most recent from SCM e.g. check out master and change to 2.x branch. Finally, for now at least, you didn't mention the phase at which the crawl is failing. Can you

Re: redirect bin/crwal log output to some other file

2018-09-06 Thread lewis john mcgibbney
Hi Amarnatha, There are a couple of options which I can think of. 1. Why don't you just set up a simple daemon to watch hadoop.log and generate a subsequent stream writing it to /tmp/myurls.log e.g. tail -f hadoop.log > /tmp/myurls.log 2. Check out conf/log4j.properties, you will see the
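
The first option above (watching hadoop.log and streaming it to another file) could also be approximated in a small script. This is a minimal sketch, not an exact equivalent of `tail -f`: it copies from a byte offset chosen by the caller and appends to the target rather than truncating it. The function name is hypothetical.

```python
def copy_new_lines(source_path, target_path, offset=0):
    """Append everything written to `source_path` after `offset` to `target_path`.

    Returns the new offset so the caller can poll again later.
    """
    with open(source_path, "rb") as source:
        source.seek(offset)
        data = source.read()
    # Append mode keeps whatever earlier polls already copied.
    with open(target_path, "ab") as target:
        target.write(data)
    return offset + len(data)
```

A watcher would call this in a loop, sleeping briefly between calls and feeding each returned offset into the next call. It assumes the log only grows; log rotation would need extra handling.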

Re: IndexWriter interface in 1.15

2018-09-06 Thread lewis john mcgibbney
Hi Yossi, REASON: Upgrade of MapReduce API from legacy to 'new'. This was a breaking change for sure and a HUGE patch. We did not however factor in the non-breaking aspects of the upgrade... so it has not all been plain sailing. PROPOSED SOLUTION: I tend to agree with you that this should be addd

Re: Nutch Maven support for plugins

2018-08-29 Thread lewis john mcgibbney
Hi Rustam, There have been some efforts to Mavenize the entire build system. These all died. If you look on JIRA you will see the relevant tickets for the most recent implementation https://issues.apache.org/jira/browse/NUTCH-2292 Our current build does not publish the Nutch plugins as Maven

Re: Nutch 2.3.1 with Mongo datastore - No Document is getting indexed.

2018-08-16 Thread lewis john mcgibbney
Hi Puneet Responses inline On Wed, Aug 15, 2018 at 7:20 AM wrote: > > From: Puneet Dhanda > To: user@nutch.apache.org > Cc: > Bcc: > Date: Wed, 15 Aug 2018 10:02:12 -0400 > Subject: Nutch 2.3.1 with Mongo datastore - No Document is getting indexed. > Hi, > > I am using the Nutch- 2.3.1 with

[RESULT] was [VOTE] Release Apache Nutch 1.15 RC#1

2018-08-07 Thread lewis john mcgibbney
Excellent. Thanks for taking on release manager Seb, it’s making a huge impact. Nice work folks. On Tue, Aug 7, 2018 at 05:37 wrote: > > user Digest 7 Aug 2018 12:37:25 - Issue 2921 > > Topics (messages 34124 through 34124) > > [RESULT] was [VOTE] Release Apache Nutch 1.15 RC#1 >

Re: [ANNOUNCE] New Nutch committer and PMC - Omkar Reddy

2018-06-21 Thread lewis john mcgibbney
Excellent. Good job Omkar. On Thu, Jun 21, 2018 at 1:18 AM, wrote: > > > From: Sebastian Nagel > To: d...@nutch.apache.org > Cc: user@nutch.apache.org > Bcc: > Date: Thu, 21 Jun 2018 10:18:01 +0200 > Subject: [ANNOUNCE] New Nutch committer and PMC - Omkar Reddy > Dear all, > > it is my

Re: No internet connection in Nutch crawler: Proxy configuration -PAC file

2018-04-23 Thread lewis john mcgibbney
Hi Patricia, I've never used a proxy auto-config (PAC) method for proxying anything before. The PAC is defined as "...Proxy auto-configuration (PAC): Specify the URL for a PAC file with a JavaScript function that determines the appropriate proxy for each URL. This method is more suitable for

Re: Nutch fetching times out at 3 hours, not sure why.

2018-04-18 Thread lewis john mcgibbney
Hi Chip, Which version of Nutch are you using? On Tue, Apr 17, 2018 at 7:45 AM, wrote: > From: Chip Calhoun > To: "user@nutch.apache.org" > Cc: > Bcc: > Date: Tue, 17 Apr 2018 14:45:01 + > Subject: Nutch fetching

Re: any23 2.2 upgrading in NUTCH gives errors

2018-04-02 Thread lewis john mcgibbney
Hi Govind, Please scope out https://github.com/apache/nutch/pull/306 Let me know how things go. Lewis On Mon, Apr 2, 2018 at 4:45 AM, wrote: > > > From: govind nitk > To: user@nutch.apache.org > Cc: > Bcc: > Date: Mon, 2 Apr 2018

Re: index-metadata, lowercasing field names?

2018-03-07 Thread lewis john mcgibbney
Patch it Markus. On Wed, Mar 7, 2018 at 1:58 PM, wrote: > > From: Markus Jelsma > To: User > Cc: > Bcc: > Date: Wed, 7 Mar 2018 11:24:09 + > Subject: index-metadata, lowercasing field names? > Hi, > >

Re: NUTCH-1129, Any23, microdata parsing, indexing, and extraction?

2018-02-12 Thread lewis john mcgibbney
utch from master (rather than just copy a > few jar files) and ensure that any23’s jar dependencies as also references.. > > > On Feb 9, 2018, at 1:45 PM, Lewis John McGibbney <lewi...@apache.org> > wrote: > > > > Hi David, > > We are in the process of releasin

Re: NUTCH-1129, Any23, microdata parsing, indexing, and extraction?

2018-02-09 Thread Lewis John McGibbney
Comma-separated list of Any23 extractors (a list of > extractors is available here: > http://any23.apache.org/getting-started.html) > > > I expected to see additional information from nutch parsechecker after adding > the jsonld extractors, however I see NO changes t

Re: NUTCH-1129, Any23, microdata parsing, indexing, and extraction?

2018-02-08 Thread lewis john mcgibbney
Hi David, Answers inline On Thu, Feb 8, 2018 at 9:19 AM, wrote: > > From: David Ferrero > To: user@nutch.apache.org > Cc: > Bcc: > Date: Thu, 8 Feb 2018 10:19:52 -0700 > Subject: NUTCH-1129, Any23, microdata parsing, indexing, and

Re: Can I use protocol-selenium with https?

2018-01-15 Thread lewis john mcgibbney
Hi Sheon, It looks like HTTPS is not currently supported https://github.com/apache/nutch/blob/master/src/plugin/protocol-selenium/src/java/org/apache/nutch/protocol/selenium/HttpResponse.java#L63 I can't recall if we were ever successful in adapting the plugin for HTTPS so I can't advise further.

Re: Getting Error

2018-01-11 Thread lewis john mcgibbney
k <govind.n...@gmail.com> > wrote: > > > > > hi Lewis, > > > > uname -a: Linux data 4.4.0-108-generic #131~14.04.1-Ubuntu SMP Sun Jan 7 > > 15:54:10 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux > > > > On Tue, Jan 9, 2018 at 7:56 PM, lewis john mcgibbney

Re: Getting Error

2018-01-09 Thread lewis john mcgibbney
Hi govind, Very strange. Which operating system are you using? Lewis On Tue, Jan 9, 2018 at 5:15 AM, wrote: > From: govind nitk > To: user@nutch.apache.org > Cc: > Bcc: > Date: Tue, 9 Jan 2018 15:45:08 +0530 > Subject: Getting Error >

Re: upgrading Selenium is causing errors

2018-01-03 Thread lewis john mcgibbney
Hi Sheon, Assuming that you are using Nutch master branch, please read https://github.com/apache/nutch/blob/master/src/plugin/lib-selenium/howto_upgrade_selenium.txt https://github.com/apache/nutch/tree/master/src/plugin/protocol-selenium Make the relevant dependency updates and repackage the

Re: Nutch 2.x does not send index to ElasticSearch 2.3.3

2017-12-26 Thread lewis john mcgibbney
Hi Devil, Do your logs indicate any issues? Lewis On Mon, Dec 25, 2017 at 5:41 PM, wrote: > > -- Forwarded message -- > From: devil devil > To: user@nutch.apache.org > Cc: > Bcc: > Date: Fri, 22 Dec 2017 21:24:51 +0100 >

[ANNOUNCE] Apache Gora 0.8 Release

2017-09-20 Thread lewis john mcgibbney
Hi Folks, The Apache Gora team are pleased to announce the immediate availability of Apache Gora 0.8. The Apache Gora open source framework provides an in-memory data model and persistence for big data. Gora supports persisting to - column stores, - key value stores, - document stores,

Request for Review

2017-09-06 Thread lewis john mcgibbney
Hi user@ and dev@, As part of the Nutch Google Summer of Code effort this year, Omkar Reddy and I have been working persistently throughout the summer months on the Hadoop MapReduce API upgrade e.g. NUTCH-2375 Upgrade the code base from org.apache.hadoop.mapred to org.apache.hadoop.mapreduce [0].

Re: nutch server with different configs

2017-08-14 Thread lewis john mcgibbney
Hi Raziyeh, Please see https://wiki.apache.org/nutch/NutchRESTAPI#Configuration Once you've created your new config, you can use it as follows https://wiki.apache.org/nutch/NutchRESTAPI#Create_job Lewis On Fri, Aug 11, 2017 at 12:23 AM, wrote: > > From:

Re: I'm just going to throw this out there...

2017-08-14 Thread lewis john mcgibbney
Hi Ray, Apart from not being able to find a tutorial, what is wrong exactly? New users of Nutch are advised to use the Nutch 1.X series. The Nutch 2.X tutorial introduces more moving parts. This is well documented on this mailing list for a number of years now. If you can enumerate what is wrong,

Re: nutch 1.x tutorial with solr 6.6.0

2017-07-12 Thread lewis john mcgibbney
Hi Folks, I just updated the tutorial below, if you find any discrepancies please let me know. https://wiki.apache.org/nutch/NutchTutorial Also, I have made available a new schema.xml which is compatible with Solr 6.6.0 at https://issues.apache.org/jira/browse/NUTCH-2400 Please scope it out

Re: nutch 1.x tutorial with solr 6.6.0

2017-07-09 Thread lewis john mcgibbney
Hi Pau, On Sat, Jul 8, 2017 at 6:52 AM, wrote: > From: Pau Paches > To: user@nutch.apache.org > Cc: > Bcc: > Date: Sat, 8 Jul 2017 15:52:46 +0200 > Subject: nutch 1.x tutorial with solr 6.6.0 > Hi, > I have run the Nutch 1.x

Re: Custom Plugin Resources Files

2017-06-29 Thread lewis john mcgibbney
Hi Dave, Does this need to be done in parsing phase? Parsing is already an IO intensive process... could you possibly do it at another phase? Right now, the only plugin I can think of which ships with Nutch source, and which consults an external resource (not packaged with Nutch) is the

Re: ERROR: Cannot run job worker!

2017-06-24 Thread lewis john mcgibbney
Hi Vyacheslav, Thanks for the update, can you please open a ticket at https://issues.apache.org/jira/projects/NUTCH If you are able to submit a pull request at https://github.com/apache/nutch/, it would be appreciated. Lewis On Sat, Jun 24, 2017 at 9:36 AM,

Re: Outlinks field is not populated when page from seed URL when fetched page contains "refresh" meta tag

2017-06-22 Thread lewis john mcgibbney
Hi Vyacheslav, Can you provide me an example page with http refresh tag included? I'll try comparing behaviour between 2.X and master. Thank you Lewis On Sat, Jun 17, 2017 at 9:25 AM, wrote: > From: Vyacheslav Pascarel > To:

Re: ERROR: Cannot run job worker!

2017-06-21 Thread lewis john mcgibbney
Hi Vyacheslav, Which version of Nutch are you using? 2.x? lewis On Wed, Jun 21, 2017 at 10:32 AM, wrote: > > > From: Vyacheslav Pascarel > To: "user@nutch.apache.org" > Cc: > Bcc: > Date: Wed, 21 Jun 2017

Re: Outlinks field is not populated when page from seed URL when fetched page contains "refresh" meta tag

2017-06-15 Thread lewis john mcgibbney
Hi Vyacheslav, On Thu, Jun 15, 2017 at 1:41 AM, wrote: > > From: Vyacheslav Pascarel > To: "user@nutch.apache.org" > Cc: > Bcc: > Date: Wed, 14 Jun 2017 22:15:49 + > Subject: Outlinks field is not populated

Re: Optimize Nutch Indexing Speed

2017-06-15 Thread lewis john mcgibbney
Hi Dennis, On Thu, Jun 15, 2017 at 1:41 AM, wrote: > > From: Dennis A > To: user@nutch.apache.org > Cc: > Bcc: > Date: Wed, 14 Jun 2017 20:45:35 +0200 > Subject: Re: Optimize Nutch Indexing Speed > Hi Lewis, > thank you for your

Re: Optimize Nutch Indexing Speed

2017-06-14 Thread lewis john mcgibbney
Hi Dennis, On Sun, Jun 11, 2017 at 2:45 AM, wrote: > > From: Dennis A > To: user@nutch.apache.org > Cc: > Bcc: > Date: Fri, 9 Jun 2017 09:59:05 +0200 > Subject: Optimize Nutch Indexing Speed > Hello, > I have recently configured my

Re: Many indexers

2017-06-14 Thread lewis john mcgibbney
Hi Roannel, Markus worked on this quite a bit a while back. Please see https://issues.apache.org/jira/browse/NUTCH-1480. If you were able to pick this back up and update the patch with a pull request I would happily review it, test it and provide feedback. On Wed, Jun 14, 2017 at 7:42 AM,

[ANNOUNCEMENT] Welcome Blackice as new Nutch PMC and Committer

2017-06-14 Thread lewis john mcgibbney
Hi Folks, The Nutch PMC recently VOTE'd in Blackice to formally join our Nutch Project Management Committee and as a Project Committer. Please join me in offering a friendly welcome... not that he needs it. He's been here for quite a while :) @Blackice, feel free to say a bit about yourself if you

RE: What up with 2.3.1 ?

2017-06-05 Thread lewis john mcgibbney
Forwarding with correct thread name. -- Forwarded message -- From: lewis john mcgibbney <lewi...@apache.org> Date: Mon, Jun 5, 2017 at 2:50 PM Subject: Re: user Digest 3 Jun 2017 19:27:20 - Issue 2758 To: "user@nutch.apache.org" <user@nutch.apache.org>

Re: user Digest 3 Jun 2017 19:27:20 -0000 Issue 2758

2017-06-05 Thread lewis john mcgibbney
Hi Ed, Disappointing to hear that this really got under your skin... never nice to hear that frustration becomes the outcome rather than successfully running the software. I've provided comments below On Sat, Jun 3, 2017 at 12:27 PM, wrote: > > From: Edward

Re: user Digest 17 Apr 2017 22:31:08 -0000 Issue 2738

2017-04-17 Thread lewis john mcgibbney
Hi Yongyao, The code in question is found below https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/crawl/Generator.java#L230-L232 A few things come to mind here... * are you sure the entries with a lower score than the minimum threshold were not present before you established

Re: Nutch Plugins Source Control

2017-04-07 Thread lewis john mcgibbney
Hi Folks, Maven build is actually pretty close now. We need to bring the following branch up-to-date with 1.14 then stabilize tests... then it is good to propose as a PR for 1.14-SNAPSHOT. Transferring this work over to 2.x will be much easier than the work done for master branch. I'm over on

[ANNOUNCE] Apache Nutch 1.13 Release

2017-04-02 Thread lewis john mcgibbney
Hello Folks, The Apache Nutch [0] Project Management Committee are pleased to announce the immediate release of Apache Nutch v1.13, we advise all current users and developers of the 1.X series to upgrade to this release. Nutch is a well matured, production ready Web crawler. Nutch 1.x enables

[RESULT] WAS Re: [VOTE] Release Apache Nutch 1.13 RC#1

2017-04-02 Thread lewis john mcgibbney
Hi Folks, Thank you to everyone who was able to review the RC and VOTE, greatly appreciated. 72 hours have come and gone, please see below for RESULT's. [9] +1 Release this package as Apache Nutch 1.13. Lewis John McGibbney * Julien Nioche * Kevin Ratnasekera Chris A. Mattmann * Furkan KAMACI * Matei

Re: How does scoring chain work

2017-03-29 Thread lewis john mcgibbney
Hi Yongyao, In addition to Seb's response, please also check out the 'scoring.filter.order' property in nutch-site.xml https://github.com/apache/nutch/blob/master/conf/nutch-default.xml#L1429-L1437 This will determine the order and provide you with more control over complex scoring logic. Lewis

[VOTE] Release Apache Nutch 1.13 RC#1

2017-03-28 Thread lewis john mcgibbney
Hi Folks, A first candidate for the Nutch 1.13 release is available at: https://dist.apache.org/repos/dist/dev/nutch/1.13/ The release candidate is a zip and tar.gz archive of the binary and sources in: https://github.com/apache/nutch/tree/release-1.13 The SHA1 checksum of the archive is

Re: How to configure Apache gora to take only ol as column family ?

2017-03-16 Thread lewis john mcgibbney
Hi suyash, This issue can be addressed by essentially, commenting OUT all of the instances where the WebPage [0] object is augmented within each job (and possibly plugin). An example would be as follows https://github.com/apache/nutch/blob/2.x/src/java/org/apache/nutch/parse/ParseUtil.java#L358

RE: Nutch2 - What are exactly the steps to execute?

2016-11-21 Thread lewis john mcgibbney
Hi Daniele, In short, if I were you I would look into using the readdb resource https://wiki.apache.org/nutch/bin/nutch%20readdb This will enable you to take a peek into your MongoDB table and find out which documents are present. By the looks of it from your Gist nothing is being fetched and

Re: indexing to Solr

2016-11-21 Thread lewis john mcgibbney
Hi Michael, On Sat, Nov 19, 2016 at 8:09 AM, wrote: > From: Michael Coffey > To: "user@nutch.apache.org" > Cc: > Date: Fri, 18 Nov 2016 21:15:14 + (UTC) > Subject: indexing to Solr > Where can I find

Re: how to insert nutch into ambari ecosystem ?

2016-11-15 Thread lewis john mcgibbney
Hi Eyeris, Replies inline On Fri, Oct 28, 2016 at 8:51 PM, wrote: > From: Eyeris Rodriguez Rueda > To: user@nutch.apache.org > Cc: > Date: Fri, 28 Oct 2016 09:43:59 -0400 (CDT) > Subject: how to insert nutch into ambari ecosystem ? > Hi all. >

Re: user Digest 7 Nov 2016 19:53:09 -0000 Issue 2672

2016-11-15 Thread lewis john mcgibbney
Hi Eyeris, I've just tried Nutch master branch to parse outlinks from a number of RSS Feeds, an example being 'http://www.jpl.nasa.gov/blog/feed/'. This works perfectly with both the feed and parse-tika plugins. Outlinks are extracted accordingly. Can you provide an example of the RSS Feeds you

Re: Nutch 2.3.1 REST calls to DB

2016-11-15 Thread lewis john mcgibbney
Hi Vladimir, Responses inline On Thu, Nov 10, 2016 at 1:05 AM, wrote: > From: Vladimir Loubenski > To: "user@nutch.apache.org" > Cc: > Date: Tue, 8 Nov 2016 17:53:59 + > Subject: Nutch 2.3.1 REST calls to DB

Re: How can I Score?

2016-11-15 Thread lewis john mcgibbney
Hi Michael, Replies inline On Sat, Nov 12, 2016 at 7:10 PM, wrote: > From: Michael Coffey > To: "user@nutch.apache.org" > Cc: > Date: Sun, 13 Nov 2016 03:07:16 + (UTC) > Subject: How can I Score? > When

Re: Nutch 2.3.1 elasticsearch tstamp

2016-10-21 Thread lewis john mcgibbney
Hi Joe, On Fri, Oct 21, 2016 at 7:34 AM, wrote: > From: Joe Adams > To: user@nutch.apache.org > Cc: > Date: Fri, 21 Oct 2016 10:34:15 -0400 > Subject: Nutch 2.3.1 elasticsearch tstamp > I'm working on setting up nutch with elasticsearch

Re: I think my hbase is broken

2016-10-21 Thread lewis john mcgibbney
Hi Tom, Please post your entire Nutch log for the inject and generate phase if possible. It is near impossible to debug given the information you've provided. Thanks On Fri, Oct 21, 2016 at 7:34 AM, wrote: > From: Tom Chiverton > To:

Re: Nutch 2, Solr 5 - solrdedup causes ClassCastException:

2016-10-20 Thread lewis john mcgibbney
Hi Tom, This looks like it has been frustrating for you so I've provided a walk through of how I can set up a core using current Nutch 2.X schema.xml On Mon, Oct 17, 2016 at 9:27 AM, wrote: > > From: Tom Chiverton > To:

Re: Nutch in production

2016-10-18 Thread lewis john mcgibbney
Hi Sachin, Answering both of your questions here as I am catching up with some mail. On Fri, Sep 30, 2016 at 5:04 AM, wrote: > > From: Sachin Shaju > To: user@nutch.apache.org > Cc: > Date: Fri, 30 Sep 2016 10:00:04 +0530 > Subject: Re:
