Re: [VOTE] Release Apache Nutch 1.19 RC#1

2022-08-31 Thread Jorge Betancourt
Hi all, I compiled from the sources (JDK 11) and ran a small crawl and indexing job (to Solr); both passed with flying colors. That's a +1 from me. Great work Sebastian! On Mon, Aug 22, 2022 at 5:30 PM Sebastian Nagel wrote: > Hi Folks, > > A first candidate for the Nutch 1.19 release is available

Re: [MASSMAIL]Re: Injection from webservice

2019-09-17 Thread Jorge Betancourt
a good improvement > for Nutch. > > Regards, Roannel > > - Original Message ----- > > From: "Jorge Betancourt" > > To: "user" > > Sent: Lunes, 16 de Septiembre 2019 13:14:36 > > Subject: [MASSMAIL]Re: Injection from webservice > >

Re: Injection from webservice

2019-09-16 Thread Jorge Betancourt
Hi Roannel, The current implementation of the injector only accepts a path (actually an org.apache.hadoop.fs.Path), which means there is no way to feed a URL directly unless you download the content first. If you use the REST API you can send the seed file using the API endpoint. Otherwise,

Re: Nutch failing on SOLR text field

2019-03-27 Thread Jorge Betancourt
ongPointField.java:154) > at org.apache.solr.schema.PointField.createFields(PointField.java:250) > at org.apache.solr.update.DocumentBuilder.addField(DocumentBuilder.java:66) > at > org.apache.solr.update.DocumentBuilder.toDocument(DocumentBuilder.java:159) > > It seems to be treat

Re: Nutch failing on SOLR text field

2019-03-26 Thread Jorge Betancourt
Hi Dave, Can you check the Solr logs and post the relevant exception? It would also be helpful if you attached the definition of the text field in your Solr collection. Best regards, Jorge On Tue, Mar 26, 2019 at 9:41 PM Dave Beckstrom wrote: > Hi Everyone, > > This is probably more of a SOLR

Re: Block certain parts of HTML code from being indexed

2018-11-16 Thread Jorge Betancourt
Hi Hany, As BlackIce said, there is an open issue at https://issues.apache.org/jira/browse/NUTCH-585, specifically the blacklist_whitelist_plugin patch. I'm not sure that the patch can still be applied directly to master (probably not), but it should provide a good general idea of how to write a custom

Re: [Non-DoD Source] Re: Include parent URL in pdf data - nutch

2018-09-28 Thread Jorge Betancourt
Hi Musshorn, You can take a look at http://nutch.apache.org/mailing_lists.html on how to unsubscribe from the mailing list. Send an email to user-unsubscr...@nutch.apache.org. Best Regards, Jorge On Fri, Sep 28, 2018 at 1:24 PM Musshorn, Kris T CTR USARMY CECOM (US) <

Re: Include parent URL in pdf data - nutch

2018-09-28 Thread Jorge Betancourt
If I understand correctly, what you want is to index/store the URL where the PDF link was found, right? We don't track the name of the website (by default). But you could do this (sort of) using the index-links plugin ( https://github.com/apache/nutch/tree/master/src/plugin/index-links). This will

Re: [MASSMAIL][ANNOUNCE] New Nutch committer and PMC -

2018-06-27 Thread Jorge Betancourt
Welcome on board Roannel! Great to have you here! Best Regards, On Wed, Jun 27, 2018 at 9:17 AM Semyon Semyonov wrote: > Hi Roannel, > > Congratulations and good luck! > > Semyon. > > > Sent: Wednesday, June 27, 2018 at 3:42 AM > From: "Roannel Fernández Hernández" > To: user@nutch.apache.org

Re: [ANNOUNCE] New Nutch committer and PMC - Omkar Reddy

2018-06-22 Thread Jorge Betancourt
Welcome on board Omkar! Best regards, Jorge

Re: Having plugin as a separate project

2018-05-04 Thread Jorge Betancourt
Usually we tend to develop everything inside the Nutch file structure, which is especially useful if you need to deploy to a Hadoop cluster later on (because you need to bundle everything in a job file). But if you really want to develop the plugin in isolation, you only need to create a new project in

Re: RE: Dependency between plugins

2018-03-14 Thread Jorge Betancourt
Is there any reason why writing a `HtmlParseFilter` would not be enough? The HTML parser will execute its own logic and provide a DOM representation to all the filters and you can extract your own data from the DOM tree. At the moment individual parsers are matched by mimetype (see

Re: [ANNOUNCE] Apache Nutch 1.14 Release

2017-12-25 Thread Jorge Betancourt
Great news! Thanks Sebastian! Best regards, Jorge On Dec 25, 2017, 4:20 PM -0500, Hasan Diwan , wrote: > Congrats to all involved! > > On 25 December 2017 at 13:14, Markus Jelsma wrote: > > > Thanks Sebastian! > > > > > > > > -Original

Re: Removing header,Footer and left menus while crawling

2017-11-14 Thread Jorge Betancourt
Hello Rushikesh, Are you using Nutch 1.3 or Nutch 1.13? If you're using Nutch 1.13, then you could use the Tika boilerpipe implementation. In nutch-site.xml you need to enable this feature with: tika.extractor boilerpipe Which text extraction algorithm to use. Valid values are:
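
For reference, the property named in this reply would normally be written in nutch-site.xml in the usual property form. The fragment below is a minimal sketch rebuilt from the name, value, and description quoted in the message; the enclosing `<configuration>` element is assumed from the standard Nutch configuration layout, and the list of valid values is left out because the quoted text is cut off.

```xml
<configuration>
  <!-- Sketch only: name, value and description are taken from the reply above. -->
  <!-- The <configuration> wrapper is an assumption about the surrounding file. -->
  <property>
    <name>tika.extractor</name>
    <value>boilerpipe</value>
    <description>Which text extraction algorithm to use.</description>
  </property>
</configuration>
```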

Re: Index URL's based on a condition

2017-09-26 Thread Jorge Betancourt
If you only want to avoid **indexing** old documents you could write your own `IndexingFilter` that will check your condition and avoid the indexing of the documents. You don't mention your Nutch version, but assuming that you're using v1 we have a new PR (https://github.com/apache/nutch/pull/219)

Re: invalid utf8 chars when indexing or cleaning

2017-08-29 Thread Jorge Betancourt
From the logs it looks like the error is coming from the Solr side. Do you mind checking/sharing the logs on your Solr server? Can you pinpoint which URL is causing the issue? Best Regards, Jorge On Tue, Aug 29, 2017 at 9:25 PM, Michael Coffey wrote: Does anybody

Re: After Parse extension point

2017-07-27 Thread Jorge Betancourt
Hi Zoltán, You can take a look at [1], where you can find some documentation. Although it says that it was updated for version 1.8, we do not change the extension points that often. You can also take a look at the code [2] related to the plugin subsystem. It is true that the documentation is not

Re: Custom Plugin Resources Files

2017-06-29 Thread Jorge Betancourt
> > If you don't mind can you point me to a specific plugin that does something > similar? > > On Thu, Jun 29, 2017 at 8:39 AM, Jorge Betancourt < > betancourt.jo...@gmail.com> wrote: > > > Hi Dave, > > > > My advice would be to leave your resources out

Re: Custom Plugin Resources Files

2017-06-29 Thread Jorge Betancourt
Hi Dave, My advice would be to leave your resources out of the plugins. If there is a configuration file (or additional files), just load what you need from the conf directory; if the files' directory can change, just make it configurable in nutch-site.xml. Best Regards, Jorge PS: You can