Hi all,
Compiled from the sources (JDK 11) and ran a small crawl and indexing (to
Solr); both passed with flying colors.
That's a +1 from me. Great work Sebastian!
On Mon, Aug 22, 2022 at 5:30 PM Sebastian Nagel wrote:
> Hi Folks,
>
> A first candidate for the Nutch 1.19 release is available
a good improvement
> for Nutch.
>
> Regards, Roannel
>
> ----- Original Message -----
> > From: "Jorge Betancourt"
> > To: "user"
> > Sent: Monday, September 16, 2019 13:14:36
> > Subject: [MASSMAIL]Re: Injection from webservice
>
>
Hi Roannel,
The current implementation of the injector only accepts a path (actually an
org.apache.hadoop.fs.Path). This means that there is no way to feed a URL
directly unless you download the content first.
If you use the REST API you can send the seed file using the API endpoint.
Otherwise,
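A minimal sketch of the "download the content first" approach mentioned above, assuming the web service returns a plain-text list with one URL per line. The function name `fetch_seed_file`, the default file name, and the directory layout are hypothetical and not part of Nutch:

```python
import shutil
import urllib.request
from pathlib import Path


def fetch_seed_file(seed_url: str, seed_dir: str, filename: str = "seed.txt") -> Path:
    """Download a seed list from seed_url into seed_dir and return the local path.

    Assumption: the service responds with plain text, one URL per line.
    """
    target_dir = Path(seed_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / filename
    # Stream the response to disk so a large seed list is not held in memory.
    with urllib.request.urlopen(seed_url) as response, open(target, "wb") as out:
        shutil.copyfileobj(response, out)
    return target
```

The resulting directory can then be handed to the injector as a path, for example `bin/nutch inject crawl/crawldb seeds/` (both paths are illustrative).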
ongPointField.java:154)
> at org.apache.solr.schema.PointField.createFields(PointField.java:250)
> at org.apache.solr.update.DocumentBuilder.addField(DocumentBuilder.java:66)
> at
> org.apache.solr.update.DocumentBuilder.toDocument(DocumentBuilder.java:159)
>
> It seems to be treat
Hi Dave,
Can you check the Solr logs and post the relevant exception? It would also
be helpful if you attached the definition of the text field in your Solr
collection.
Best regards,
Jorge
On Tue, Mar 26, 2019 at 9:41 PM Dave Beckstrom
wrote:
> Hi Everyone,
>
> This is probably more of a SOLR
Hi Hany,
As BlackIce said, there is an open issue at
https://issues.apache.org/jira/browse/NUTCH-585, specifically the
blacklist_whitelist_plugin patch. I'm not sure (probably not) that the
patch can be applied directly to master, but it should provide a good general
idea on how to write a custom
Hi Musshorn,
You can take a look at http://nutch.apache.org/mailing_lists.html to see how
to unsubscribe from the mailing list. Send an email to
user-unsubscr...@nutch.apache.org.
Best Regards,
Jorge
On Fri, Sep 28, 2018 at 1:24 PM Musshorn, Kris T CTR USARMY CECOM (US) <
If I understand correctly, what you want is to index/store the URL where
the PDF link was found, right? We don't track the name of the website (by
default), but you could do this (sort of) using the index-links plugin (
https://github.com/apache/nutch/tree/master/src/plugin/index-links).
This will
Welcome on board Roannel! Great to have you here!
Best Regards,
On Wed, Jun 27, 2018 at 9:17 AM Semyon Semyonov
wrote:
> Hi Roannel,
>
> Congratulations and good luck!
>
> Semyon.
>
>
> Sent: Wednesday, June 27, 2018 at 3:42 AM
> From: "Roannel Fernández Hernández"
> To: user@nutch.apache.org
Welcome on board Omkar!
Best regards,
Jorge
Usually we tend to develop everything inside the Nutch file structure,
which is especially useful if you need to deploy to a Hadoop cluster later on
(because you need to bundle everything in a job file).
But if you really want to develop the plugin in isolation, you only need to
create a new project in
Is there any reason why writing a `HtmlParseFilter` would not be enough?
The HTML parser will execute its own logic and provide a DOM representation
to all the filters and you can extract your own data from the DOM tree.
At the moment individual parsers are matched by mimetype (see
Great news!
Thanks Sebastian!
Best regards,
Jorge
On Dec 25, 2017, 4:20 PM -0500, Hasan Diwan wrote:
> Congrats to all involved!
>
> On 25 December 2017 at 13:14, Markus Jelsma wrote:
>
> > Thanks Sebastian!
> >
> >
> >
> > -Original
Hello Rushikesh,
Are you using Nutch 1.3 or Nutch 1.13? If you're using Nutch 1.13, then you
could use the Tika boilerpipe implementation. To enable this feature, set
the following in nutch-site.xml:
<property>
  <name>tika.extractor</name>
  <value>boilerpipe</value>
  <description>Which text extraction algorithm to use. Valid values are:
If you only want to avoid **indexing** old documents, you could write your
own `IndexingFilter` that checks your condition and skips indexing those
documents. You don't mention your Nutch version, but assuming that
you're using v1, we have a new PR (https://github.com/apache/nutch/pull/219)
From the logs, it looks like the error is coming from the Solr side. Do you
mind checking/sharing the logs on your Solr server? Can you pinpoint which
URL is causing the issue?
Best Regards, Jorge
On Tue, Aug 29, 2017 at 9:25 PM, Michael Coffey
wrote:
Does anybody
Hi Zoltán,
You can take a look at [1]; there you can find some documentation.
Although it says it was last updated for version 1.8, we do not change the
extension points that often. You can also take a look at the code [2]
related to the plugin subsystem. It is true that the documentation is not
>
> If you don't mind, can you point me to a specific plugin that does something
> similar?
>
> On Thu, Jun 29, 2017 at 8:39 AM, Jorge Betancourt <
> betancourt.jo...@gmail.com> wrote:
>
> > Hi Dave,
> >
> > My advice would be to leave your resources out
Hi Dave,
My advice would be to leave your resources out of the plugins. If there is
a configuration file (or additional files), just load what you need from
the conf directory; if the dictionary files can change, just make them
configurable in nutch-site.xml.
Best Regards,
Jorge
PS: You can