What I usually do in cases like these is to propagate an identifier from
the seeds and use that in the HTMLParsers to determine whether they should
process a page. See url-meta plugin for the config to propagate a metadatum
from the seeds. This way you don't need to act based on URL patterns but
Hi There,
Is it possible to run Nutch as a standalone crawler without integration with
Solr?
I need to do this in order to do a performance comparison of its raw crawling
functionality.
It seems like it may be possible using the bin/nutch crawl command but this is
now deprecated.
Is there
Hi,
Sure, you don't need to index the data and can use the individual commands or
the new bin/crawl script.
Cheers
-Original message-
From:Peter Gaines pgai...@deveire.com
Sent: Wed 12-Jun-2013 13:57
To: user@nutch.apache.org
Subject: Running Nutch standalone (without Solr)
Hi Peter,
Yes, it's possible.
You'll need a data store (my personal recommendation is HBase).
Depending on the Nutch version you use, you can follow these tutorials:
Nutch 1.x: http://wiki.apache.org/nutch/NutchTutorial
Nutch 2: http://wiki.apache.org/nutch/Nutch2Tutorial
Happy crawling.
I'm dealing with a lot of file types that I don't want to index. I was
originally using the regex filter to exclude them but it was getting out of
hand.
I changed my plugin includes from
urlfilter-regex
to
urlfilter-(regex|suffix)
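As a rough illustration of why that change enables both filters: plugin.includes is a regular expression applied to plugin ids. The sketch below assumes the expression has to match the whole id, and the class and method names (PluginIncludesDemo, isIncluded) are made up for this example, not part of Nutch.

```java
import java.util.regex.Pattern;

final class PluginIncludesDemo {
    // Assumption: a plugin is enabled when the plugin.includes expression
    // matches its id in full. This mirrors the idea, not Nutch's own code.
    static boolean isIncluded(String pluginIncludes, String pluginId) {
        return Pattern.compile(pluginIncludes).matcher(pluginId).matches();
    }

    public static void main(String[] args) {
        String includes = "urlfilter-(regex|suffix)";
        String[] ids = {"urlfilter-regex", "urlfilter-suffix", "urlfilter-prefix"};
        for (String id : ids) {
            // Print whether each plugin id would be picked up by the expression.
            System.out.println(id + " -> " + isIncluded(includes, id));
        }
    }
}
```

A real plugin.includes value normally lists many more plugins; only the url filter part is shown here.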
I've tried using both the default urlfilter-suffix.txt file
Sorry. I forgot to mention that I'm running a 2.x release taken from a few
weeks ago.
On Wed, Jun 12, 2013 at 8:31 AM, Bai Shen baishen.li...@gmail.com wrote:
I'm dealing with a lot of file types that I don't want to index. I was
originally using the regex filter to exclude them but it was
We happily use that filter just as it is shipped with Nutch. Just enabling it
in plugin.includes works for us. To ease testing you can use the bin/nutch
org.apache.nutch.net.URLFilterChecker to test filters.
-Original message-
From:Bai Shen baishen.li...@gmail.com
Sent: Wed
Hi,
If I go to http://wiki.apache.org/nutch/AboutPlugins, it shows that
HTMLParseFilter is the extension point for adding custom metadata to HTML and
its 'filter' method's signature is 'public ParseResult filter(Content
content, ParseResult parseResult, HTMLMetaTags metaTags, DocumentFragment
Hi everyone, I am a newcomer to Nutch and Solr and, after studying
literature available on web, I tried to install them on _Windows 7_.
I have not been able to follow the few instructions on the Apache wiki
site, nor could I find a guide updated for Nutch 1.6, only guides for older
versions
I tried
I think that for Nutch 2.x HTMLParseFilter was renamed to ParseFilter. This is
not true for 1.x, see NUTCH-1482.
https://issues.apache.org/jira/browse/NUTCH-1482
-Original message-
From:Tony Mullins tonymullins...@gmail.com
Sent: Wed 12-Jun-2013 14:37
To: user@nutch.apache.org
They are called ParseFilters in 2.x :
http://nutch.apache.org/apidocs-2.2/org/apache/nutch/parse/ParseFilter.html
as they are not limited to processing HTML documents since Tika generates
SAX events for other mimetypes
J.
On 12 June 2013 13:37, Tony Mullins tonymullins...@gmail.com wrote:
Hi
Thanks, guys, for the quick response.
If you could point me to any working example of a ParseFilter and/or
IndexFilter, that would be great.
Regards,
Tony
On Wed, Jun 12, 2013 at 5:46 PM, Julien Nioche
lists.digitalpeb...@gmail.com wrote:
They are called ParseFilters in 2.x :
I figured as much, which is why I'm not sure why it's not working for me.
I ran bin/nutch org.apache.nutch.net.URLFilterChecker
http://myserver/myurl and it's been thirty minutes with no results.
Is there something I should run before running that?
Thanks.
On Wed, Jun 12, 2013 at 8:34 AM,
Doh! I really should just read the code of things before posting.
I ran the URLFilterChecker and passed it in a url that the SuffixFilter
should flag and it still passed it. However, if I change the url to end in
a format that is in the default config file, it rejects the url.
So it looks like
Turns out it was because I had a copy of the default file sitting in the
directory I was calling nutch from.
Once I removed that it correctly found my copy in the conf directory.
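For anyone hitting the same thing, a small diagnostic along these lines can show which copy of a config file gets resolved. It is only a sketch: it assumes Nutch looks the file up through the classpath, it has to be run with the same classpath as the nutch script to be meaningful, and the class name ConfResourceLocator is made up for this example.

```java
import java.net.URL;

final class ConfResourceLocator {
    // Returns the location a classpath lookup resolves for the given name,
    // or null when no copy is visible to the class loader.
    static URL locate(String name) {
        ClassLoader cl = Thread.currentThread().getContextClassLoader();
        if (cl == null) {
            cl = ClassLoader.getSystemClassLoader();
        }
        return cl.getResource(name);
    }

    public static void main(String[] args) {
        // Default to the suffix filter file discussed in this thread.
        String name = args.length > 0 ? args[0] : "urlfilter-suffix.txt";
        URL url = locate(name);
        System.out.println(url == null
                ? name + " not found on classpath"
                : name + " -> " + url);
    }
}
```

If the printed location is not the conf directory, a stray copy is being picked up first.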
On Wed, Jun 12, 2013 at 9:29 AM, Bai Shen baishen.li...@gmail.com wrote:
Doh! I really should just read the code
I have installed version 2.2 of nutch on a CentOS machine and am using the
following command:
./bin/crawl urls testcrawl solrfolder 2
I have attempted to use the default filter configuration and also explicitly
specified urlfilter-regex
in the nutch-default.xml (without modifying the
Hi,
I'm writing a custom IndexWriter and I had some questions on the execution
workflow.
I notice that when I run my index writer plugin the following happens:
- the describe String is printed
- the .open method is called once
- the .write method is called for every NutchDocument
- the .close
Hi,
I am trying simple ParseFilter plugin in Nutch 2.2. And I can build it and
also the src/plugin/build.xml successfully. But its .jar file is not being
created in my runtime/local/plugins/myplugin directory.
And on running
bin/nutch parsechecker http://www.google.nl;
I get this error
Hi,
I'm writing a custom IndexWriter and I had some questions on the execution
workflow.
Have a look at NUTCH-1527 and NUTCH-1541.
I notice that when I run my index writer plugin the following happens:
- the describe String is printed
- the .open method is called once
- the .write
Hi Tony,
you have to register your plugin in
src/plugin/build.xml
Does your
src/plugin/myplugin/plugin.xml
properly propagate jar file,
extension point and implementing class?
And, finally, you have to add your plugin
to the property plugin.includes in nutch-site.xml
Cheers,
Sebastian
On
Hi Peter,
please do not hijack threads.
Seed URLs must be fully specified including protocol, e.g.:
http://nutch.apache.org/
but not
apache.org
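A quick way to check a seed list for this before injecting is sketched below. It uses only java.net.URI; the class name SeedUrlCheck is made up for this example, and it checks a bare URL only (any extra fields on a seed line are not handled).

```java
import java.net.URI;
import java.net.URISyntaxException;

final class SeedUrlCheck {
    // True when the seed parses as a URI with both a scheme (protocol) and a host.
    static boolean hasProtocolAndHost(String seed) {
        try {
            URI uri = new URI(seed.trim());
            return uri.getScheme() != null && uri.getHost() != null;
        } catch (URISyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String[] seeds = {"http://nutch.apache.org/", "apache.org"};
        for (String seed : seeds) {
            // Report which seeds are fully specified.
            System.out.println(seed + " -> " + hasProtocolAndHost(seed));
        }
    }
}
```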
Sebastian
On 06/12/2013 05:08 PM, Peter Gaines wrote:
I have installed version 2.2 of nutch on a CentOS machine and am using the
following
Ah that makes a lot of sense! I will go ahead and open a Jira issue. Thanks
for the reply!
Alex
On Wed, Jun 12, 2013 at 3:50 PM, Sebastian Nagel wastl.na...@googlemail.com
wrote:
Hi,
I'm writing a custom IndexWriter and I had some questions on the
execution
workflow.
Have a look at
Here is the relevant wiki page:
http://wiki.apache.org/nutch/WritingPluginExample
Although it's old, I think that it will help.
On Wed, Jun 12, 2013 at 1:01 PM, Sebastian Nagel wastl.na...@googlemail.com
wrote:
Hi Tony,
you have to register your plugin in
src/plugin/build.xml
Does your