Re: Data Extraction from 100+ different sites...

2013-06-12 Thread Julien Nioche
What I usually do in cases like these is to propagate an identifier from the seeds and use that in the HTMLParsers to determine whether they should process a page. See url-meta plugin for the config to propagate a metadatum from the seeds. This way you don't need to act based on URL patterns but
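
A rough sketch of the idea in this message, not the url-meta plugin itself: it assumes each seed line is a URL followed by tab-separated key=value pairs, and that a per-site parser checks one propagated key before doing any work. The key name `siteId`, the class name, and both method names are hypothetical and chosen only for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class SeedMetadataSketch {

    /** Parses "url<TAB>key=value<TAB>key=value" and returns only the key=value pairs. */
    static Map<String, String> parseMetadata(String seedLine) {
        String[] parts = seedLine.trim().split("\t");
        Map<String, String> metadata = new LinkedHashMap<>();
        // parts[0] is the URL itself; everything after it is treated as metadata.
        for (int i = 1; i < parts.length; i++) {
            int eq = parts[i].indexOf('=');
            if (eq > 0) {
                metadata.put(parts[i].substring(0, eq).trim(),
                             parts[i].substring(eq + 1).trim());
            }
        }
        return metadata;
    }

    /** A site-specific parser would call this instead of matching URL patterns. */
    static boolean shouldProcess(Map<String, String> pageMetadata, String expectedSiteId) {
        // "siteId" is a hypothetical key propagated from the seed to every page found from it.
        return expectedSiteId.equals(pageMetadata.get("siteId"));
    }

    public static void main(String[] args) {
        Map<String, String> meta = parseMetadata("http://example.com/\tsiteId=shopA");
        System.out.println(meta);
        System.out.println(shouldProcess(meta, "shopA"));
        System.out.println(shouldProcess(meta, "shopB"));
    }
}
```

The point of the design is that the decision depends on where a page came from rather than on what its URL looks like, so adding a new site means adding a seed with a new identifier, not another URL pattern.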

Running Nutch standalone (without Solr)

2013-06-12 Thread Peter Gaines
Hi there, Is it possible to run Nutch as a standalone crawler without integration with Solr? I need to do this in order to do a performance comparison of its raw crawling functionality. It seems like it may be possible using the bin/nutch crawl command but this is now deprecated. Is there

RE: Running Nutch standalone (without Solr)

2013-06-12 Thread Markus Jelsma
Hi, Sure, you don't need to index the data and can use the individual commands or the new bin/crawl script. Cheers -Original message- From:Peter Gaines pgai...@deveire.com Sent: Wed 12-Jun-2013 13:57 To: user@nutch.apache.org Subject: Running Nutch standalone (without Solr)

Re: Running Nutch standalone (without Solr)

2013-06-12 Thread H. Coskun Gunduz
Hi Peter, Yes, it's possible. You'll need a data store (my personal recommendation is HBase). Depending on the Nutch version you use, you can follow these tutorials: Nutch 1.x: http://wiki.apache.org/nutch/NutchTutorial Nutch 2: http://wiki.apache.org/nutch/Nutch2Tutorial Happy crawling.

Suffix URLFilter not working

2013-06-12 Thread Bai Shen
I'm dealing with a lot of file types that I don't want to index. I was originally using the regex filter to exclude them but it was getting out of hand. I changed my plugin includes from urlfilter-regex to urlfilter-(regex|suffix) I've tried using both the default urlfilter-suffix.txt file
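
To illustrate why a value such as urlfilter-(regex|suffix) enables both filters, here is a minimal sketch that assumes the plugin includes setting is treated as a regular expression that must match a whole plugin id. The class name, the method name, and the example property value are hypothetical, and this is not Nutch's own plugin-loading code.

```java
import java.util.regex.Pattern;

class PluginIncludesSketch {

    /** Returns true when the entire plugin id matches the includes expression. */
    static boolean isIncluded(String includesRegex, String pluginId) {
        return Pattern.compile(includesRegex).matcher(pluginId).matches();
    }

    public static void main(String[] args) {
        // Hypothetical value; a real configuration lists many more plugins.
        String includes = "protocol-http|urlfilter-(regex|suffix)|parse-(html|tika)";
        for (String id : new String[] {"urlfilter-regex", "urlfilter-suffix", "urlfilter-prefix"}) {
            System.out.println(id + " included: " + isIncluded(includes, id));
        }
    }
}
```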

Re: Suffix URLFilter not working

2013-06-12 Thread Bai Shen
Sorry. I forgot to mention that I'm running a 2.x release taken from a few weeks ago. On Wed, Jun 12, 2013 at 8:31 AM, Bai Shen baishen.li...@gmail.com wrote: I'm dealing with a lot of file types that I don't want to index. I was originally using the regex filter to exclude them but it was

RE: Suffix URLFilter not working

2013-06-12 Thread Markus Jelsma
We happily use that filter just as it is shipped with Nutch. Just enabling it in plugin.includes works for us. To ease testing you can use the bin/nutch org.apache.nutch.net.URLFilterChecker to test filters. -Original message- From:Bai Shen baishen.li...@gmail.com Sent: Wed

HTMLParseFilter equivalent in Nutch 2.2 ???

2013-06-12 Thread Tony Mullins
Hi, If I go to http://wiki.apache.org/nutch/AboutPlugins, it shows me that HTMLParseFilter is the extension point for adding custom metadata to HTML, and its 'filter' method's signature is 'public ParseResult filter(Content content, ParseResult parseResult, HTMLMetaTags metaTags, DocumentFragment

Installing Nutch1.6 on Windows7

2013-06-12 Thread Andrea Lanzoni
Hi everyone, I am a newcomer to Nutch and Solr and, after studying the literature available on the web, I tried to install them on Windows 7. I have not been able to follow the few instructions on the Apache wiki site, nor could I find a guide updated to Nutch 1.6, but only for older versions I tried

RE: HTMLParseFilter equivalent in Nutch 2.2 ???

2013-06-12 Thread Markus Jelsma
I think for Nutch 2.x HTMLParseFilter was renamed to ParseFilter. This is not true for 1.x, see NUTCH-1482. https://issues.apache.org/jira/browse/NUTCH-1482 -Original message- From:Tony Mullins tonymullins...@gmail.com Sent: Wed 12-Jun-2013 14:37 To: user@nutch.apache.org

Re: HTMLParseFilter equivalent in Nutch 2.2 ???

2013-06-12 Thread Julien Nioche
They are called ParseFilters in 2.x : http://nutch.apache.org/apidocs-2.2/org/apache/nutch/parse/ParseFilter.html as they are not limited to processing HTML documents since Tika generates SAX events for other mimetypes J. On 12 June 2013 13:37, Tony Mullins tonymullins...@gmail.com wrote: Hi

Re: HTMLParseFilter equivalent in Nutch 2.2 ???

2013-06-12 Thread Tony Mullins
Thanks guys for the quick response. If you could point me to any working example of ParseFilter and/or IndexFilter, that would be great. Regards, Tony On Wed, Jun 12, 2013 at 5:46 PM, Julien Nioche lists.digitalpeb...@gmail.com wrote: They are called ParseFilters in 2.x :

Re: Suffix URLFilter not working

2013-06-12 Thread Bai Shen
I figured as much, which is why I'm not sure why it's not working for me. I ran bin/nutch org.apache.nutch.net.URLFilterChecker http://myserver/myurl and it's been thirty minutes with no results. Is there something I should run before running that? Thanks. On Wed, Jun 12, 2013 at 8:34 AM,

Re: Suffix URLFilter not working

2013-06-12 Thread Bai Shen
Doh! I really should just read the code of things before posting. I ran the URLFilterChecker and passed it in a url that the SuffixFilter should flag and it still passed it. However, if I change the url to end in a format that is in the default config file, it rejects the url. So it looks like

Re: Suffix URLFilter not working

2013-06-12 Thread Bai Shen
Turns out it was because I had a copy of the default file sitting in the directory I was calling nutch from. Once I removed that it correctly found my copy in the conf directory. On Wed, Jun 12, 2013 at 9:29 AM, Bai Shen baishen.li...@gmail.com wrote: Doh! I really should just read the code
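
For readers unfamiliar with the behaviour discussed in this thread, the following is a minimal sketch of suffix-based URL rejection. It is not the code of the suffix filter shipped with Nutch: the class name and the suffix list are hypothetical, and the sketch only borrows the URL filter convention of returning the URL when it is accepted and null when it is rejected.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

class SuffixFilterSketch {

    private final Set<String> rejectedSuffixes = new HashSet<>();

    /** Suffixes are compared case-insensitively, e.g. ".pdf" also rejects ".PDF". */
    SuffixFilterSketch(String... suffixes) {
        for (String suffix : suffixes) {
            rejectedSuffixes.add(suffix.toLowerCase(Locale.ROOT));
        }
    }

    /** Returns the URL unchanged if accepted, or null if its path ends in a rejected suffix. */
    String filter(String url) {
        // Ignore the query string and fragment so "file.zip?x=1" is still seen as ".zip".
        int cut = url.length();
        int query = url.indexOf('?');
        if (query >= 0) cut = Math.min(cut, query);
        int fragment = url.indexOf('#');
        if (fragment >= 0) cut = Math.min(cut, fragment);
        String path = url.substring(0, cut).toLowerCase(Locale.ROOT);
        for (String suffix : rejectedSuffixes) {
            if (path.endsWith(suffix)) {
                return null;
            }
        }
        return url;
    }

    public static void main(String[] args) {
        SuffixFilterSketch filter = new SuffixFilterSketch(".pdf", ".zip");
        System.out.println(filter.filter("http://myserver/report.pdf"));
        System.out.println(filter.filter("http://myserver/index.html"));
    }
}
```

A filter like this only behaves as expected when it reads the suffix list you intended, which is the problem reported in this message: a stray copy of the default file was found before the edited one in the conf directory.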

Re: Suffix URLFilter not working

2013-06-12 Thread Peter Gaines
I have installed version 2.2 of Nutch on a CentOS machine and am using the following command: ./bin/crawl urls testcrawl solrfolder 2 I have attempted to use the default filter configuration and also explicitly specified urlfilter-regex in the nutch-default.xml (without modifying the

IndexWriter Plugin Workflow

2013-06-12 Thread AC Nutch
Hi, I'm writing a custom IndexWriter and I had some questions on the execution workflow. I notice that when I run my index writer plugin the following happens: - the describe String is printed - the .open method is called once - the .write method is called for every NutchDocument - the .close

PluginRuntimeException ClassNotFound for ParseFilter plugin in Nutch 2.2 ?

2013-06-12 Thread Tony Mullins
Hi, I am trying a simple ParseFilter plugin in Nutch 2.2. I can build it and also src/plugin/build.xml successfully. But its .jar file is not being created in my runtime/local/plugins/myplugin directory. And on running bin/nutch parsechecker http://www.google.nl I get this error

Re: IndexWriter Plugin Workflow

2013-06-12 Thread Sebastian Nagel
Hi, I'm writing a custom IndexWriter and I had some questions on the execution workflow. Have a look at NUTCH-1527 and NUTCH-1541. I notice that when I run my index writer plugin the following happens: - the describe String is printed - the .open method is called once - the .write

Re: PluginRuntimeException ClassNotFound for ParseFilter plugin in Nutch 2.2 ?

2013-06-12 Thread Sebastian Nagel
Hi Tony, you have to register your plugin in src/plugin/build.xml Does your src/plugin/myplugin/plugin.xml properly propagate jar file, extension point and implementing class? And, finally, you have to add your plugin to the property plugin.includes in nutch-site.xml Cheers, Sebastian On

Re: Suffix URLFilter not working

2013-06-12 Thread Sebastian Nagel
Hi Peter, please do not hijack threads. Seed URLs must be fully specified including protocol, e.g.: http://nutch.apache.org/ but not apache.org Sebastian On 06/12/2013 05:08 PM, Peter Gaines wrote: I have installed version 2.2 of Nutch on a CentOS machine and am using the following
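
The rule about fully specified seed URLs can be checked before a crawl with a few lines of code. This is a minimal sketch using only java.net.URI from the standard library. The class and method names are hypothetical and it is not part of Nutch.

```java
import java.net.URI;
import java.net.URISyntaxException;

class SeedUrlCheckSketch {

    /** Returns true only when the seed has both a protocol (scheme) and a host. */
    static boolean isFullySpecified(String seed) {
        try {
            URI uri = new URI(seed.trim());
            return uri.getScheme() != null && uri.getHost() != null;
        } catch (URISyntaxException e) {
            // Anything that cannot be parsed as a URI is not a usable seed.
            return false;
        }
    }

    public static void main(String[] args) {
        for (String seed : new String[] {"http://nutch.apache.org/", "apache.org"}) {
            System.out.println(seed + " fully specified: " + isFullySpecified(seed));
        }
    }
}
```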

Re: IndexWriter Plugin Workflow

2013-06-12 Thread AC Nutch
Ah that makes a lot of sense! I will go ahead and open a Jira issue. Thanks for the reply! Alex On Wed, Jun 12, 2013 at 3:50 PM, Sebastian Nagel wastl.na...@googlemail.com wrote: Hi, I'm writing a custom IndexWriter and I had some questions on the execution workflow. Have a look at

Re: PluginRuntimeException ClassNotFound for ParseFilter plugin in Nutch 2.2 ?

2013-06-12 Thread Tejas Patil
Here is the relevant wiki page: http://wiki.apache.org/nutch/WritingPluginExample Although it's old, I think that it will help. On Wed, Jun 12, 2013 at 1:01 PM, Sebastian Nagel wastl.na...@googlemail.com wrote: Hi Tony, you have to register your plugin in src/plugin/build.xml Does your