URLFilter Plugin ClassNotFoundException

2009-03-09 Thread MyD
Hi @ all, I'd like to write a URLFilter plugin. When I start the crawling process, a ClassNotFoundException is thrown. Below you will find my code and settings. It would be great if you could help me further. Thanks in advance. === $NUTCH_HOME/conf/nutch-site.xml === property

Pulling out URLs

2009-03-11 Thread MyD
Hi @ all, I started to write my own plugin. I extended the HtmlParserFilter to grab outlinks to other pages, but it looks like the outlinks are just links to CSS or JS files, or am I wrong? What is the best way to extract all outlinks to URLs that are not in the domain MY.DOMAIN.NAME? You

Re: Pulling out URLs

2009-03-12 Thread MyD
Thank you for the hint. How can this be done with the Segment Reader (Nutch 0.9 API)? Thanks in advance. Cheers, MyD vishal vachhani wrote: A simple solution would be to dump the segments using the following command and then write a script that extracts the outlinks present in the documents
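The "dump, then extract with a script" idea in the quoted reply could be sketched roughly as follows. The dump command itself is not shown in this excerpt, so the sketch starts from text that has already been dumped. It assumes each outlink appears on a line shaped like `outlink: toUrl: <url> anchor: <text>`; that shape is an assumption and should be checked against a real dump. `OutlinkExtractor` and `extract` are hypothetical names and not part of Nutch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not a Nutch class.
final class OutlinkExtractor {
    // ASSUMPTION: dumped outlink lines look like
    //   "  outlink: toUrl: http://host/path anchor: some text"
    // Adjust the pattern if the real dump is formatted differently.
    private static final Pattern OUTLINK =
            Pattern.compile("outlink:\\s+toUrl:\\s+(\\S+)");

    // Returns the target URL of every line that matches the assumed shape,
    // in input order. Lines that do not match are skipped.
    static List<String> extract(Iterable<String> lines) {
        List<String> urls = new ArrayList<>();
        for (String line : lines) {
            Matcher m = OUTLINK.matcher(line);
            if (m.find()) {
                urls.add(m.group(1));
            }
        }
        return urls;
    }
}
```

The lines can come from any source, for example `java.nio.file.Files.readAllLines` on the dumped file. The result is a plain list of URL strings, ready for further filtering.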

Limit Nutch Crawl to Seed URLs

2009-03-13 Thread MyD
Hi @ all, is it possible to limit Nutch's crawling process to the seed URLs? E.g. I have 1000 seed URLs and I want to crawl just these domains. Thanks in advance. Regards, MyD -- View this message in context: http://www.nabble.com/Limit-Nutch-Crawl-to-Seed-URLs-tp22493314p22493314.html Sent
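One way to picture the restriction asked about here: collect the hosts of the seed URLs once, then keep a candidate URL only if its host is in that set. The sketch below is plain Java and does not use any Nutch API. `SeedHostFilter` is a hypothetical name, and the "return the URL to keep it, null to drop it" convention is an assumption about how such a check would be wired into a filter plugin.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not a Nutch class.
final class SeedHostFilter {
    private final Set<String> hosts = new HashSet<>();

    // Records the lower-cased host of every seed URL that parses.
    SeedHostFilter(Iterable<String> seedUrls) {
        for (String seed : seedUrls) {
            String host = hostOf(seed);
            if (host != null) {
                hosts.add(host);
            }
        }
    }

    // Returns the lower-cased host, or null if the string is not a
    // parseable URL with a host part.
    static String hostOf(String url) {
        try {
            String host = new URI(url.trim()).getHost();
            return host == null ? null : host.toLowerCase(Locale.ROOT);
        } catch (URISyntaxException e) {
            return null;
        }
    }

    // ASSUMED convention: return the URL unchanged to keep it, null to drop it.
    String filter(String url) {
        String host = hostOf(url);
        return (host != null && hosts.contains(host)) ? url : null;
    }
}
```

Matching is on the exact host, so `www.example.com` and `example.com` count as different hosts. Comparing by registered domain instead would need extra logic that is not shown here.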

Re: Limit Nutch Crawl to Seed URLs

2009-03-13 Thread MyD
Where can I find the domain urlfilter? I'm using the 0.9 branch... Cheers, Markus Dennis Kubes-2 wrote: There is a domain-urlfilter that should help do what you are looking for. Dennis MyD wrote: Hi @ all, is it possible to limit Nutch's crawling process to the seed URLs? E.g. I

synchronized File Writer

2009-03-15 Thread MyD
Thanks in advance. Regards, MyD

Implementing a custom SAX / DOM parser

2009-03-16 Thread MyD
Hi @ all, I'd like to know if it is possible to implement one's own SAX parser for a plugin, and where this could be done, e.g. at which extension point. Thanks in advance. Cheers, MyD

Re: embed nutch crawl in an application

2009-03-17 Thread MyD
This is an interesting question. If you know how to run the Crawl process from another Java program, please let me know. Thanks in advance. n_developer wrote: Generally a Nutch crawl is done through Cygwin. If I don't want to run Cygwin, and I want to crawl an application from an application

Where to put plugin specific parameters / configurations

2009-03-18 Thread MyD
Hi @ all, where can I set parameters / configurations that are specific to my own plugin? Thanks in advance. Regards, MyD

Nutch 1.0 trunk Fetch Schedule

2009-03-18 Thread MyD
Hi @ all, is it possible to set the next fetch schedule for a URL in another crawl dir? Example: crawl.dir.A retrieves links and sets the fetch schedule, but the schedule should go into crawl.dir.B. Thanks in advance. Regards, MyD

Re: Nutch 1.0 trunk Fetch Schedule

2009-03-18 Thread MyD
Hi ripper, Thanks, do you know how to do it in Java? I tried, but haven't found suitable classes. Thanks in advance. Cheers, MyD ripper07 wrote: Well, you can always write a bash script or a Java class that does this. Writing a Java class is probably better and easier. You have

Nutch doesn't find all urls.. Any suggestion?

2009-03-19 Thread MyD
nutch-site.xml: <property> <name>plugin.includes</name> <value>my-plugin|protocol-http|parse-(html|js)|index-basic</value> <description></description> </property> I commented out all urlfilter files (regex etc.) in conf/. Thanks in advance. Regards, MyD

Configuration files

2009-03-24 Thread MyD
Hi @ all, I have 2 plugins and I'd like each plugin to have its own nutch-site.xml configuration. How can this be done? Thanks in advance. Cheers, MyD

URL Scoring

2009-04-24 Thread MyD
Regards, MyD