Hi @ all,
I'd like to write a URLFilter plugin. When I start the crawling process a
ClassNotFoundException is thrown. Below you will find my code / settings.
It would be great if you could help me further. Thanks in advance.
=== $NUTCH_HOME/conf/nutch-site.xml ===
property
Hi @ all,
I started to write my own plugin. I extended the HtmlParseFilter to grab
outlinks to other pages, but it looks like the outlinks are just links
to CSS or JS files, or am I wrong? What is the best way to extract all
outlinks to URLs that are not in the domain MY.DOMAIN.NAME? You
Thank you for the hint. How can this be done with the Segment Reader (Nutch
0.9 API)? Thanks in advance.
Cheers,
MyD
vishal vachhani wrote:
A simple solution would be to dump the segments using the following command and
just write a script which can extract the outlinks present in the documents
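Once the outlinks are available as plain URL strings (for example from a segment dump), separating external links from internal ones is a host comparison. The following is only a minimal sketch using the Java standard library, not Nutch API code; the class name `ExternalOutlinks` and both method names are made up for illustration, and it assumes "in the domain" means the host equals MY.DOMAIN.NAME or is a subdomain of it.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of Nutch.
final class ExternalOutlinks {

    /** True if host equals the domain or is a subdomain of it (case-insensitive). */
    static boolean inDomain(String host, String domain) {
        String h = host.toLowerCase(Locale.ROOT);
        String d = domain.toLowerCase(Locale.ROOT);
        return h.equals(d) || h.endsWith("." + d);
    }

    /** Keeps only the absolute URLs whose host lies outside the given domain. */
    static List<String> external(List<String> urls, String domain) {
        List<String> result = new ArrayList<>();
        for (String url : urls) {
            try {
                String host = new URI(url).getHost();
                // Relative links have no host and are skipped in this sketch.
                if (host != null && !inDomain(host, domain)) {
                    result.add(url);
                }
            } catch (URISyntaxException e) {
                // Malformed links are skipped as well.
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> links = Arrays.asList(
            "http://MY.DOMAIN.NAME/style.css",
            "http://www.MY.DOMAIN.NAME/page.html",
            "http://example.org/other.html");
        System.out.println(external(links, "MY.DOMAIN.NAME"));
    }
}
```

With the three sample links in `main`, only the example.org link is treated as external.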
Hi @ all,
is it possible to limit Nutch's crawling process to the seed URLs? E.g. I
have 1000 seed URLs and I want to crawl just these domains. Thanks in
advance.
Regards,
MyD
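The idea of restricting a crawl to the seed hosts can be sketched in plain Java as follows. This is an illustration under assumptions, not the domain-urlfilter plugin and not Nutch code: the class `SeedHostFilter` is hypothetical, it matches on the exact host rather than the registered domain, and its `filter` method only mirrors the convention of Nutch's `URLFilter` interface, where returning null means the URL is rejected.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical stand-alone class; a real plugin would implement Nutch's URLFilter.
final class SeedHostFilter {
    private final Set<String> seedHosts = new HashSet<>();

    /** Collects the lower-cased hosts of all seed URLs that can be parsed. */
    SeedHostFilter(List<String> seedUrls) {
        for (String seed : seedUrls) {
            String host = hostOf(seed);
            if (host != null) {
                seedHosts.add(host);
            }
        }
    }

    /** Returns the lower-cased host of the URL, or null if it has none or is malformed. */
    static String hostOf(String url) {
        try {
            String host = new URI(url).getHost();
            return host == null ? null : host.toLowerCase(Locale.ROOT);
        } catch (URISyntaxException e) {
            return null;
        }
    }

    /** Returns the URL unchanged if its host is a seed host, otherwise null. */
    String filter(String url) {
        String host = hostOf(url);
        return host != null && seedHosts.contains(host) ? url : null;
    }

    public static void main(String[] args) {
        SeedHostFilter f = new SeedHostFilter(
            Arrays.asList("http://example.com/", "http://example.org/start.html"));
        System.out.println(f.filter("http://example.com/a/b.html")); // accepted
        System.out.println(f.filter("http://other.net/"));           // rejected (null)
    }
}
```

A hash set keeps the lookup cheap even with 1000 seed hosts. Whether subdomains of a seed host should also pass is a design decision this sketch leaves out.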
--
View this message in context:
http://www.nabble.com/Limit-Nutch-Crawl-to-Seed-URLs-tp22493314p22493314.html
Where can I find the domain urlfilter? I'm using the branch 0.9...
Cheers,
Markus
Dennis Kubes-2 wrote:
There is a domain-urlfilter that should help do what you are looking for.
Dennis
MyD wrote:
Hi @ all,
is it possible to limit Nutch's crawling process to the seed URLs? E.g. I
. Thanks in advance.
Regards,
MyD
--
View this message in context:
http://www.nabble.com/synchronized-File-Writer-tp22531603p22531603.html
Sent from the Nutch - User mailing list archive at Nabble.com.
Hi @ all,
I'd like to know if it is possible to implement one's own SAX parser for a
plugin, and where this could be done, e.g. at which extension point. Thanks in
advance.
Cheers,
MyD
--
View this message in context:
http://www.nabble.com/Implementing-a-custom-SAX---DOM-parser
This is an interesting question. If you know how to run the Crawl process from
another Java program, please let me know. Thanks in advance.
n_developer wrote:
Generally a Nutch crawl is done through Cygwin. If I don't want to run Cygwin,
and I want to crawl an application from an application
Hi @ all,
where can I set parameters / configurations specific to my own plugin?
Thanks in advance.
Regards,
MyD
--
View this message in context:
http://www.nabble.com/Where-to-put-plugin-specific-parameters---configurations-tp22577145p22577145.html
Hi @ all,
is it possible to set the next fetch schedule for a URL in another crawl
dir?
Example:
crawl.dir.A
- retrieve links and set the fetch schedule but this should go into the
crawl.dir.B
crawl.dir.B
Thanks in advance
Regards,
MyD
Hi ripper,
Thanks, do you know how to do it in Java? I tried, but haven't found
suitable classes. Thanks in advance.
Cheers,
MyD
ripper07 wrote:
Well, you can always write a bash script or a Java class that does
this. Writing a Java class is probably better and easier. you have
?
nutch-site.xml
<property>
<name>plugin.includes</name>
<value>my-plugin|protocol-http|parse-(html|js)|index-basic</value>
<description>
</description>
</property>
I commented out all urlfilter files (regex etc.) in conf/.
Thanks in advance.
Regards,
MyD
Hi @ all,
I have two plugins and I'd like each plugin to have its own nutch-site.xml
configuration. How can this be done? Thanks in advance.
Cheers,
MyD
--
View this message in context:
http://www.nabble.com/Configuration-files-tp22675581p22675581.html
.
Regards,
MyD
--
View this message in context:
http://www.nabble.com/URL-Scoring-tp23211894p23211894.html