Hi,
I am crawling some sites using Nutch. My requirement is that when I run a Nutch crawl, it should somehow be able to reuse the data in the webdb populated by a previous crawl.
In other words, my question is: suppose my crawl is running and I cancel it somewhere in the middle, is there some way I can resume it?
It is difficult to answer your question since the vocabulary you use may be wrong.
You can refetch pages, no problem. But you cannot continue a crashed fetch process.
Nutch provides a tool that runs a set of steps: segment generation, fetching, db updating, etc.
So you may first try to run the steps individually.
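The individual steps that tool chains together can be run one at a time from the command line. A rough sketch, assuming the 0.7-era tool names and a webdb directory called `db` (exact arguments differ between Nutch versions, so run `bin/nutch` without arguments to see the tools in your release):

```sh
# create a fresh webdb (only once; skip this to reuse an existing one)
bin/nutch admin db -create

# inject seed URLs from a flat file into the webdb
bin/nutch inject db -urlfile urls.txt

# generate a fetchlist, creating a new segment directory
bin/nutch generate db segments

# fetch the pages listed in the newest segment
bin/nutch fetch segments/<segment-dir>

# fold the fetch results back into the webdb
bin/nutch updatedb db segments/<segment-dir>
```

Because each step is a separate invocation, a crash only loses the step that was running; the webdb built up by earlier rounds stays usable.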
Hi Stefan,
Thanks for the lightning-fast reply. I was amazed to see such a quick response; I really appreciate it.
What I am really looking for is this: suppose I run a crawl for some number of sites, say 5, to some depth, say 2. Then the next time I run a crawl, it should reuse the webdb.
I still do not clearly understand your plans, sorry. However, pages from the webdb are recrawled every 30 days (the interval is configurable in nutch-default.xml).
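For reference, that interval is controlled by a property along these lines (the name below matches the 0.7-era nutch-default.xml; override it in nutch-site.xml rather than editing the defaults file):

```xml
<property>
  <name>db.default.fetch.interval</name>
  <value>30</value>
  <description>Default number of days between re-fetches of a page.</description>
</property>
```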
The new folders are the so-called segments, and you can put them in the trash after 30 days.
So what you can do is, first, never update your webdb
Actually, I wanted to reuse the processing done in a particular crawl for future crawls, so as to avoid downloading pages that are not of interest to me.
Here is an example:
1. Suppose I am crawling the http://www.abc.com website.
2. Then this gets injected into the webdb, and the fetchlist tool populates
To block pages like this you can try the urlfilters, changing the filter between each generate/fetch round:
+^http://www.abc.com
-^http://www.bbc.co.uk
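For context, lines like these go in conf/crawl-urlfilter.txt (for the one-shot crawl tool) or conf/regex-urlfilter.txt. Patterns are tried top to bottom and the first match wins: `+` accepts the URL, `-` rejects it. A fuller file might look like this (the trailing catch-all is an assumption about how strict you want to be):

```
# first matching pattern decides: '+' accept, '-' reject
+^http://www.abc.com
-^http://www.bbc.co.uk
# skip obviously non-HTML content
-\.(gif|jpg|png|zip|gz)$
# reject anything not matched above
-.
```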
Pushpesh Kr. Rajwanshi wrote:
Oh, this is pretty good and quite helpful material, exactly what I wanted. Thanks, Havard, for this. Seems like this will help me.
Hmmm... actually my requirement is a bit more complex than it seems, so URL filters alone probably won't do. I am not filtering URLs based only on a domain name; within a domain I also want to discard some URLs, and since those don't follow a pattern, I can't use the URL filters.
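One way around pattern-less filtering is an explicit whitelist: keep a set (loaded from a file, say) of the exact hosts you crawl and the exact paths you want to drop, and filter against that instead of regexes. A minimal standalone sketch of the idea in plain Java (not wired into Nutch's URLFilter plugin machinery; the class and method names here are made up for illustration, though the return-the-URL-or-null convention mirrors how Nutch filters signal accept/reject):

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/** Sketch of a whitelist-style URL filter: exact hosts and paths, no patterns. */
public class WhitelistFilter {
    private final Set<String> allowedHosts;
    private final Set<String> blockedPaths;

    public WhitelistFilter(Set<String> allowedHosts, Set<String> blockedPaths) {
        this.allowedHosts = allowedHosts;
        this.blockedPaths = blockedPaths;
    }

    /** Returns the URL unchanged if it passes, or null to drop it. */
    public String filter(String urlString) {
        try {
            URL url = new URL(urlString);
            if (!allowedHosts.contains(url.getHost())) {
                return null;            // outside the sites we crawl
            }
            if (blockedPaths.contains(url.getPath())) {
                return null;            // an explicitly unwanted page
            }
            return urlString;
        } catch (MalformedURLException e) {
            return null;                // unparseable URLs are dropped
        }
    }

    public static void main(String[] args) {
        WhitelistFilter f = new WhitelistFilter(
            new HashSet<>(Arrays.asList("www.abc.com")),
            new HashSet<>(Arrays.asList("/private/report.html")));
        System.out.println(f.filter("http://www.abc.com/index.html"));
        System.out.println(f.filter("http://www.abc.com/private/report.html"));
        System.out.println(f.filter("http://www.bbc.co.uk/"));
    }
}
```

The lookup lists could be regenerated between crawl rounds, which matches the "change the filter between each generate/fetch" suggestion above.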
Pushpesh,
We extended Nutch with a whitelist filter, and you might find it useful.
Check the comments from Matt Kangas here:
http://issues.apache.org/jira/browse/NUTCH-87?page=all
--Flo