is nutch recrawl possible?

2005-12-19 Thread Pushpesh Kr. Rajwanshi
Hi, I am crawling some sites using Nutch. My requirement is that when I run a Nutch crawl, it should somehow be able to reuse the data in the webdb populated by a previous crawl. In other words, my question is: suppose my crawl is running and I cancel it somewhere in the middle, then is there some way I can

Re: is nutch recrawl possible?

2005-12-19 Thread Stefan Groschupf
It is difficult to answer your question since the vocabulary used may be wrong. You can refetch pages, no problem. But you cannot continue a crashed fetch process. Nutch provides a tool that runs a set of steps such as segment generation, fetching, db updating, etc. So maybe first try to run

Re: is nutch recrawl possible?

2005-12-19 Thread Pushpesh Kr. Rajwanshi
Hi Stefan, Thanks for the lightning-fast reply. I was amazed to see such a quick response and really appreciate it. Actually, what I am really looking for is this: suppose I run a crawl for some sites, say 5, and to some depth, say 2. Then what I want is that the next time I run a crawl, it should reuse the webdb

Re: is nutch recrawl possible?

2005-12-19 Thread Stefan Groschupf
I still do not clearly understand your plans, sorry. However, pages from the webdb are recrawled every 30 days (this is configurable in nutch-default.xml). The new folders are so-called segments, and you can put them in the trash after 30 days. So what you can do is first never update your webdb
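
Editorial illustration of the 30-day refetch rule mentioned above: a minimal Python sketch, not Nutch code. The constant `DEFAULT_FETCH_INTERVAL_DAYS` and the functions `is_due` and `select_due` are hypothetical names made up for this example; the only thing taken from the thread is the 30-day default and the fact that it is configurable.

```python
from datetime import datetime, timedelta

# Assumption: mirrors the 30-day default mentioned in the thread.
# The name is hypothetical and is not a Nutch property.
DEFAULT_FETCH_INTERVAL_DAYS = 30


def is_due(last_fetched, now, interval_days=DEFAULT_FETCH_INTERVAL_DAYS):
    """Return True if a page fetched at `last_fetched` should be refetched at `now`."""
    return now - last_fetched >= timedelta(days=interval_days)


def select_due(pages, now, interval_days=DEFAULT_FETCH_INTERVAL_DAYS):
    """Pick the URLs whose last fetch is at least `interval_days` old.

    `pages` maps URL -> datetime of the last fetch (a stand-in for the webdb).
    """
    return [url for url, last in pages.items() if is_due(last, now, interval_days)]


if __name__ == "__main__":
    now = datetime(2005, 12, 19)
    pages = {
        "http://www.abc.com/": datetime(2005, 11, 1),        # 48 days old
        "http://www.abc.com/new.html": datetime(2005, 12, 10),  # 9 days old
    }
    print(select_due(pages, now))
```

Under this reading, a page already in the webdb is simply skipped when a new fetchlist is generated until its interval has elapsed, which is why older segments can be discarded once they have been superseded.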

Re: is nutch recrawl possible?

2005-12-19 Thread Pushpesh Kr. Rajwanshi
Actually, I wanted to reuse the processing I do in a particular crawl for future crawls, so as to avoid downloading pages that are not of interest to me. Here is an example: 1. Suppose I am crawling the http://www.abc.com website. 2. Then this gets injected into the webdb and the Fetchlist tool populates

Re: is nutch recrawl possible?

2005-12-19 Thread Håvard W. Kongsgård
About this blocking, you can try to use the urlfilters and change the filter between each fetch/generate: +^http://www.abc.com -^http://www.bbc.co.uk Pushpesh Kr. Rajwanshi wrote: Oh, this is pretty good and quite helpful material that I wanted. Thanks Havard for this. Seems like this will help me
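
Editorial illustration of how `+`/`-` regex rules like the two above are typically evaluated: a minimal Python sketch, assuming first-match-wins semantics and rejection when no rule matches. It is not Nutch's implementation; `parse_rules` and `accepts` are hypothetical names, and Python's `re` stands in for Java regular expressions.

```python
import re


def parse_rules(text):
    """Parse lines of the form '+<regex>' (accept) or '-<regex>' (reject).

    Blank lines and lines starting with '#' are skipped.
    """
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        sign, pattern = line[0], line[1:]
        if sign not in ("+", "-"):
            raise ValueError(f"rule must start with '+' or '-': {line!r}")
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules, url):
    """Apply the rules in order; the first matching rule decides.

    Assumption: a URL that matches no rule is rejected.
    """
    for allow, regex in rules:
        if regex.search(url):
            return allow
    return False


if __name__ == "__main__":
    rules = parse_rules("+^http://www.abc.com\n-^http://www.bbc.co.uk")
    for url in ("http://www.abc.com/page.html", "http://www.bbc.co.uk/news"):
        print(url, accepts(rules, url))
```

Because the first matching rule wins, rule order matters: a broad `+` rule placed before a narrower `-` rule would hide the rejection.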

Re: is nutch recrawl possible?

2005-12-19 Thread Pushpesh Kr. Rajwanshi
hmmm... actually my requirement is a bit more complex than it seems, so url filters alone probably would not do. I am not filtering urls based only on some domain name; within a domain I want to discard some urls, and since they do not actually follow a pattern, I cannot use url filters

Re: is nutch recrawl possible?

2005-12-19 Thread Florent Gluck
Pushpesh, We extended Nutch with a whitelist filter and you might find it useful. Check the comments from Matt Kangas here: http://issues.apache.org/jira/browse/NUTCH-87;jsessionid=6F6AD5423357184CF57B51B003201C49?page=all --Flo Pushpesh Kr. Rajwanshi wrote: hmmm... actually my requirement is
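
Editorial illustration of the whitelist idea: a minimal Python sketch, not the code attached to NUTCH-87. It assumes the filter keeps a set of allowed hosts plus an explicit set of rejected URLs, which covers the case raised earlier in the thread where unwanted URLs inside a domain follow no pattern. `make_filter` and its parameters are hypothetical names.

```python
from urllib.parse import urlsplit


def make_filter(allowed_hosts, rejected_urls=()):
    """Build an accept(url) function from explicit lookups instead of regexes.

    allowed_hosts: hosts whose pages may be fetched (the whitelist).
    rejected_urls: exact URLs to discard even on an allowed host.
    """
    allowed = {host.lower() for host in allowed_hosts}
    rejected = set(rejected_urls)

    def accept(url):
        if url in rejected:
            return False
        host = urlsplit(url).hostname  # lowercased by urlsplit, None if absent
        return host is not None and host in allowed

    return accept


if __name__ == "__main__":
    accept = make_filter(
        ["www.abc.com"],
        rejected_urls=["http://www.abc.com/ads/1.html"],
    )
    for url in (
        "http://www.abc.com/index.html",
        "http://www.abc.com/ads/1.html",
        "http://www.bbc.co.uk/",
    ):
        print(url, accept(url))
```

Set lookups stay cheap as the lists grow, which is the main reason to prefer them over a long chain of regular expressions when the entries share no pattern.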