is nutch recrawl possible?
Hi, I am crawling some sites using Nutch. My requirement is that when I run a Nutch crawl, it should somehow be able to reuse the data in the webdb populated by a previous crawl. In other words: suppose my crawl is running and I cancel it somewhere in the middle, is there some way I can resume the crawl? I don't know if I can do this at all, but if there is some way, please throw some light on it. TIA. Regards, Pushpesh
Re: is nutch recrawl possible?
It is difficult to answer your question since the vocabulary used may be wrong. You can refetch pages, no problem, but you cannot continue a crashed fetch process. Nutch provides a tool that runs a set of steps: segment generation, fetching, db updating, etc. So first try to run these steps manually instead of using the crawl command. Then you will already get an idea of where you can jump in to grab the data you need. Stefan
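For reference, the individual steps that the crawl command bundles look roughly like this with the 0.7-era command-line tools. This is only a sketch based on the whole-web tutorial, not authoritative syntax: the exact flags and the urls.txt / db / segments names are assumptions, so check the tutorial and the bin/nutch usage output for your version.

    # create the web database once; later runs reuse the same "db" directory
    bin/nutch admin db -create
    # seed it with start URLs (one URL per line in urls.txt)
    bin/nutch inject db -urlfile urls.txt
    # one generate/fetch/update round; repeat this loop once per level of depth
    bin/nutch generate db segments
    s=`ls -d segments/2* | tail -1`   # pick up the newest segment directory
    bin/nutch fetch $s
    # fold the fetched links back into the webdb so the next round can use them
    bin/nutch updatedb db $s
    # finally index the fetched segment
    bin/nutch index $s

Because the db and segments directories persist between runs, repeating the generate/fetch/updatedb loop later reuses the existing webdb instead of recreating it, which is exactly what the one-shot crawl command does not let you do.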
Re: is nutch recrawl possible?
Hi Stefan, thanks for the lightning-fast reply. I was amazed to see such a quick response; I really appreciate it. What I am actually looking for is this: suppose I run a crawl for some sites, say 5, and to some depth, say 2. What I want is that the next time I run a crawl, it should reuse the webdb contents it populated the first time. (Assuming a successful crawl. You are right that a crawl that suddenly broke down won't work, as it has lost the integrity of its data.) As you said, we can run the tools provided by Nutch to do the crawl step by step, but isn't there some way I can reuse the existing crawl data? Maybe it involves changing code, but that's OK. Just one more quick question: why does every crawl need a new directory, and why isn't there an option to at least reuse the webdb? Maybe I am asking something silly, but I am clueless :-( Or, as you said, maybe what I can do is explore the steps you mentioned and get what I need. Thanks again, Pushpesh
Re: is nutch recrawl possible?
I still do not clearly understand your plans, sorry. However, pages from the webdb are recrawled every 30 days (this is configurable in nutch-default.xml). The new folders are the so-called segments, and you can put them in the trash after 30 days. So what you can do is either never update your webdb with the fetched segment, which means no new URLs are added, or alternatively use a URL filter. You will find a lot of posts in the mail archive regarding these issues. Stefan
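If the 30-day interval needs changing, the usual approach is to override the property in conf/nutch-site.xml rather than editing nutch-default.xml directly. A minimal sketch; the property name db.default.fetch.interval (a number of days) and the <nutch-conf> wrapper are taken from memory of the 0.7-era nutch-default.xml, so copy the exact name and enclosing element from your own nutch-default.xml, and merge this into an existing nutch-site.xml rather than overwriting it:

    # override the default refetch interval (example: 7 days)
    cat > conf/nutch-site.xml <<'EOF'
    <?xml version="1.0"?>
    <nutch-conf>
      <property>
        <name>db.default.fetch.interval</name>
        <value>7</value>
        <description>Default number of days between re-fetches of a page.</description>
      </property>
    </nutch-conf>
    EOF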
Re: is nutch recrawl possible?
Actually, I wanted to reuse the processing I do in a particular crawl for future crawls, so as to avoid downloading pages that are not of interest to me. Here is an example:
1. Suppose I am crawling the http://www.abc.com website.
2. It gets injected into the webdb, and the fetchlist tool populates the fetchlist in the segment dir from the webdb.
3. The Fetcher creates FetcherThreads, which download the content of each page.
4. Once I have downloaded a page, I analyse it and may want to mark it as blocked (because I find it useless for me) and store this information persistently, so that when I crawl the same site next time, it remembers that I blocked the page and skips downloading that URL.
So basically it is like this: I run a crawl, and suppose that out of 100 pages in total I mark 60 as blocked. After this crawl finishes, I run the same crawl again, but this time I want those 60 URLs not to be downloaded, since I marked them as blocked. Can I do this somewhere in Nutch? Maybe I could assign a very low or zero score to these URLs and set my cut-off score above that. But the problem with the crawl command is that every time I run it, it requires that the directory not already exist, and hence my previous data can't be used. The steps you suggested do seem valuable, though, and maybe I will have to write my own CrawlTool to make it behave as I really need, so I think I have the clue and just need to work it out. Thanks for the valuable info and your precious time. Hope I am clearer this time :-) Regards, Pushpesh
P.S. For example, suppose I crawl the website www.abc.com and find some links in it; I then assign my own score (I have done code changes for this already) to the URLs found on the www.abc.com site before fetching their contents.
Re: is nutch recrawl possible?
About this blocking: you can try to use the URL filters and change the filter between each generate/fetch round, for example:
+^http://www.abc.com
-^http://www.bbc.co.uk

Pushpesh Kr. Rajwanshi wrote:
Oh, this is pretty good and quite helpful material, just what I wanted. Thanks, Håvard. It seems like this will help me write the code for the stuff I need :-) Thanks and Regards, Pushpesh

On 12/19/05, Håvard W. Kongsgård wrote:
Try using the whole-web fetching method instead of the crawl method.
http://lucene.apache.org/nutch/tutorial.html#Whole-web+Crawling
http://wiki.media-style.com/display/nutchDocu/quick+tutorial
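A concrete sketch of how such a filter file could be switched between rounds; the file name is an assumption (conf/regex-urlfilter.txt is what the 0.7-era whole-web tools read, while the one-shot crawl command reads conf/crawl-urlfilter.txt), so check the files in your conf/ directory. Rules are applied top-down and the first match decides, so the block rules go before the accept rules:

    # rewrite the regex URL filter before the next generate/fetch round
    cat > conf/regex-urlfilter.txt <<'EOF'
    # block this site explicitly
    -^http://www\.bbc\.co\.uk
    # accept the site we are crawling
    +^http://www\.abc\.com
    # reject everything else
    -.
    EOF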
Re: is nutch recrawl possible?
Hmmm... actually my requirement is a bit more complex than it seems, so URL filters alone probably won't do. I am not filtering URLs based only on a domain name; within a domain I want to discard some URLs, and since they don't actually follow a pattern, I can't use URL filters. Otherwise URL filters would have done a great job. Thanks anyway, Pushpesh
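One possible workaround, sketched under the assumption that the blocked pages are recorded as exact URLs in a plain-text file (blocked-urls.txt is a made-up name for whatever your own analysis code writes out): even when the URLs follow no pattern, each exact URL can be turned into an anchored regex rule mechanically, so the regex filter can still do the blocking.

    # sketch: turn a list of exact URLs into block rules for the regex filter
    # blocked-urls.txt: one full URL per line, maintained by your own analysis code
    {
      # prefix each URL with "-^", anchor it with "$", and escape dots;
      # extend the escaping if your URLs contain other regex metacharacters
      sed -e 's/\./\\./g' -e 's|^|-^|' -e 's|$|$|' blocked-urls.txt
      # then the normal accept rule for the site, and reject everything else
      echo '+^http://www\.abc\.com'
      echo '-.'
    } > conf/regex-urlfilter.txt

Regenerating this file before each generate run means the blocked pages are simply never put on a fetchlist again, while the rest of the site is still crawled.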
Re: is nutch recrawl possible?
Pushpesh, we extended Nutch with a whitelist filter and you might find it useful. Check the comments from Matt Kangas here: http://issues.apache.org/jira/browse/NUTCH-87?page=all
--Flo