See comments below.

On Thu, Jul 7, 2011 at 4:31 PM, Cam Bazz <[email protected]> wrote:

> Hello Lewis,
>
> Pardon me for the non-verbose description. I have a set of URLs, namely
> product URLs, in the range of millions.
>

Firstly (this is just a suggestion), I assume that you wish Nutch to fetch
the full page content. Ensure that http.content.limit is set to an
appropriate value to allow this (see the configuration sketch appended at
the end of this message).

> So I want to write my URLs in a flat file, and have Nutch crawl them
> to depth = 1
>

As you describe, you have various seed directories, so I assume that
crawling such a large set of seeds will be a recurring task. IMHO I would
save myself the manual work of running the jobs by hand and write a bash
script to do this for me; this will also let you schedule a once-a-day
update of your crawldb, linkdb, Solr index and so forth. There are plenty
of scripts which have been tested and used throughout the community here:
http://wiki.apache.org/nutch/Archive%20and%20Legacy#Script_Administration
(a rough sketch of such a script, plus the equivalent one-shot crawl
invocation, is also appended at the end of this message).

> However, I might remove URLs from this list, or add new ones. I also
> would like Nutch to revisit each site every day.
>

Check out nutch-site.xml for the crawldb fetch intervals; these values can
be used to accommodate how frequently the various pages change. Once you
have removed URLs (this is going to be a laborious and extremely tedious
task if done manually), you would simply run your script again.

> I would like removed URLs to be deleted, and new ones to be reinjected
> each time Nutch starts.
>

With regards to deleting URLs from your crawldb (and subsequently your
index), I am not sure exactly how best to do this. Can you justify
completely deleting the URLs from the data store? What happens if you add
the URL again the next day? I'm not sure this is a sustainable method for
maintaining your data store/index.

>
> Best Regards,
> -C.B.
>
> On Thu, Jul 7, 2011 at 6:21 PM, lewis john mcgibbney
> <[email protected]> wrote:
> > Hi C.B.,
> >
> > This is way too vague. We really require more information regarding
> > roughly what kind of results you wish to get. It would be a near
> > impossible task for anyone to try and specify a solution to this
> > open-ended question.
> >
> > Please elaborate.
> >
> > Thank you
> >
> > On Thu, Jul 7, 2011 at 12:56 PM, Cam Bazz <[email protected]> wrote:
> >
> >> Hello,
> >>
> >> I have a case where I need to crawl a list of exact URLs, somewhere
> >> in the range of 1 to 1.5M URLs.
> >>
> >> I have written those URLs in numerous files under /home/urls, i.e.
> >> /home/urls/1, /home/urls/2
> >>
> >> Then, by using the crawl command, I am crawling to depth=1.
> >>
> >> Are there any recommendations or general guidelines that I should
> >> follow when making Nutch just fetch and index a list of URLs?
> >>
> >>
> >> Best Regards,
> >> C.B.
> >>
> >
> >
> > --
> > *Lewis*
>

--
*Lewis*
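For reference, the nutch-site.xml overrides touched on above might look
roughly like the following. The values are only illustrative assumptions:
-1 removes the default cap on fetched content, and 86400 seconds
corresponds to the once-a-day re-fetch you asked about.

<!-- conf/nutch-site.xml: example overrides, adjust values to suit -->
<configuration>
  <property>
    <name>http.content.limit</name>
    <value>-1</value>
    <description>-1 disables truncation so the full page content is fetched.</description>
  </property>
  <property>
    <name>db.fetch.interval.default</name>
    <value>86400</value>
    <description>Default re-fetch interval in seconds (here: one day).</description>
  </property>
</configuration>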

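For the depth-1 crawl itself, the one-shot crawl command from your original
mail would look roughly like this. All paths and the Solr URL are
placeholders, -topN should be at least as large as your seed list, and the
-solr option assumes Nutch 1.3 (on older releases you would run solrindex
separately, as in the script further below).

#!/bin/bash
# One-pass crawl: fetch only the injected seed URLs (depth 1) and index them.
NUTCH_HOME=/opt/apache-nutch-1.3   # example install location
SEED_DIR=/home/urls                # flat seed files, e.g. /home/urls/1, /home/urls/2
CRAWL_DIR=/home/crawl              # where crawldb/linkdb/segments will live

$NUTCH_HOME/bin/nutch crawl $SEED_DIR \
    -dir $CRAWL_DIR \
    -depth 1 \
    -topN 2000000 \
    -solr http://localhost:8983/solr/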

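And a very rough sketch of the kind of daily script referred to above,
using the individual Nutch 1.x commands rather than the all-in-one crawl
tool. Again, the paths, Solr URL and cron line are assumptions, error
handling is omitted, and the exact solrindex arguments differ slightly
between releases; the scripts on the wiki page linked above are more
complete.

#!/bin/bash
# daily-crawl.sh -- re-inject the current seed list, fetch whatever is due,
# and refresh the Solr index. Schedule via cron, for example:
#   0 3 * * * /home/crawl/daily-crawl.sh >> /home/crawl/crawl.log 2>&1

NUTCH_HOME=/opt/apache-nutch-1.3
SEED_DIR=/home/urls
CRAWL_DIR=/home/crawl
SOLR_URL=http://localhost:8983/solr/

cd $NUTCH_HOME || exit 1

# 1. (Re)inject the seed list -- new URLs are added, existing ones are kept.
bin/nutch inject $CRAWL_DIR/crawldb $SEED_DIR

# 2. Generate a fetch list of everything due for (re)fetching.
bin/nutch generate $CRAWL_DIR/crawldb $CRAWL_DIR/segments -topN 2000000
SEGMENT=$(ls -d $CRAWL_DIR/segments/2* | tail -1)
# (a real script should check that generate actually produced a new segment)

# 3. Fetch, parse and fold the results back into the crawldb.
bin/nutch fetch $SEGMENT
bin/nutch parse $SEGMENT
bin/nutch updatedb $CRAWL_DIR/crawldb $SEGMENT

# 4. Rebuild the linkdb and push the new segment to Solr.
bin/nutch invertlinks $CRAWL_DIR/linkdb -dir $CRAWL_DIR/segments
bin/nutch solrindex $SOLR_URL $CRAWL_DIR/crawldb $CRAWL_DIR/linkdb $SEGMENT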