Thank you Lewis, this has been very illuminating, especially regarding deleting documents.
Best.

On Thu, Jul 7, 2011 at 6:51 PM, lewis john mcgibbney <[email protected]> wrote:
> See comments below
>
> On Thu, Jul 7, 2011 at 4:31 PM, Cam Bazz <[email protected]> wrote:
>
>> Hello Lewis,
>>
>> Pardon me for the non-verbose description. I have a set of URLs, namely
>> product URLs, in the range of millions.
>>
>
> Firstly (this is just a suggestion), I assume that you wish Nutch to
> fetch the full page content. Ensure that http.content.limit is set to an
> appropriate limit to allow this.
>
>
>> So I want to write my URLs in a flat file and have Nutch crawl them
>> to depth = 1.
>>
>
> As you describe, you have various seed directories, so crawling a large
> set of seeds will be a recurring task. IMHO I would save myself the manual
> work of running the jobs and write a bash script to do this for me; this
> will also enable you to schedule a once-a-day update of your crawldb,
> linkdb, Solr index and so forth. There are plenty of scripts which have
> been tested and used throughout the community here:
> http://wiki.apache.org/nutch/Archive%20and%20Legacy#Script_Administration
>
>
>> However, I might remove URLs from this list, or add new ones. I also
>> would like Nutch to revisit each site every day.
>>
>
> Check out nutch-site for crawldb fetch intervals; these values can be used
> to accommodate the dynamism of various pages. Once you have removed URLs
> (this is going to be a laborious and extremely tedious task if done
> manually), you would simply run your script again.
>
>> I would like removed URLs to be deleted, and new ones to be reinjected
>> each time Nutch starts.
>>
>
> With regards to deleting URLs in your crawldb (and subsequently index), I am
> not sure of this exactly. Can you justify completely deleting the URLs from
> the data store? What happens if you add the URL in again the next day? I'm
> not sure if this is a sustainable method for maintaining your data
> store/index.
>
>>
>> Best Regards,
>> -C.B.
>>
>> On Thu, Jul 7, 2011 at 6:21 PM, lewis john mcgibbney
>> <[email protected]> wrote:
>> > Hi C.B.,
>> >
>> > This is way too vague. We really require more information regarding
>> > roughly what kind of results you wish to get. It would be a near
>> > impossible task for anyone to try and specify a solution to this
>> > open-ended question.
>> >
>> > Please elaborate.
>> >
>> > Thank you
>> >
>> > On Thu, Jul 7, 2011 at 12:56 PM, Cam Bazz <[email protected]> wrote:
>> >
>> >> Hello,
>> >>
>> >> I have a case where I need to crawl a list of exact URLs, somewhere
>> >> in the range of 1 to 1.5M URLs.
>> >>
>> >> I have written those URLs in numerous files under /home/urls, i.e.
>> >> /home/urls/1 /home/urls/2
>> >>
>> >> Then by using the crawl command I am crawling to depth=1.
>> >>
>> >> Are there any recommendations or general guidelines that I should
>> >> follow when making Nutch just fetch and index a list of URLs?
>> >>
>> >>
>> >> Best Regards,
>> >> C.B.
>> >>
>> >
>> >
>> > --
>> > *Lewis*
>> >
>>
>
>
> --
> *Lewis*
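Below is a minimal sketch of the kind of daily bash driver script suggested above, assuming a local (single-machine) Nutch 1.x runtime run from its install directory, seed files under /home/urls, crawl data under /home/crawl, and a Solr instance at http://localhost:8983/solr/. The crawl directory and Solr URL are illustrative placeholders, not values taken from the thread.

#!/bin/bash
# Sketch of a once-a-day recrawl of a fixed seed list at depth 1.
SEED_DIR=/home/urls                      # flat files of product URLs, e.g. /home/urls/1, /home/urls/2
CRAWL_DIR=/home/crawl                    # placeholder location for crawldb/linkdb/segments
SOLR_URL=http://localhost:8983/solr/     # placeholder Solr endpoint

# (Re)inject the current seed list into the crawldb; already-known URLs are kept.
bin/nutch inject $CRAWL_DIR/crawldb $SEED_DIR

# Generate a fetch list from URLs that are due for (re)fetching and pick the new segment.
bin/nutch generate $CRAWL_DIR/crawldb $CRAWL_DIR/segments
SEGMENT=$CRAWL_DIR/segments/$(ls $CRAWL_DIR/segments | sort | tail -1)

# Fetch, parse (separate step; skip if fetcher.parse is true), and update the crawldb.
bin/nutch fetch $SEGMENT
bin/nutch parse $SEGMENT
bin/nutch updatedb $CRAWL_DIR/crawldb $SEGMENT

# Rebuild the linkdb and push the segment into Solr.
bin/nutch invertlinks $CRAWL_DIR/linkdb -dir $CRAWL_DIR/segments
bin/nutch solrindex $SOLR_URL $CRAWL_DIR/crawldb $CRAWL_DIR/linkdb $SEGMENT

Run from cron once a day, this covers the inject / generate / fetch / parse / updatedb / invertlinks / solrindex cycle; -topN limits, locking and error handling are left out for brevity.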

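The two nutch-site.xml properties touched on above (full page content and the daily refetch) would look roughly like the snippet below. The values are only examples: -1 removes the per-page content cap entirely, and 86400 seconds gives a once-a-day refetch interval.

<!-- Sketch of conf/nutch-site.xml overrides; values are illustrative. -->
<property>
  <name>http.content.limit</name>
  <!-- -1 = no limit, so full page content is fetched -->
  <value>-1</value>
</property>
<property>
  <name>db.fetch.interval.default</name>
  <!-- seconds between refetches; 86400 = one day -->
  <value>86400</value>
</property>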

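On the open question of deleting removed URLs, which the thread leaves unresolved: one possible approach (my own assumption, not something recommended above) is to add exclusion patterns for the retired URLs to conf/regex-urlfilter.txt, rewrite the crawldb through the filters with mergedb, and delete the matching documents from Solr by query. A rough sketch, with placeholder paths and an example.com URL standing in for a real product URL:

# Re-filter the crawldb so URLs excluded by regex-urlfilter.txt are dropped.
bin/nutch mergedb /home/crawl/crawldb_filtered /home/crawl/crawldb -filter
rm -r /home/crawl/crawldb && mv /home/crawl/crawldb_filtered /home/crawl/crawldb

# Delete the same documents from the Solr index (field name depends on your schema).
curl "http://localhost:8983/solr/update?commit=true" \
  -H "Content-Type: text/xml" \
  --data-binary '<delete><query>url:"http://www.example.com/retired-product"</query></delete>'

Whether this is worth doing at all is exactly the sustainability question raised above.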