Thanks Gabriele,

It has been my intention to get hands-on and begin working on a reliable script that includes the recently added commands, etc.
This is a starting point for me.

Thank you,
Lewis
________________________________________
From: Gabriele Kahlout [[email protected]]
Sent: 27 March 2011 15:24
To: [email protected]
Cc: McGibbney, Lewis John
Subject: Re: Index while crawling

On Fri, Mar 25, 2011 at 3:44 PM, McGibbney, Lewis John <[email protected]> wrote:

> Hi Gabriele,
>
> Would it be worth making this script available on the wiki with an
> explanation of exactly what its purpose is, what it does, and a use case?
>
> When I get a chance I will try it out using Solr as the indexing mechanism.

I've posted the script [1]. It's set up to work with Solr, but I haven't yet posted a Hadoop edition, since I still need to familiarize myself with it.

[1] http://wiki.apache.org/nutch/Whole-Web%20Crawling%20incremental%20script

> Thank you for this
>
> Lewis
> ________________________________________
> From: Gabriele Kahlout [[email protected]]
> Sent: 24 March 2011 15:33
> To: [email protected]
> Cc: McGibbney, Lewis John; [email protected]
> Subject: Re: Index while crawling
>
> It indeed is this way. I guess my options would be:
>
> 1. Use a scoring plugin that assigns a lower score to links than the
> initial score, so that URLs from the urls list are retrieved first using
> -topN, ahead of links added to the db after fetching. My understanding is
> that the OpicScoringFilter right now assigns 0 to start with, so all URLs
> are equal and the hashtable works more like a LIFO, hence links are crawled
> before URLs in the list.
>
> 2. Include inject in the loop and have the number of URLs in the file ==
> topN, such that one iteration is enough for all URLs, and then inject again.
> Once the whole list is fetched (with depth=0), one can iterate for
> depth if desired. I guess this solution is a.k.a. merging crawls.
>
> I'll be trying 2. Meanwhile I've changed the script to the attached.
>
> Glasgow Caledonian University is a registered Scottish charity, number
> SC021474
>
> Winner: Times Higher Education’s Widening Participation Initiative of the
> Year 2009 and Herald Society’s Education Initiative of the Year 2009.
>
> http://www.gcu.ac.uk/newsevents/news/bycategory/theuniversity/1/name,6219,en.html
>
> Winner: Times Higher Education’s Outstanding Support for Early Career
> Researchers of the Year 2010, GCU as a lead with Universities Scotland
> partners.
>
> http://www.gcu.ac.uk/newsevents/news/bycategory/theuniversity/1/name,15691,en.html

--
Regards,
K. Gabriele

--- unchanged since 20/9/10 ---
P.S. If the subject contains "[LON]" or the addressee acknowledges the receipt within 48 hours then I don't resend the email.
subject(this) ∈ L(LON*) ∨ ∃x. (x ∈ MyInbox ∧ Acknowledges(x, this) ∧ time(x) < Now + 48h) ⇒ ¬resend(I, this).

If an email is sent by a sender that is not a trusted contact or the email does not contain a valid code then the email is not received. A valid code starts with a hyphen and ends with "X".
∀x. x ∈ MyInbox ⇒ from(x) ∈ MySafeSenderList ∨ (∃y. y ∈ subject(x) ∧ y ∈ L(-[a-z]+[0-9]X)).
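
For anyone following the thread, option 2 from the 24 March message (inject inside the loop, with the number of URLs per file equal to topN) might be sketched roughly as follows. This is not the script posted on the wiki; it is only a dry-run planner that prints the commands one iteration would issue. The subcommand names and argument order are my understanding of the Nutch 1.x command line and should be checked against your installed version; the directory names, the Solr URL, the example seed URLs, and the helper names `batches` and `iteration_commands` are all hypothetical.

```python
from typing import Iterator, List


def batches(urls: List[str], top_n: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most top_n URLs from the seed list."""
    if top_n <= 0:
        raise ValueError("top_n must be positive")
    for start in range(0, len(urls), top_n):
        yield urls[start:start + top_n]


def iteration_commands(seed_dir: str, segment: str, top_n: int,
                       crawldb: str = "crawl/crawldb",        # hypothetical path
                       linkdb: str = "crawl/linkdb",          # hypothetical path
                       segments: str = "crawl/segments",      # hypothetical path
                       solr_url: str = "http://localhost:8983/solr/"  # hypothetical URL
                       ) -> List[List[str]]:
    """Build the commands for one inject/generate/fetch/index iteration.

    Assumption: subcommands and argument order follow the Nutch 1.x CLI.
    `segment` stands for the directory that `generate` creates; a real
    script would look up the newest entry under `segments` after generating.
    """
    nutch = "bin/nutch"
    return [
        [nutch, "inject", crawldb, seed_dir],
        [nutch, "generate", crawldb, segments, "-topN", str(top_n)],
        [nutch, "fetch", segment],
        [nutch, "parse", segment],
        [nutch, "updatedb", crawldb, segment],
        [nutch, "invertlinks", linkdb, segment],
        [nutch, "solrindex", solr_url, crawldb, linkdb, segment],
    ]


if __name__ == "__main__":
    # Made-up seed list; each batch would be written to its own seed directory.
    seeds = [f"http://example.com/page{i}" for i in range(5)]
    for i, batch in enumerate(batches(seeds, top_n=2)):
        print(f"# batch {i}: {len(batch)} urls")
        for cmd in iteration_commands(f"urls/batch{i}",
                                      "crawl/segments/NEWEST",  # placeholder name
                                      top_n=len(batch)):
            print(" ".join(cmd))
```

Printing the commands instead of running them keeps the sketch safe to try without a Nutch install. Whether `generate -topN` then really selects the freshly injected URLs ahead of previously discovered links depends on the scoring filter, which is the open question in the message above.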

