On Thu, Mar 24, 2011 at 4:33 PM, Gabriele Kahlout <[email protected]> wrote:
>
> On Thu, Mar 24, 2011 at 1:46 PM, Gabriele Kahlout <[email protected]> wrote:
>
>> On Thu, Mar 24, 2011 at 1:36 PM, McGibbney, Lewis John <[email protected]> wrote:
>>
>>> Hi Gabriele,
>>>
>>> Out of curiosity, how large is your crawl job? How many URLs are you
>>> fetching on each increment? Is it a continuous crawl job?
>>
>> I guess the -topN 1 triggered your interest. I was fetching only one local
>> page for testing. Now I'm testing a crawl of Simple Wikipedia with -topN
>> 100. I'm also trying to figure out whether my $3 represents the depth of
>> crawls or not.
>> It certainly does if the number of urls is <= -topN, but when doing what
>> I'm trying (incremental crawling) I'd like all injected urls to be fetched,
>> in topN increments, rather than start fetching urls found in the previous
>> iteration's topN urls.
>
> It indeed is this way. I guess my options would be:
>
> 1. Use a scoring plugin that assigns links a lower score than the initial
> score, so that with -topN the urls from the urls list are retrieved before
> links added to the db after fetching. My understanding is that the
> OpicScoringFilter right now assigns 0 to start with, so all urls are equal
> and the hashtable works more like a LIFO; hence links are crawled before
> urls in the list.
>
> Essentially I seconded the thoughts of Julien and Ken here
> <http://search-lucene.com/m/Fi4T8jJiQS&subj=Re+How+to+prioritize+the+fetching+of+outlinks+>.

My objection to this approach, however, is that one modifies the score of a
page just to influence Nutch's fetching speed/priority, although that has
nothing to do with the page's 'effective' score.

> 2. Include inject in the loop and have the number of urls in the file ==
> topN, such that one iteration is enough for all urls, and then inject again.
> Once the whole list has been fetched (with depth=0) one can iterate for
> depth if desired. I guess this solution is aka merging crawls.
>
> I'll be trying 2.
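
For what it's worth, here is a rough sketch of how I read option 2, not the attached script. It splits the seed list into files of at most topN urls and runs inject/generate/fetch/updatedb once per file, each in its own crawl directory, so that outlinks found in one batch cannot compete with seeds from the next. The function names, the DRY_RUN switch and the batch-XX directory layout are my own assumptions for illustration; merging the per-batch crawls afterwards is left out.

```shell
#!/usr/bin/env bash
# Sketch of option 2: crawl the seed list in topN-sized batches.
# Assumptions (not from the thread): crawl_in_batches, run, DRY_RUN and the
# <crawl-dir>/batch-XX layout are hypothetical names used for illustration.

NUTCH=${NUTCH:-bin/nutch}

# Execute a command, or only print it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# crawl_in_batches <seed-file> <topN> <crawl-dir>
crawl_in_batches() {
  local seeds=$1 topn=$2 crawl=$3
  local work batch name dir segment
  work=$(mktemp -d)
  # One file of at most topN urls per batch: batch.aa, batch.ab, ...
  split -l "$topn" "$seeds" "$work/batch."
  for batch in "$work"/batch.*; do
    name=${batch##*.}
    dir="$crawl/batch-$name"
    # inject takes a directory of seed files, so each batch gets its own.
    mkdir "$batch.d" && mv "$batch" "$batch.d/urls.txt"
    run "$NUTCH" inject "$dir/crawldb" "$batch.d"
    run "$NUTCH" generate "$dir/crawldb" "$dir/segments" -topN "$topn"
    # The newest segment is the one generate just created (empty in a dry run).
    segment=$(ls -d "$dir"/segments/* 2>/dev/null | tail -1)
    # Depending on the Nutch version a separate parse step may be needed here.
    run "$NUTCH" fetch "${segment:-NEWEST_SEGMENT}"
    run "$NUTCH" updatedb "$dir/crawldb" "${segment:-NEWEST_SEGMENT}"
  done
  rm -rf "$work"
}
```

Calling it with DRY_RUN=1, e.g. `DRY_RUN=1 crawl_in_batches urls/seeds.txt 100 crawl`, only prints the commands, which is a cheap way to check the batching before touching a real crawldb.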
> Meanwhile I've changed the script to the attached.
>
> I've had an issue <https://issues.apache.org/jira/browse/NUTCH-971> merging a
> merged index with another index.

Other than using the patch, the workaround is to append part-1 to the output
index:

$ bin/nutch merge crawl/temp_indexes/*part-1* crawl/indexes crawl/new_indexes

I'll contribute the script to the wiki once done with it.

>>> Lewis
>>> ________________________________________
>>> From: Gabriele Kahlout [[email protected]]
>>> Sent: 24 March 2011 12:30
>>> To: [email protected]
>>> Cc: [email protected]; Claudio Martella; [email protected]
>>> Subject: Re: Index while crawling
>>>
>>> This seems to work.
>>>
>>> i=0
>>> while true;
>>> do
>>> if [[ $i -ge $3 ]]
>>>
>>> Glasgow Caledonian University is a registered Scottish charity, number
>>> SC021474
>>>
>>> Winner: Times Higher Education’s Widening Participation Initiative of the
>>> Year 2009 and Herald Society’s Education Initiative of the Year 2009.
>>> http://www.gcu.ac.uk/newsevents/news/bycategory/theuniversity/1/name,6219,en.html
>>>
>>> Winner: Times Higher Education’s Outstanding Support for Early Career
>>> Researchers of the Year 2010, GCU as a lead with Universities Scotland
>>> partners.
>>> http://www.gcu.ac.uk/newsevents/news/bycategory/theuniversity/1/name,15691,en.html
>>
>> --
>> Regards,
>> K. Gabriele
>>
>> --- unchanged since 20/9/10 ---
>> P.S. If the subject contains "[LON]" or the addressee acknowledges the
>> receipt within 48 hours then I don't resend the email.
>> subject(this) ∈ L(LON*) ∨ ∃x. (x ∈ MyInbox ∧ Acknowledges(x, this) ∧
>> time(x) < Now + 48h) ⇒ ¬resend(I, this).
>>
>> If an email is sent by a sender that is not a trusted contact or the email
>> does not contain a valid code then the email is not received. A valid code
>> starts with a hyphen and ends with "X".
>> ∀x. x ∈ MyInbox ⇒ from(x) ∈ MySafeSenderList ∨ (∃y. y ∈ subject(x) ∧ y ∈
>> L(-[a-z]+[0-9]X)).

--
Regards,
K. Gabriele

