On Thu, Mar 24, 2011 at 1:46 PM, Gabriele Kahlout <[email protected]> wrote:
> On Thu, Mar 24, 2011 at 1:36 PM, McGibbney, Lewis John <[email protected]> wrote:
>
>> Hi Gabriele,
>>
>> Out of curiosity, how large is your crawl job? How many URLs are you
>> fetching on each increment? Is it a continuous crawl job?
>
> I guess the -topN 1 triggered your interest. I was fetching only one local
> page for testing. Now I'm testing a crawl of Simple Wikipedia with -topN
> 100. I'm also trying to figure out whether my $3 represents the depth of
> the crawl or not.
> It does for sure if the number of urls is <= -topN, but when doing what I'm
> trying (incremental crawling) I'd like all injected urls to be fetched, in
> topN increments, rather than starting to fetch urls found in the previous
> iteration's topN urls.

It indeed is this way. I guess my options would be:

1. Use a scoring plugin that assigns a lower score to links than the initial
   score, so that with -topN the urls from the urls list are retrieved before
   links added to the db after fetching. My understanding is that the
   OpicScoringFilter right now assigns 0 to start with, so all urls are equal
   and the hashtable works more like a LIFO, hence links are crawled before
   urls in the list.
2. Include inject in the loop and have the number of urls in the file == topN,
   such that one iteration is enough for all urls in the file, and then inject
   again. Once the whole list has been fetched this way (with depth=0) one can
   iterate for depth if desired. I guess this solution is also known as
   merging crawls.

I'll be trying 2. Meanwhile I've changed the script to the attached.

>> Lewis
>> ________________________________________
>> From: Gabriele Kahlout [[email protected]]
>> Sent: 24 March 2011 12:30
>> To: [email protected]
>> Cc: [email protected]; Claudio Martella; [email protected]
>> Subject: Re: Index while crawling
>>
>> This seems to work.
>>
>> i=0
>> while true;
>> do
>> if [[ $i -ge $3 ]]

-- 
Regards,
K. Gabriele

--- unchanged since 20/9/10 ---
P.S. If the subject contains "[LON]" or the addressee acknowledges the
receipt within 48 hours then I don't resend the email.
subject(this) ∈ L(LON*) ∨ ∃x. (x ∈ MyInbox ∧ Acknowledges(x, this) ∧
time(x) < Now + 48h) ⇒ ¬resend(I, this).

If an email is sent by a sender that is not a trusted contact or the email
does not contain a valid code then the email is not received. A valid code
starts with a hyphen and ends with "X".
∀x. x ∈ MyInbox ⇒ from(x) ∈ MySafeSenderList ∨ (∃y. y ∈ subject(x) ∧ y ∈
L(-[a-z]+[0-9]X)).
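For what it's worth, option 2 above (inject inside the loop, one topN-sized
batch of urls per iteration) could be sketched roughly as below. This is only
a sketch under assumptions, not the attached script: the function name
crawl_in_chunks, the helper run, and the NUTCH and DRY_RUN variables are made
up for illustration, and it assumes the Nutch 1.x inject, generate, fetch and
updatedb commands. By default it only prints the commands. Whether generate
then picks exactly the freshly injected urls still depends on scoring, which
this sketch does not address.

```shell
# Hypothetical sketch of option 2: inject the seed list in topN-sized chunks.
# NUTCH, DRY_RUN, run and crawl_in_chunks are illustrative names (assumptions).
NUTCH=${NUTCH:-bin/nutch}
DRY_RUN=${DRY_RUN:-1}   # 1 = print the commands instead of running them

run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$@"
  else
    "$@"
  fi
}

# usage: crawl_in_chunks <seed_file> <crawldb> <segments_dir> <topN>
crawl_in_chunks() {
  local seeds=$1 crawldb=$2 segments=$3 topn=$4
  local workdir chunk chunkdir segment
  workdir=$(mktemp -d) || return 1
  # Split the seed list into files holding at most topN urls each.
  split -l "$topn" "$seeds" "$workdir/chunk." || return 1
  for chunk in "$workdir"/chunk.*; do
    # inject expects a directory of url files, so give each chunk its own.
    chunkdir="$chunk.d"
    mkdir "$chunkdir" && mv "$chunk" "$chunkdir/urls.txt"
    run "$NUTCH" inject "$crawldb" "$chunkdir"
    run "$NUTCH" generate "$crawldb" "$segments" -topN "$topn"
    if [ "$DRY_RUN" = "1" ]; then
      segment="$segments/NEWEST"   # placeholder, no segment exists in a dry run
    else
      segment=$(ls -d "$segments"/* | tail -n 1)   # newest segment sorts last
    fi
    run "$NUTCH" fetch "$segment"
    # A separate parse step may be needed here, depending on configuration.
    run "$NUTCH" updatedb "$crawldb" "$segment"
  done
  rm -rf "$workdir"
}
```

With a hypothetical seeds.txt this would be called as
crawl_in_chunks seeds.txt crawl/crawldb crawl/segments 100, and with
DRY_RUN=0 once the printed commands look right.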

