Hi Gabriele,

Thank you for your help. I am sorry, I am a newbie to Nutch. If I crawl the whole of Wikipedia, will the whole of Wikipedia be stored in the crawldb on my server?
And will this take up a very large amount of space? I also need to crawl YouTube to look for videos whose meta tags contain "Football", so will this be very large too?

Best regards,
Kelvin

________________________________
From: Gabriele Kahlout <[email protected]>
To: [email protected]; Kelvin <[email protected]>
Sent: Wednesday, 4 May 2011 11:34 PM
Subject: Re: Can I custom crawl using Nutch?

On Wed, May 4, 2011 at 5:20 PM, Kelvin <[email protected]> wrote:

> Hello,
>
> I would like to crawl wikipedia using Nutch, but as it is too large, I
> would only like to crawl pages that are related to a particular subject.
>
> For example, I would like to crawl for webpages of wikipedia that contain
> the term "Football". Is this possible using Nutch?
>

How will you know that a page contains "football" before you fetch it, parse it, and analyze it? In the worst case (and I think also the best case) you will have to fetch all Wikipedia pages. You can choose what to index, but not what to crawl (EXPERTS CORRECT ME), because, as I said, you need to analyze the whole page content to figure out whether it contains "football".

>
> Thank you for your kind help.
>

--
Regards,
K. Gabriele
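To illustrate the point made in the reply above (a page has to be fetched and parsed before you can tell whether it mentions a term, so the choice is about what to index rather than what to crawl), here is a minimal Python sketch. It does not use Nutch or its API. The function names, the sample URLs, and the sample text are all hypothetical and serve only to show the idea.

```python
import re


def contains_term(text, term):
    """Return True if `term` occurs in `text` as a whole word, ignoring case."""
    pattern = r"\b" + re.escape(term) + r"\b"
    return re.search(pattern, text, re.IGNORECASE) is not None


def select_for_index(fetched_pages, term):
    """Pick the URLs worth indexing.

    `fetched_pages` maps URL -> parsed page text. Every page in it has
    already been fetched and parsed; the filter can only run afterwards.
    """
    return [url for url, text in fetched_pages.items() if contains_term(text, term)]


# Hypothetical, already-fetched pages (URL -> parsed text).
pages = {
    "https://example.org/wiki/A": "Association football is a team sport.",
    "https://example.org/wiki/B": "Chess is a board game.",
    "https://example.org/wiki/C": "The FOOTBALL league was founded in 1888.",
}

to_index = select_for_index(pages, "football")
print(to_index)
```

In this sketch all three pages had to be downloaded, but only the two that mention the term would be passed on to the indexer.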

