On Wed, May 4, 2011 at 6:22 PM, Kelvin <[email protected]> wrote:

> Hi Gabriele,
>
> Thank you for your help. I am sorry, I am a newbie to nutch. If I crawl the
> whole wikipedia, the whole wikipedia will be stored in the crawldb of my
> server?
>

I think so (I'm also a newbie).

> And this will take up a very big space?
>

Less than the actual size of the site, perhaps, but yes. However, you only
need to maintain the index and can empty the crawldb. I've posted a script
<http://wiki.apache.org/nutch/Whole-Web%20Crawling%20incremental%20script>
which crawls and indexes incrementally. So you crawl (fetch and parse) 10
pages and then you index them. At that point you don't need the crawldb
anymore and it can be deleted. In the indexing step, or in the parser, or
somewhere in between, you implement your football-selecting logic and choose
what makes it into the index. This way your crawldb's maximum size is that
of 10 pages.

> I also need to crawl youtube, to look for videos whose metatags contain
> "Football", so this will be very large too?
>
> Best regards,
> Kelvin
>
> ________________________________
> From: Gabriele Kahlout <[email protected]>
> To: [email protected]; Kelvin <[email protected]>
> Sent: Wednesday, 4 May 2011 11:34 PM
> Subject: Re: Can I custom crawl using Nutch?
>
> On Wed, May 4, 2011 at 5:20 PM, Kelvin <[email protected]> wrote:
>
> > Hello,
> >
> > I would like to crawl wikipedia using Nutch, but as it is too large, I
> > would only like to crawl pages that are related to a particular subject.
> >
> > For example, I would like to crawl for webpages of wikipedia that contain
> > the term "Football". Is this possible using Nutch?
> >
>
> How will you know that the page contains football before you fetch it,
> parse it, and analyze it?
>
> In the worst case (and I think also the best case) you will have to fetch
> all wikipedia pages. You can choose what to index, but not what to crawl
> (EXPERTS CORRECT ME), because as I said you need to analyze the whole page
> content to figure out if it contains football.
>
> > Thank you for your kind help.
> >
>
> --
> Regards,
> K. Gabriele
>
> --- unchanged since 20/9/10 ---
> P.S. If the subject contains "[LON]" or the addressee acknowledges the
> receipt within 48 hours then I don't resend the email.
> subject(this) ∈ L(LON*) ∨ ∃x. (x ∈ MyInbox ∧ Acknowledges(x, this) ∧
> time(x) < Now + 48h) ⇒ ¬resend(I, this).
>
> If an email is sent by a sender that is not a trusted contact or the email
> does not contain a valid code then the email is not received. A valid code
> starts with a hyphen and ends with "X".
> ∀x. x ∈ MyInbox ⇒ from(x) ∈ MySafeSenderList ∨ (∃y. y ∈ subject(x) ∧ y ∈
> L(-[a-z]+[0-9]X)).
>

--
Regards,
K. Gabriele
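
To make the "crawl 10 pages, index them, then throw the fetched data away" idea from the reply more concrete, here is a rough Python sketch. It is not Nutch code and does not call any Nutch API: `batches`, `crawl_incrementally`, `FAKE_PAGES`, and the `fetch` callable are all hypothetical names made up for illustration, and a plain dict stands in for both the temporary store and the index. It only shows that the temporary store never holds more than one batch, while the keyword filter decides what reaches the index.

```python
from typing import Callable, Dict, Iterable, Iterator, List, Tuple


def batches(urls: Iterable[str], size: int) -> Iterator[List[str]]:
    """Yield successive lists of at most `size` URLs."""
    batch: List[str] = []
    for url in urls:
        batch.append(url)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:  # last, possibly shorter, batch
        yield batch


def crawl_incrementally(
    urls: Iterable[str],
    fetch: Callable[[str], str],
    keyword: str,
    batch_size: int = 10,
) -> Tuple[Dict[str, str], int]:
    """Fetch pages in small batches, index only matching pages, and
    empty the temporary store after each batch.

    Returns the index and the largest number of pages the temporary
    store ever held at once.
    """
    index: Dict[str, str] = {}
    peak = 0
    needle = keyword.lower()
    for batch in batches(urls, batch_size):
        # Every page must be fetched before we can tell whether it matches.
        store = {url: fetch(url) for url in batch}
        peak = max(peak, len(store))
        for url, text in store.items():
            if needle in text.lower():  # the "football selecting logic"
                index[url] = text
        store.clear()  # the batch is indexed, so the fetched data can go
    return index, peak


# Hypothetical stand-in for real pages; a real crawl would fetch over HTTP.
FAKE_PAGES = {
    "wiki/Football": "Football is a family of team sports.",
    "wiki/Chess": "Chess is a board game.",
    "wiki/FIFA": "FIFA governs association football.",
}

index, peak = crawl_incrementally(FAKE_PAGES, FAKE_PAGES.get, "football", batch_size=2)
print(sorted(index))  # ['wiki/FIFA', 'wiki/Football']
print(peak)           # 2
```

Note that all three pages are still fetched; only the storage is bounded by the batch size, which matches the point in the quoted reply that you can choose what to index but not what to crawl.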

