Hi Gabriele,

Thank you for your help. I am sorry, I am a newbie to Nutch. If I crawl the
whole of Wikipedia, will all of it be stored in the crawldb on my server?


And will this take up a lot of space?

I also need to crawl YouTube to look for videos whose meta tags contain
"Football". Will this be very large too?


Best regards,
Kelvin



________________________________
From: Gabriele Kahlout <[email protected]>
To: [email protected]; Kelvin <[email protected]>
Sent: Wednesday, 4 May 2011 11:34 PM
Subject: Re: Can I custom crawl using Nutch?

On Wed, May 4, 2011 at 5:20 PM, Kelvin <[email protected]> wrote:

> Hello,
>
> I would like to crawl wikipedia using Nutch, but as it is too large, I
> would only like to crawl pages that are related to a particular subject.
>
> For example, I would like to crawl for webpages of wikipedia that contain
> the term "Football". Is this possible using Nutch?
>

How will you know that a page contains "football" before you fetch it, parse
it, and analyze it?

In the worst case (and I think also the best case) you will have to fetch all
Wikipedia pages. You can choose what to index, but not what to crawl (EXPERTS
CORRECT ME), because as I said you need to analyze the whole page content to
figure out whether it contains "football".
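
To illustrate the point, here is a minimal sketch in plain Python (not Nutch
code) of the "fetch everything, then decide what to index" idea. The names
TextExtractor and should_index are hypothetical and made up for this example;
the sketch assumes the page has already been fetched and is available as an
HTML string.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the text of an HTML page, skipping script and style blocks."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)


def should_index(html, term="football"):
    """Return True if the parsed page text contains `term` as a whole word.

    Hypothetical helper: the decision can only be made after the page has
    been fetched and parsed, which is why the crawl itself cannot be
    limited by page content.
    """
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.parts)
    pattern = r"\b" + re.escape(term) + r"\b"
    return re.search(pattern, text, re.IGNORECASE) is not None
```

The check runs on content that has already been downloaded, so it only saves
index space, not crawl bandwidth or fetch time.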




>
> Thank you for your kind help.
>



-- 
Regards,
K. Gabriele

--- unchanged since 20/9/10 ---
P.S. If the subject contains "[LON]" or the addressee acknowledges the
receipt within 48 hours then I don't resend the email.
subject(this) ∈ L(LON*) ∨ ∃x. (x ∈ MyInbox ∧ Acknowledges(x, this) ∧ time(x)
< Now + 48h) ⇒ ¬resend(I, this).

If an email is sent by a sender that is not a trusted contact or the email
does not contain a valid code then the email is not received. A valid code
starts with a hyphen and ends with "X".
∀x. x ∈ MyInbox ⇒ from(x) ∈ MySafeSenderList ∨ (∃y. y ∈ subject(x) ∧ y ∈
L(-[a-z]+[0-9]X)).
