On Wed, May 4, 2011 at 5:20 PM, Kelvin <[email protected]> wrote:

> Hello,
>
> I would like to crawl wikipedia using Nutch, but as it is too large, I
> would only like to crawl pages that are related to a particular subject.
>
> For example, I would like to crawl for webpages of wikipedia that contain
> the term "Football". Is this possible using Nutch?
>

How will you know that a page contains "football" before you fetch it,
parse it, and analyze it?

In the worst case (and, I think, also the best case) you will have to fetch
all Wikipedia pages. You can choose what to index, but not what to crawl
(EXPERTS, CORRECT ME), because, as I said, you need to analyze the whole
content of a page to figure out whether it contains "football".
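
To make the point concrete, here is a rough Python sketch of the
"fetch everything, then decide what to index" idea. It is not Nutch code
and does not use any Nutch API; the function names, the example URLs and
the page texts are all made up for illustration, and it assumes the pages
have already been fetched and parsed into plain text.

```python
import re


def contains_term(text, term):
    """Return True if `term` occurs in `text` as a whole word, ignoring case."""
    pattern = r"\b" + re.escape(term) + r"\b"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None


def select_for_index(fetched_pages, term):
    """Keep only the URLs whose already-fetched text mentions `term`.

    `fetched_pages` maps URL -> parsed page text, so the filtering can
    only happen after the fetch and parse steps have been paid for.
    """
    return [url for url, text in fetched_pages.items() if contains_term(text, term)]


# Hypothetical stand-ins for pages that were already fetched and parsed.
pages = {
    "http://example.org/wiki/A": "Association football is a team sport.",
    "http://example.org/wiki/B": "Chess is a board game.",
}

print(select_for_index(pages, "Football"))
```

The filter runs on the page text, which is why it cannot be applied before
the page has been downloaded.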




>
> Thank you for your kind help.
>



-- 
Regards,
K. Gabriele

--- unchanged since 20/9/10 ---
P.S. If the subject contains "[LON]" or the addressee acknowledges the
receipt within 48 hours then I don't resend the email.
subject(this) ∈ L(LON*) ∨ ∃x. (x ∈ MyInbox ∧ Acknowledges(x, this) ∧ time(x)
< Now + 48h) ⇒ ¬resend(I, this).

If an email is sent by a sender that is not a trusted contact or the email
does not contain a valid code then the email is not received. A valid code
starts with a hyphen and ends with "X".
∀x. x ∈ MyInbox ⇒ from(x) ∈ MySafeSenderList ∨ (∃y. y ∈ subject(x) ∧ y ∈
L(-[a-z]+[0-9]X)).
