Re: Can I custom crawl using Nutch?

Gabriele Kahlout Wed, 04 May 2011 09:42:56 -0700

On Wed, May 4, 2011 at 6:22 PM, Kelvin <[email protected]> wrote:

> Hi Gabriele,
>
> Thank you for your help. I am sorry, I am a newbie to nutch. If I crawl the
> whole wikipedia, the whole wikipedia will be stored in the crawldb of my
> server?
>


I think so (I'm also a newbie).

>
>
> And this will take up a very big space?
>

Less than the actual size of the site, perhaps, but yes. However, you only
need to keep the index and can empty the crawldb. I've posted a script
<http://wiki.apache.org/nutch/Whole-Web%20Crawling%20incremental%20script>
which crawls and indexes incrementally. You crawl (fetch and parse) 10
pages and then you index them. At that point you no longer need the crawldb
and it can be deleted. At indexing time, in the parser, or somewhere in
between, you implement your football-selecting logic and choose what makes
it into the index.
This way your crawldb's maximum size is that of 10 pages.
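
To make the idea concrete, here is a minimal sketch of the batch-then-filter
loop described above. It is plain Python, not Nutch's API and not the wiki
script itself: the names (mentions_term, batches, crawl_incrementally) are
hypothetical, the "pages" are assumed to be already fetched and parsed into
(url, text) pairs, and a dict stands in for the index.

```python
import re
from typing import Callable, Dict, Iterable, Iterator, List, Tuple

# Assumption for this sketch: a page is (url, already-parsed plain text).
Page = Tuple[str, str]


def mentions_term(text: str, term: str = "football") -> bool:
    """Return True if `term` occurs as a whole word, ignoring case."""
    pattern = r"\b" + re.escape(term) + r"\b"
    return re.search(pattern, text, re.IGNORECASE) is not None


def batches(pages: Iterable[Page], size: int) -> Iterator[List[Page]]:
    """Yield consecutive lists of at most `size` pages."""
    batch: List[Page] = []
    for page in pages:
        batch.append(page)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:  # final, possibly shorter, batch
        yield batch


def crawl_incrementally(
    pages: Iterable[Page],
    index: Dict[str, str],
    batch_size: int = 10,
    keep: Callable[[str], bool] = mentions_term,
) -> int:
    """Index matching pages batch by batch.

    Returns the largest number of pages held at once, which is what
    bounds the temporary storage in this scheme.
    """
    peak = 0
    for batch in batches(pages, batch_size):
        peak = max(peak, len(batch))
        for url, text in batch:
            if keep(text):  # the "football selecting logic"
                index[url] = text
        # The batch goes out of scope here; only the index is kept.
    return peak
```

Every page is still fetched and looked at; the saving is only in what is
kept afterwards, since at most batch_size pages are held at any one time.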


> I also need to crawl youtube, to look for videos whose metatags contain
> "Football", so this will be very large too?
>
>
> Best regards,
> Kelvin
>
>
>
> ________________________________
> From: Gabriele Kahlout <[email protected]>
> To: [email protected]; Kelvin <[email protected]>
> Sent: Wednesday, 4 May 2011 11:34 PM
> Subject: Re: Can I custom crawl using Nutch?
>
> On Wed, May 4, 2011 at 5:20 PM, Kelvin <[email protected]> wrote:
>
> > Hello,
> >
> > I would like to crawl wikipedia using Nutch, but as it is too large, I
> > would only like to crawl pages that are related to a particular subject.
> >
> > For example, I would like to crawl for webpages of wikipedia that contain
> > the term "Football". Is this possible using Nutch?
> >
>
> How will you know that a page contains "football" before you fetch it,
> parse it, and analyze it?
>
> In the worst case (and I think also the best case) you will have to fetch
> all Wikipedia pages. You can choose what to index, but not what to crawl
> (EXPERTS CORRECT ME), because as I said you need to analyze the whole page
> content to figure out if it contains "football".
>
>
>
>
> >
> > Thank you for your kind help.
> >
>
>
>
> --
> Regards,
> K. Gabriele
>
> --- unchanged since 20/9/10 ---
> P.S. If the subject contains "[LON]" or the addressee acknowledges the
> receipt within 48 hours then I don't resend the email.
> subject(this) ∈ L(LON*) ∨ ∃x. (x ∈ MyInbox ∧ Acknowledges(x, this) ∧
> time(x) < Now + 48h) ⇒ ¬resend(I, this).
>
> If an email is sent by a sender that is not a trusted contact or the email
> does not contain a valid code then the email is not received. A valid code
> starts with a hyphen and ends with "X".
> ∀x. x ∈ MyInbox ⇒ from(x) ∈ MySafeSenderList ∨ (∃y. y ∈ subject(x) ∧ y ∈
> L(-[a-z]+[0-9]X)).
>



-- 
Regards,
K. Gabriele

