See comments below.

On Thu, Jul 7, 2011 at 4:31 PM, Cam Bazz <[email protected]> wrote:

> Hello Lewis,
>
> Pardon me for the non-verbose description. I have a set of URLs, namely
> product URLs, in the range of millions.
>

Firstly (this is just a suggestion), I assume that you wish Nutch to fetch
the full page content. Ensure that http.content.limit is set to an
appropriate value to allow this (see the configuration sketch appended at
the end of this message).

> So I want to write my URLs in a flat file, and have Nutch crawl them
> to depth = 1
>

As you describe, you have various seed directories, so I assume that
crawling such a large set of seeds will be a recurring task. IMHO I would
save myself the manual work of running the jobs by hand and write a bash
script to do this for me; this will also let you schedule a once-a-day
update of your crawldb, linkdb, Solr index and so forth. There are plenty
of scripts which have been tested and used throughout the community here:
http://wiki.apache.org/nutch/Archive%20and%20Legacy#Script_Administration
(a rough sketch of such a script, plus the equivalent one-shot crawl
invocation, is also appended at the end of this message).

> However, I might remove URLs from this list, or add new ones. I also
> would like Nutch to revisit each site every day.
>

Check out nutch-site.xml for the crawldb fetch intervals; these values can
be used to accommodate how frequently the various pages change. Once you
have removed URLs (this is going to be a laborious and extremely tedious
task if done manually), you would simply run your script again.

> I would like removed URLs to be deleted, and new ones to be reinjected
> each time Nutch starts.
>

With regards to deleting URLs from your crawldb (and subsequently your
index), I am not sure exactly how best to do this. Can you justify
completely deleting the URLs from the data store? What happens if you add
the URL again the next day? I'm not sure this is a sustainable method for
maintaining your data store/index.

>
> Best Regards,
> -C.B.
>
> On Thu, Jul 7, 2011 at 6:21 PM, lewis john mcgibbney
> <[email protected]> wrote:
> > Hi C.B.,
> >
> > This is way too vague. We really require more information regarding
> > roughly what kind of results you wish to get. It would be a near
> > impossible task for anyone to try and specify a solution to this
> > open-ended question.
> >
> > Please elaborate.
> >
> > Thank you
> >
> > On Thu, Jul 7, 2011 at 12:56 PM, Cam Bazz <[email protected]> wrote:
> >
> >> Hello,
> >>
> >> I have a case where I need to crawl a list of exact URLs, somewhere
> >> in the range of 1 to 1.5M URLs.
> >>
> >> I have written those URLs in numerous files under /home/urls, i.e.
> >> /home/urls/1, /home/urls/2
> >>
> >> Then, by using the crawl command, I am crawling to depth=1.
> >>
> >> Are there any recommendations or general guidelines that I should
> >> follow when making Nutch just fetch and index a list of URLs?
> >>
> >>
> >> Best Regards,
> >> C.B.
> >>
> >
> >
> > --
> > *Lewis*
>

--
*Lewis*
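For reference, the nutch-site.xml overrides touched on above might look
roughly like the following. The values are only illustrative assumptions:
-1 removes the default cap on fetched content, and 86400 seconds
corresponds to the once-a-day re-fetch you asked about.

<!-- conf/nutch-site.xml: example overrides, adjust values to suit -->
<configuration>
  <property>
    <name>http.content.limit</name>
    <value>-1</value>
    <description>-1 disables truncation so the full page content is fetched.</description>
  </property>
  <property>
    <name>db.fetch.interval.default</name>
    <value>86400</value>
    <description>Default re-fetch interval in seconds (here: one day).</description>
  </property>
</configuration>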

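For the depth-1 crawl itself, the one-shot crawl command from your original
mail would look roughly like this. All paths and the Solr URL are
placeholders, -topN should be at least as large as your seed list, and the
-solr option assumes Nutch 1.3 (on older releases you would run solrindex
separately, as in the script further below).

#!/bin/bash
# One-pass crawl: fetch only the injected seed URLs (depth 1) and index them.
NUTCH_HOME=/opt/apache-nutch-1.3   # example install location
SEED_DIR=/home/urls                # flat seed files, e.g. /home/urls/1, /home/urls/2
CRAWL_DIR=/home/crawl              # where crawldb/linkdb/segments will live

$NUTCH_HOME/bin/nutch crawl $SEED_DIR \
    -dir $CRAWL_DIR \
    -depth 1 \
    -topN 2000000 \
    -solr http://localhost:8983/solr/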

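And a very rough sketch of the kind of daily script referred to above,
using the individual Nutch 1.x commands rather than the all-in-one crawl
tool. Again, the paths, Solr URL and cron line are assumptions, error
handling is omitted, and the exact solrindex arguments differ slightly
between releases; the scripts on the wiki page linked above are more
complete.

#!/bin/bash
# daily-crawl.sh -- re-inject the current seed list, fetch whatever is due,
# and refresh the Solr index. Schedule via cron, for example:
#   0 3 * * * /home/crawl/daily-crawl.sh >> /home/crawl/crawl.log 2>&1

NUTCH_HOME=/opt/apache-nutch-1.3
SEED_DIR=/home/urls
CRAWL_DIR=/home/crawl
SOLR_URL=http://localhost:8983/solr/

cd $NUTCH_HOME || exit 1

# 1. (Re)inject the seed list -- new URLs are added, existing ones are kept.
bin/nutch inject $CRAWL_DIR/crawldb $SEED_DIR

# 2. Generate a fetch list of everything due for (re)fetching.
bin/nutch generate $CRAWL_DIR/crawldb $CRAWL_DIR/segments -topN 2000000
SEGMENT=$(ls -d $CRAWL_DIR/segments/2* | tail -1)
# (a real script should check that generate actually produced a new segment)

# 3. Fetch, parse and fold the results back into the crawldb.
bin/nutch fetch $SEGMENT
bin/nutch parse $SEGMENT
bin/nutch updatedb $CRAWL_DIR/crawldb $SEGMENT

# 4. Rebuild the linkdb and push the new segment to Solr.
bin/nutch invertlinks $CRAWL_DIR/linkdb -dir $CRAWL_DIR/segments
bin/nutch solrindex $SOLR_URL $CRAWL_DIR/crawldb $CRAWL_DIR/linkdb $SEGMENT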