Thank you Lewis, this has been very illuminating, especially regarding deleting documents.
Best.

On Thu, Jul 7, 2011 at 6:51 PM, lewis john mcgibbney <[email protected]> wrote:
> See comments below
>
> On Thu, Jul 7, 2011 at 4:31 PM, Cam Bazz <[email protected]> wrote:
>
>> Hello Lewis,
>>
>> Pardon me for the non-verbose description. I have a set of URLs, namely
>> product URLs, in the range of millions.
>>
>
> Firstly (this is just a suggestion), I assume that you wish Nutch to
> fetch the full page content. Ensure that http.content.limit is set to an
> appropriate limit to allow this.
>
>
>> So I want to write my URLs in a flat file and have Nutch crawl them
>> to depth = 1.
>>
>
> As you describe, you have various seed directories, so crawling a large
> set of seeds will be a recurring task. IMHO I would save myself the manual
> work of running the jobs and write a bash script to do this for me; this
> will also enable you to schedule a once-a-day update of your crawldb,
> linkdb, Solr index and so forth. There are plenty of scripts which have
> been tested and used throughout the community here:
> http://wiki.apache.org/nutch/Archive%20and%20Legacy#Script_Administration
>
>
>> However, I might remove URLs from this list, or add new ones. I also
>> would like Nutch to revisit each site every day.
>>
>
> Check out nutch-site for crawldb fetch intervals; these values can be used
> to accommodate the dynamism of various pages. Once you have removed URLs
> (this is going to be a laborious and extremely tedious task if done
> manually), you would simply run your script again.
>
>> I would like removed URLs to be deleted, and new ones to be reinjected
>> each time Nutch starts.
>>
>
> With regards to deleting URLs in your crawldb (and subsequently index), I am
> not sure of this exactly. Can you justify completely deleting the URLs from
> the data store? What happens if you add the URL in again the next day? I'm
> not sure if this is a sustainable method for maintaining your data
> store/index.
>
>>
>> Best Regards,
>> -C.B.
>>
>> On Thu, Jul 7, 2011 at 6:21 PM, lewis john mcgibbney
>> <[email protected]> wrote:
>> > Hi C.B.,
>> >
>> > This is way too vague. We really require more information regarding
>> > roughly what kind of results you wish to get. It would be a near
>> > impossible task for anyone to try and specify a solution to this
>> > open-ended question.
>> >
>> > Please elaborate.
>> >
>> > Thank you
>> >
>> > On Thu, Jul 7, 2011 at 12:56 PM, Cam Bazz <[email protected]> wrote:
>> >
>> >> Hello,
>> >>
>> >> I have a case where I need to crawl a list of exact URLs, somewhere
>> >> in the range of 1 to 1.5M URLs.
>> >>
>> >> I have written those URLs in numerous files under /home/urls, i.e.
>> >> /home/urls/1 /home/urls/2
>> >>
>> >> Then by using the crawl command I am crawling to depth=1.
>> >>
>> >> Are there any recommendations or general guidelines that I should
>> >> follow when making Nutch just fetch and index a list of URLs?
>> >>
>> >>
>> >> Best Regards,
>> >> C.B.
>> >>
>> >
>> >
>> > --
>> > *Lewis*
>> >
>>
>
>
> --
> *Lewis*
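Below is a minimal sketch of the kind of daily bash driver script suggested above, assuming a local (single-machine) Nutch 1.x runtime run from its install directory, seed files under /home/urls, crawl data under /home/crawl, and a Solr instance at http://localhost:8983/solr/. The crawl directory and Solr URL are illustrative placeholders, not values taken from the thread.

#!/bin/bash
# Sketch of a once-a-day recrawl of a fixed seed list at depth 1.
SEED_DIR=/home/urls                      # flat files of product URLs, e.g. /home/urls/1, /home/urls/2
CRAWL_DIR=/home/crawl                    # placeholder location for crawldb/linkdb/segments
SOLR_URL=http://localhost:8983/solr/     # placeholder Solr endpoint

# (Re)inject the current seed list into the crawldb; already-known URLs are kept.
bin/nutch inject $CRAWL_DIR/crawldb $SEED_DIR

# Generate a fetch list from URLs that are due for (re)fetching and pick the new segment.
bin/nutch generate $CRAWL_DIR/crawldb $CRAWL_DIR/segments
SEGMENT=$CRAWL_DIR/segments/$(ls $CRAWL_DIR/segments | sort | tail -1)

# Fetch, parse (separate step; skip if fetcher.parse is true), and update the crawldb.
bin/nutch fetch $SEGMENT
bin/nutch parse $SEGMENT
bin/nutch updatedb $CRAWL_DIR/crawldb $SEGMENT

# Rebuild the linkdb and push the segment into Solr.
bin/nutch invertlinks $CRAWL_DIR/linkdb -dir $CRAWL_DIR/segments
bin/nutch solrindex $SOLR_URL $CRAWL_DIR/crawldb $CRAWL_DIR/linkdb $SEGMENT

Run from cron once a day, this covers the inject / generate / fetch / parse / updatedb / invertlinks / solrindex cycle; -topN limits, locking and error handling are left out for brevity.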

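The two nutch-site.xml properties touched on above (full page content and the daily refetch) would look roughly like the snippet below. The values are only examples: -1 removes the per-page content cap entirely, and 86400 seconds gives a once-a-day refetch interval.

<!-- Sketch of conf/nutch-site.xml overrides; values are illustrative. -->
<property>
  <name>http.content.limit</name>
  <!-- -1 = no limit, so full page content is fetched -->
  <value>-1</value>
</property>
<property>
  <name>db.fetch.interval.default</name>
  <!-- seconds between refetches; 86400 = one day -->
  <value>86400</value>
</property>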

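On the open question of deleting removed URLs, which the thread leaves unresolved: one possible approach (my own assumption, not something recommended above) is to add exclusion patterns for the retired URLs to conf/regex-urlfilter.txt, rewrite the crawldb through the filters with mergedb, and delete the matching documents from Solr by query. A rough sketch, with placeholder paths and an example.com URL standing in for a real product URL:

# Re-filter the crawldb so URLs excluded by regex-urlfilter.txt are dropped.
bin/nutch mergedb /home/crawl/crawldb_filtered /home/crawl/crawldb -filter
rm -r /home/crawl/crawldb && mv /home/crawl/crawldb_filtered /home/crawl/crawldb

# Delete the same documents from the Solr index (field name depends on your schema).
curl "http://localhost:8983/solr/update?commit=true" \
  -H "Content-Type: text/xml" \
  --data-binary '<delete><query>url:"http://www.example.com/retired-product"</query></delete>'

Whether this is worth doing at all is exactly the sustainability question raised above.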