Do you have an RSS feed of all 700 pages that you could use as the
input (the starting page)? Or could you just generate the list yourself
and give them all as inputs?
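
If the paging URLs follow a predictable pattern, a few lines of script can
build that seed list for you. The URL below is only a guess at what your
board might use, so substitute the real pattern:

    # write one seed URL per index page of the board (URL pattern is hypothetical)
    with open("urls/seed.txt", "w") as f:
        for page in range(1, 71):
            f.write("http://example.com/board/list?page=%d\n" % page)

Then crawl from that seed file with a small depth, since every index page
is already a starting point. If the real URLs contain "?", see the filter
note further down.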

On 6/18/10, eric park <[email protected]> wrote:
> Hello, Alex
>
> Thank you for your help. The problem is that I cannot set the crawler depth
> to 70: it would take forever crawling unnecessary web pages. The bulletin
> board I'm trying to crawl is divided into 70 index pages, each listing 10
> pages. When I set the crawler depth to 5 and crawl, it crawls only 50
> pages. I'm looking for a way to crawl a bulletin board containing 700 pages
> recursively without setting the crawler depth to 70. I would appreciate any
> ideas or help.
>
> Thank you.
>
> 2010/6/18 Alex McLintock <[email protected]>
>
>> Hello Eric,
>>
>> I'm not sure I see your problem. There is nothing special about a
>> bulletin board compared to any other website. Here are some ideas
>> which may help.
>>
>> Have you iterated the "generate list of urls, crawl them, index them"
>> stage or have you only run it once?
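>>
>> (If you are running the steps by hand rather than the one-shot "crawl"
>> command, the cycle with Nutch 1.x looks roughly like this; the paths and
>> the -topN value are just placeholders:
>>
>>   bin/nutch inject crawl/crawldb urls
>>   bin/nutch generate crawl/crawldb crawl/segments -topN 1000
>>   s=`ls -d crawl/segments/* | tail -1`
>>   bin/nutch fetch $s
>>   bin/nutch parse $s
>>   bin/nutch updatedb crawl/crawldb $s
>>
>> Repeating the generate/fetch/parse/updatedb steps follows links one
>> level deeper each time, so N passes behave like a crawl of depth N.)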
>>
>> By default Nutch will ignore URLs with "?" in them, since the "?"
>> usually indicates a query parameter (dynamically generated content).
>> This is generally a wise thing to do, but you can override it by
>> examining the regular expressions used to filter pages (find the regex
>> config files).
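>>
>> For instance, a typical default conf/regex-urlfilter.txt has a line like
>>
>>   # skip URLs containing certain characters as probable queries, etc.
>>   -[?*!@=]
>>
>> If the board's paging links look like list.php?page=3, that rule throws
>> them away. Loosening it (for example to -[*!@]) lets "?" and "=" through;
>> check your own copy, since the defaults differ between versions.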
>>
>>
>> On 18 June 2010 06:29, eric park <[email protected]> wrote:
>> > Hi, I'm trying to crawl a bulletin board containing about 700 pages. I
>> > set the nutch crawler depth to 5, ran the crawler and only crawled about
>> > 60 pages. I don't think Nutch crawls bulletin boards recursively. Has
>> > anyone found a way to crawl a bulletin board recursively?
>> >
>> > Thank you
>> >
>>
>
