It depends on the structure of your site; you can modify
"regex-urlfilter.txt" to achieve this.

From the examples you gave, you can do this:

-^http://ww.mywebsite.com/[^/]*$

This will exclude http://ww.mywebsite.com/alpha, http://ww.mywebsite.com/beta
and http://ww.mywebsite.com/gamma.

*"- ^http://ww.mywebsite.com/.*/$"*
This will exclude any URL that ends with "/"
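
Putting the two rules together, here is a sketch of what the relevant part
of your regex-urlfilter.txt could look like. Note that order matters: the
first pattern that matches a URL decides whether it is accepted or rejected,
so the exclusions must come before the accept rule. The last two lines are
assumptions about the rest of your setup:

# exclude the section roots (nothing after the first path segment)
-^http://ww.mywebsite.com/[^/]*$

# exclude any URL ending with "/" (the article listings)
-^http://ww.mywebsite.com/.*/$

# accept everything else on the site, e.g. the article pages (assumption)
+^http://ww.mywebsite.com/

# reject anything else (assumption)
-.

One caveat to be aware of: these filters are applied when Nutch selects URLs
to fetch, so excluded pages are not fetched at all. If your seeds are the
site root or the section pages themselves, the article links would never be
discovered from them, so you may need to adjust the rules or your seeds.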

I would suggest you get familiar with regular expressions, in case you
aren't already.
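
If you want to check the patterns before running a crawl, here is a small
self-contained sketch using plain java.util.regex (not Nutch's
RegexURLFilter itself, so it only verifies the regexes; the class name and
URLs are just for illustration, taken from your examples):

import java.util.regex.Pattern;

// Sanity check of the two exclusion patterns against the example URLs.
public class FilterCheck {

    public static void main(String[] args) {
        Pattern sectionRoot =
                Pattern.compile("^http://ww.mywebsite.com/[^/]*$");
        Pattern trailingSlash =
                Pattern.compile("^http://ww.mywebsite.com/.*/$");

        String[] urls = {
                "http://ww.mywebsite.com/alpha",             // section root
                "http://ww.mywebsite.com/alpha/",            // listing page
                "http://ww.mywebsite.com/alpha/artcle1.html" // article page
        };

        for (String url : urls) {
            boolean excluded = sectionRoot.matcher(url).find()
                    || trailingSlash.matcher(url).find();
            // "-" means the URL would be excluded, "+" means it is kept
            System.out.println((excluded ? "-" : "+") + " " + url);
        }
    }
}

Running it should print "-" for the section root and the listing page, and
"+" for the article page.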

Remi

On Sun, Apr 1, 2012 at 6:27 PM, alessio crisantemi <
alessio.crisant...@gmail.com> wrote:

> Dear All,
> I would like to change my crawling operation, but I don't know how to do it.
>
> To crawl my website I used the following command:
>
> $ bin/nutch crawl urls -solr http://localhost:8983/solr -threads 3 -depth
> 35 -topN 10
>
> to crawl with Nutch and index the results in Solr.
>
>
>
> But I don't want to crawl the section pages of my website, only the
> individual pages.
>
> for example:
>
> Consider a site, www.mywebsite.com, composed of 3 sections:
>
> http://ww.mywebsite.com/alpha
>
> http://ww.mywebsite.com/beta
>
> http://ww.mywebsite.com/gamma
>
>
>
> So, among my results I want only the individual pages of my articles, not
> the lists of articles in these directories.
>
> So I would like, for example, the parsing of the files:
>
> http://ww.mywebsite.com/alpha/artcle1.html
>
> http://ww.mywebsite.com/alpha/artcle3.html
>
> ...
>
>
>
> and I don't want the parsing of the parent section:
>
> http://ww.mywebsite.com/alpha/
>
>
>
> How can I do this?
>
> Any suggestions?
>
> Sorry if this is not all clear.
>
> Thank you,
>
> alessio
>
