Hi,

Thanks for the answer.

I am using version 1.7 of Nutch.
After some tests, it appears that db.ignore.external.links was the
problem, even though the links in question pointed only to a subdomain of
the current site.

Is there a way to allow different subdomains while still forbidding
external links, or should I write my own regex?
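
For example, I picture a rule along these lines in regex-urlfilter.txt,
with db.ignore.external.links switched back to false (example.com just
standing in for my actual site):

   # accept any subdomain of example.com (placeholder domain)
   +^https?://([a-z0-9-]+\.)*example\.com/
   # reject everything else
   -.

Would that be the recommended way?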

Best Regards,


-----Original Message-----
From: Walter Tietze [mailto:[email protected]]
Sent: Wednesday, April 16, 2014 7:52 PM
To: [email protected]; [email protected]
Subject: Re: Don't fetch all urls in a page


Hi,


there might be several reasons why it is not working. I think you have to
share a bit more information.


Which Nutch version are you using? Can you provide your configuration file?


I hope you ran at least two crawl cycles when trying to fetch the pages
behind the links on your page. Nutch first adds newly found URLs to the
crawldb and can therefore fetch these links at the earliest in a second
fetch cycle.

Inject -> ( URL Select -> URL Partition -> Fetch -> CrawlDb Update ) -> LinkDb Invert -> Index
            ^____________________________________________________|
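
In Nutch 1.x one round of that inner cycle looks roughly like this (a
sketch; the directory names and the segment timestamp are placeholders):

   bin/nutch inject crawl/crawldb urls
   bin/nutch generate crawl/crawldb crawl/segments
   s=crawl/segments/20140416195200   # the segment just generated
   bin/nutch fetch $s
   bin/nutch parse $s
   bin/nutch updatedb crawl/crawldb $s
   # run generate/fetch/parse/updatedb once more so the newly
   # discovered links are fetched, then:
   bin/nutch invertlinks crawl/linkdb -dir crawl/segments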


Did you set the '-depth' option?
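
If you use the (now deprecated) one-step crawl command of Nutch 1.x, it
takes the option directly, for example (directory names are placeholders):

   bin/nutch crawl urls -dir crawl -depth 2 -topN 1000

With '-depth 2' Nutch runs two fetch cycles, which is the minimum needed
to follow links found in your seed pages.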



Another reason might be your configuration values. Please take a look at
nutch-default.xml.


Some interesting values (as of Nutch version 1.5) for your problem are,
for example:

   <name>http.content.limit</name>
   <name>db.update.additions.allowed</name>
   <name>db.ignore.internal.links</name>
   <name>db.ignore.external.links</name>
   <name>db.max.outlinks.per.page</name>
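
Overrides for these belong into your nutch-site.xml, for example (a
sketch; the values are only illustrations, not recommendations):

   <property>
     <name>db.ignore.external.links</name>
     <value>false</value>  <!-- false keeps external outlinks in the crawldb -->
   </property>
   <property>
     <name>db.max.outlinks.per.page</name>
     <value>-1</value>  <!-- a negative value means: process all outlinks -->
   </property>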


These values change the behaviour of your crawler. With them you can
control whether links are inserted into the crawldb at all. Is the
configured limit on the fetched content length large enough for your page?
If you have a very large page, the links might be near the end of the
document and would therefore not be found.
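
In that case you can raise or disable the limit in nutch-site.xml (a
sketch; the default is 65536 bytes):

   <property>
     <name>http.content.limit</name>
     <value>-1</value>  <!-- a negative value disables truncation -->
   </property>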


Have you configured any other URL filters? I, for instance, also use the
domain filter with its 'domain-urlfilter.txt' to restrict my crawl to
pages from specific domains.
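
Such a file simply lists one host, domain or domain suffix per line, for
example (placeholder domains):

   # conf/domain-urlfilter.txt, read by the urlfilter-domain plugin
   example.com
   blog.example.org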


All configuration values have a description field that should help you
understand their purpose.


Please check these values first and, if necessary, give some more
information!




Cheers, Walter




On 16.04.2014 at 18:48, Zabini wrote:
> Hi,
>
> I am facing a problem with the URLs Nutch fetches.
>
> I have a page with several URLs in it, but Nutch does not fetch them.
> They are allowed in regex-urlfilter.txt, and those URLs work fine if I
> put them in my seed list.
>
> Does anyone have a hint on what to do?
>
> Best Regards,
> Zabini
>
>
>


-- 

--------------------------------
Walter Tietze
Senior Software Developer

Neofonie GmbH
Robert-Koch-Platz 4
10115 Berlin

T: +49 30 246 27 318

[email protected]
http://www.neofonie.de

Handelsregister
Berlin-Charlottenburg: HRB 67460
  
Geschäftsführung
Thomas Kitlitschko
--------------------------------
