Hi,

Thanks for the answer.

I am using Nutch version 1.7, and after some tests it appears that db.ignore.external.links was indeed the problem, even though the missing links only pointed to a subdomain of the current site. Is there a way to allow different subdomains while still forbidding external links, or should I write my own regex?
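For instance, would it work to set db.ignore.external.links back to false and rely on a rule like the following in regex-urlfilter.txt? (Just a sketch; example.com stands in for the real site.)

  # accept the site and any of its subdomains
  +^https?://([a-z0-9-]+\.)*example\.com/
  # reject everything else
  -.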
Best Regards,

-----Original Message-----
From: Walter Tietze [mailto:[email protected]]
Sent: Wednesday, 16 April 2014 19:52
To: [email protected]; [email protected]
Subject: Re: Don't fetch all urls in a page

Hi,

there might be several reasons why it is not working. I think you have to share a bit more information. Which Nutch version are you using? Can you provide your configuration file?

I hope you run at least two crawl cycles when trying to fetch the pages linked from your page. Nutch first records newly found URLs in the crawldb and can therefore fetch them at the earliest in the second fetch cycle:

  Inject -> ( URL Select -> URL Partition -> Fetch -> CrawlDb Update ) -> LinkDb Invert -> Index
              ^___________________ one pass per crawl cycle ___________________|

Did you set the '-depth' option?
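For example (the crawl directory, the seed folder 'urls', and -topN are placeholders), two cycles with the individual tools look roughly like this:

  bin/nutch inject crawl/crawldb urls
  for i in 1 2; do    # at least two cycles, so newly found outlinks get fetched
    bin/nutch generate crawl/crawldb crawl/segments -topN 1000
    segment=`ls -d crawl/segments/* | tail -1`
    bin/nutch fetch $segment
    bin/nutch parse $segment
    bin/nutch updatedb crawl/crawldb $segment
  done
  bin/nutch invertlinks crawl/linkdb -dir crawl/segments

The all-in-one command 'bin/nutch crawl urls -dir crawl -depth 2' does roughly the same, with -depth giving the number of cycles.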
Another reason might be your configuration values. Please take a look into nutch-default.xml. Some interesting values (as of Nutch version 1.5) for your problem are:

  <name>http.content.limit</name>
  <name>db.update.additions.allowed</name>
  <name>db.ignore.internal.links</name>
  <name>db.ignore.external.links</name>
  <name>db.max.outlinks.per.page</name>

These values can change the behaviour of your crawler. With them you can regulate whether links are inserted into the crawldb or not.

Is the configured limit on the fetched content length enough for your page? If you have a very large page, the links might be near the end of the document and therefore won't be found.
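To change them, copy a property from nutch-default.xml into conf/nutch-site.xml and adjust the value there. A sketch (the values shown are examples, not recommendations):

  <property>
    <name>db.ignore.external.links</name>
    <value>false</value>
  </property>
  <property>
    <name>http.content.limit</name>
    <value>-1</value>  <!-- -1 disables truncation of fetched content -->
  </property>
  <property>
    <name>db.max.outlinks.per.page</name>
    <value>-1</value>  <!-- -1 keeps all outlinks instead of only the first 100 -->
  </property>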
Have you configured some other URL filter? I, for instance, also use the domain filter with 'domain-urlfilter.txt' to restrict my search space to pages from specific domains.

All configuration values have a description field which should help you understand their intention. Please check these values first and, if necessary, give some more information!

Cheers, Walter

On 16.04.2014 18:48, Zabini wrote:
> Hi,
>
> I am facing a problem with the URLs Nutch fetches.
>
> I have a page with several URLs within it, but Nutch does not fetch them.
> They are allowed in the regex-urlfilter, and those URLs work fine if I
> put them in my URL seed list.
>
> Does anyone have any hint on what to do?
>
> Best Regards,
> Zabini
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Don-t-fetch-all-urls-in-a-page-tp4131531.html
> Sent from the Nutch - User mailing list archive at Nabble.com.

--
--------------------------------
Walter Tietze
Senior Software Developer

Neofonie GmbH
Robert-Koch-Platz 4
10115 Berlin

T: +49 30 246 27 318
[email protected]
http://www.neofonie.de

Commercial register Berlin-Charlottenburg: HRB 67460
Managing Director: Thomas Kitlitschko
--------------------------------