Hey Sebastian,
thanks. So far I have deleted the database and started a whole new crawl. I had seen that Jira issue about orphaned pages before. It is exactly what I'm looking for, but as the ticket is more than two years old, I assume it won't be fixed. :-(

Thanks
David

-----Original Message-----
From: Sebastian Nagel [mailto:wastl.na...@googlemail.com]
Sent: Friday, 28 July 2017 12:09
To: user@nutch.apache.org
Subject: Re: Crawling with nutch, check Links

Hi David,

the easiest way is to delete the CrawlDb and to start the crawl from scratch. Since it's a site crawl, this should be possible, at least from time to time. Then delete the documents from the index which haven't been updated.

A more sophisticated solution is not yet ready, see
https://issues.apache.org/jira/browse/NUTCH-1932

Best,
Sebastian

On 07/27/2017 10:11 AM, d.ku...@technisat.de wrote:
> Hey,
>
> currently I'm working on Nutch with Solr for our company pages.
>
> Assume the following situation:
> We have a website:
>
> www.mysite.lol
>
> On this site there is a link:
> www.mysite.lol/tespage/3512-1564/
>
> As you can see, there is a typo; it should be /testpage/:
>
> www.mysite.lol/testpage/3512-1564/
>
> As our framework doesn't care about the text before the ID, we could type
> anything we want and the page would still be displayed because of the ID.
> That is why both links work and there is no 404.
> If I change the link on the main page to the correct one, let Nutch crawl
> the site again, and send it to Solr, the old one is still found.
>
> So the link www.mysite.lol/tespage/3512-1564/ is still in the Nutch db,
> because the link is valid --> no 404.
> But no page on the site points to this URL any more. How do I tell Nutch to
> ignore pages that have no links pointing to them?
> Basically --> revalidating links and removing pages without links to them?
>
> Kind regards
> David Kumar
>
> Senior Software Engineer Java, B. Sc.
> Project Manager PIM
> Infotech Department
> TechniSat Digital GmbH
> Julius-Saxler-Straße 3
> TechniPark
> D-54550 Daun / Germany
>
> Tel.: + 49 (0) 6592 / 712 -2826
> Fax: + 49 (0) 6592 / 712 -2829
>
> www.technisat.com/de_DE/
> www.facebook.com/technisat
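
The last step of Sebastian's suggestion (deleting documents from the index which haven't been updated) could be sketched roughly as below. This is only an illustration, not something taken from the thread: it assumes the Solr documents carry a date field that is rewritten on every re-index (here called `tstamp`, an assumed name), and the helper `build_stale_delete` is hypothetical. It only builds the JSON body of a Solr delete-by-query request and sends nothing.

```python
import json
from datetime import datetime, timezone


def build_stale_delete(crawl_start: datetime, field: str = "tstamp") -> str:
    """Return a Solr JSON delete-by-query body for documents older than crawl_start.

    `field` is an assumption: it must be a date field that is rewritten
    each time a page is re-indexed.
    """
    # Solr expects UTC timestamps in ISO 8601 form with a trailing "Z".
    cutoff = crawl_start.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    # "[* TO cutoff}" matches every value strictly older than the cutoff.
    query = f"{field}:[* TO {cutoff}}}"
    return json.dumps({"delete": {"query": query}})


if __name__ == "__main__":
    # Hypothetical start time of the fresh crawl.
    start = datetime(2017, 7, 28, 12, 0, 0, tzinfo=timezone.utc)
    print(build_stale_delete(start))
```

The resulting body would be sent to the update handler of the Solr core with a JSON content type, followed by a commit. The core URL is not given in the thread, so it is left out here. It is probably wise to run the same query as a plain search first and check what it matches before deleting anything.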