Hey Sebastian,

Thanks. So far I have deleted the database and started a completely new crawl.
I had seen that Jira ticket about orphaned pages before. It is exactly what I'm
looking for, but as the ticket is more than 2 years old, I assume it won't be fixed. :-(

Thanks

David


-----Original Message-----
From: Sebastian Nagel [mailto:wastl.na...@googlemail.com] 
Sent: Friday, 28 July 2017 12:09
To: user@nutch.apache.org
Subject: Re: Crawling with nutch, check Links

Hi David,

the easiest way is to delete the CrawlDb and to start the crawl from scratch.
Since it's a site crawl this should be possible, at least, from time to time.
Then delete documents from the index which haven't been updated.
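
A minimal sketch of that last step, assuming the index is Solr and every
document carries a fetch timestamp field (assumed here to be called `tstamp`;
check your schema). The helper name and the cutoff are illustrative only. The
function only builds the delete-by-query request body and does not send it:

```python
import json
from datetime import datetime, timezone


def build_stale_delete(crawl_start, field="tstamp"):
    """Build a Solr JSON delete-by-query body for documents older than crawl_start.

    crawl_start must be a timezone-aware datetime marking the start of the
    fresh crawl. `field` is an assumed name for the fetch timestamp field.
    """
    # Solr expects UTC timestamps in ISO 8601 form with a trailing 'Z'.
    cutoff = crawl_start.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    # '[* TO cutoff}' is a range with an exclusive upper bound, so it matches
    # only documents last updated strictly before the cutoff.
    query = field + ":[* TO " + cutoff + "}"
    return json.dumps({"delete": {"query": query}})


body = build_stale_delete(datetime(2017, 7, 28, 0, 0, tzinfo=timezone.utc))
```

The resulting body would be posted as JSON to the update handler of your Solr
core, followed by a commit. Try it on a copy of the index first.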

A more sophisticated solution is not yet ready, see
  https://issues.apache.org/jira/browse/NUTCH-1932
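
To illustrate what "orphaned" means here (only the idea, not the approach of
NUTCH-1932 and not a Nutch API): a page is orphaned when it is known from an
earlier crawl but can no longer be reached from the seed URLs by following the
current links. A small sketch over a hypothetical in-memory link graph:

```python
from collections import deque


def find_orphans(known_urls, links, seeds):
    """Return the known URLs that are not reachable from the seeds.

    links maps a URL to the list of URLs it currently links to.
    """
    reachable = set(seeds)
    queue = deque(seeds)
    # Breadth-first walk over the current link graph, starting at the seeds.
    while queue:
        url = queue.popleft()
        for target in links.get(url, ()):
            if target not in reachable:
                reachable.add(target)
                queue.append(target)
    # Anything known from before but not reached now has lost its inlinks.
    return sorted(set(known_urls) - reachable)


home = "http://www.mysite.lol/"
good = "http://www.mysite.lol/testpage/3512-1564/"
bad = "http://www.mysite.lol/tespage/3512-1564/"
# After the fix, the main page links only to the correctly spelled URL.
orphans = find_orphans([home, good, bad], {home: [good]}, [home])
```

Here `orphans` contains only the misspelled `/tespage/` URL, which is the case
described in the original question.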

Best,
Sebastian

On 07/27/2017 10:11 AM, d.ku...@technisat.de wrote:
> Hey,
> 
> currently I'm working on nutch with solr for our company pages.
> 
> Assuming the following situation:
> We have a website:
> 
> http://www.mysite.lol
> 
> On this site there is a link:
> http://www.mysite.lol/tespage/3512-1564/
> 
> As you can see, there is a typo; it should be /testpage/:
> 
> http://www.mysite.lol/testpage/3512-1564/
> 
> As our framework doesn't care about the text before the ID, we could type 
> anything we want and the page would still be displayed because of the ID. 
> That is why both links are fine and there is no 404.
> If I change the link on the main page to the correct one, let Nutch crawl 
> the site again, and send it to Solr, the old one is still found.
> 
> So the link
> http://www.mysite.lol/tespage/3512-1564/
> is still in the Nutch db, because the link is valid --> no 404. 
> But the main page no longer points to this page. How do I tell Nutch to 
> ignore pages that no longer have any links pointing to them?
> Basically --> revalidate links and remove pages without links to them?
> 
> 
> 
> Best regards
> David Kumar
> 
> Senior Software Engineer Java, B. Sc.
> Project Manager PIM
> Infotech Department
> TechniSat Digital GmbH
> Julius-Saxler-Straße 3
> TechniPark
> D-54550 Daun / Germany
> 
> Tel.: + 49 (0) 6592 / 712 -2826
> Fax: + 49 (0) 6592 / 712 -2829
> 
> www.technisat.com/de_DE/
> www.facebook.com/technisat
> 
> 
