Ah, check this out:

markus@chillout:~$ curl http://lochem.raadsinformatie.nl/robots.txt
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>302 Found</title>
</head><body>
<h1>Found</h1>
<p>The document has moved <a href="http://lochem.raadsinformatie.nl/robots">here</a>.</p>
</body></html>

markus@chillout:~$ curl http://lochem.raadsinformatie.nl/robots
User-agent: *
Disallow: /

So robots.txt redirects to /robots, which disallows everything for every user agent. That would explain why fetching produces nothing to parse.
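If you want to verify locally what those rules mean for a crawler, here is a minimal sketch using Python's urllib.robotparser, applied to the exact rules served at /robots (the agent name "nutch-test" is just an illustrative placeholder, not Nutch's actual agent string):

```python
from urllib import robotparser

# The rules served at /robots, as shown in the curl output above.
rules = """User-agent: *
Disallow: /
"""

rp = robotparser.RobotFileParser()
rp.parse(rules.splitlines())

# "Disallow: /" under "User-agent: *" blocks every path for every agent,
# so the seed URL is off-limits too.
print(rp.can_fetch("nutch-test",
                   "http://lochem.raadsinformatie.nl/sitemap/meetings/2013/"))
# prints: False
```

A robots-aware fetcher like Nutch will honor this and silently skip the whole site, which matches the "nothing reported for parsing" symptom.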
On Tuesday 30 September 2014 11:13:34 Jigal van Hemert | alterNET internet BV wrote:
> Hi,
>
> 2014-09-17 16:43 GMT+02:00 Jigal van Hemert | alterNET internet BV
> <[email protected]>:
> > Hi,
> >
> > 2014-09-16 16:15 GMT+02:00 Markus Jelsma <[email protected]>:
> >> You can check the bin/nutch parsechecker tool to see if the URLs are
> >> properly extracted from webpages. Then use the bin/nutch
> >> org.apache.nutch.net.URLFilterChecker -allCombined tool to see if some
> >> filter removes your URLs. They may also be normalized to something
> >> undesirable, but that's not usually the case.
> >
> > Nice tools! Didn't know about them.
> >
> > Output from parsechecker: http://pastebin.com/EJYNVuVx
> >
> > Then the URLFilterChecker:
> >
> > echo "http://lochem.raadsinformatie.nl/sitemap/meetings/2013/" |
> > bin/nutch org.apache.nutch.net.URLFilterChecker -allCombined
> > Checking combination of all URLFilters available
> > +http://lochem.raadsinformatie.nl/sitemap/meetings/2013/
> >
> > Anything strange in this output?
>
> I did another Nutch configuration the other day and it successfully
> indexed a couple of sites. The mystery remains why, on the same server
> with the same software (just a slightly different configuration, but
> nothing significant), one set of seed URLs still doesn't want to work.
> All checks (see above) seem to run correctly, but after the fetching
> part nothing is reported for parsing, and subsequent runs just say they
> don't have anything to do.
>
> Any pointers to debugging the fetching process and finding out what is
> queued for parsing would be highly appreciated.
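On the debugging question: a sketch, assuming a Nutch 1.x layout with the crawldb under crawl/crawldb and segments under crawl/segments (adjust the paths and the example segment name to your own setup). The readdb tool shows the per-status counts and the stored state of a single URL, and readseg shows what a fetch round actually produced:

```shell
# Overall status counts in the crawldb (db_unfetched, db_fetched,
# db_gone, ...). A site blocked by robots.txt typically never moves
# past unfetched/gone, which matches "nothing to parse".
bin/nutch readdb crawl/crawldb -stats

# Stored state of one specific seed URL: status, fetch time, metadata.
bin/nutch readdb crawl/crawldb -url "http://lochem.raadsinformatie.nl/sitemap/meetings/2013/"

# Dump a fetched segment to see what was fetched and queued for
# parsing (the segment name here is an example; use your own).
bin/nutch readseg -dump crawl/segments/20140930111334 segdump -nocontent
```

Between -stats, -url, and the segment dump you can usually see exactly at which step a URL dropped out of the cycle.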

