Ah, check this out — the site's robots.txt redirects to /robots, which disallows everything:

markus@chillout:~$ curl http://lochem.raadsinformatie.nl/robots.txt
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>302 Found</title>
</head><body>
<h1>Found</h1>
<p>The document has moved <a 
href="http://lochem.raadsinformatie.nl/robots">here</a>.</p>
</body></html>
markus@chillout:~$ curl http://lochem.raadsinformatie.nl/robots
User-agent: *
Disallow: /
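That Disallow applies to every path on the host. A minimal sketch (mine, not from the thread) using Python's urllib.robotparser confirms that any compliant crawler, including Nutch's fetcher, would skip the whole site:

```python
# Feed the exact rules served at /robots into urllib.robotparser and
# check whether the seed URL from this thread may be fetched.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /",
])

# "Disallow: /" under "User-agent: *" blocks every path for every agent.
print(rp.can_fetch("Nutch", "http://lochem.raadsinformatie.nl/sitemap/meetings/2013/"))
# → False
```

This would explain why fetching reports nothing for parsing: the URLs pass the URL filters, but the fetcher drops them at robots.txt time.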



On Tuesday 30 September 2014 11:13:34 Jigal van Hemert | alterNET internet BV 
wrote:
> Hi,
> 
> 2014-09-17 16:43 GMT+02:00 Jigal van Hemert | alterNET internet BV
> 
> <[email protected]>:
> > Hi,
> > 
> > 2014-09-16 16:15 GMT+02:00 Markus Jelsma <[email protected]>:
> >> You can check the bin/nutch parsechecker tool to see if the URLs are
> >> properly extracted from webpages. Then use the bin/nutch
> >> org.apache.nutch.net.URLFilterChecker -allCombined tool to see if some
> >> filter removes your URLs. They may also be normalized to something
> >> undesirable, but that's not usually the case.
> > Nice tools! Didn't know about them.
> > 
> > Output from parsechecker: http://pastebin.com/EJYNVuVx
> > 
> > Then the URLFilterChecker:
> > 
> > echo "http://lochem.raadsinformatie.nl/sitemap/meetings/2013/" |
> > bin/nutch org.apache.nutch.net.URLFilterChecker -allCombined
> > Checking combination of all URLFilters available
> > +http://lochem.raadsinformatie.nl/sitemap/meetings/2013/
> > 
> > Anything strange in this output?
> 
> I did another Nutch configuration the other day and it successfully
> indexed a couple of sites. The mystery remains why on the same server
> with the same software (just slightly different configuration, but
> nothing significant) one set of seed URLs still doesn't want to work.
> All checks (see above) seem to work correctly, but after the fetching
> part nothing is reported for parsing and subsequent runs just say they
> don't have anything to do.
> 
> Any pointers on debugging the fetching process and finding out what is
> queued for parsing are highly appreciated.
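For reference, the `+` prefix in the URLFilterChecker output above means the URL was accepted. A rough sketch (mine, not Nutch code) of the first-match semantics that regex-urlfilter.txt rules follow — rules are tried top to bottom, the first matching pattern decides, `+` accepts, `-` rejects, and an unmatched URL is dropped:

```python
# Hypothetical re-implementation of regex URL filtering for illustration;
# the rule set below only resembles a default regex-urlfilter.txt.
import re

def apply_filter(rules, url):
    """Return the URL if accepted by the first matching rule, else None."""
    for sign, pattern in rules:
        if re.search(pattern, url):
            return url if sign == "+" else None
    return None  # no rule matched: reject

rules = [
    ("-", r"\.(gif|jpg|png|css|js)$"),           # skip static assets
    ("+", r"^http://lochem\.raadsinformatie\.nl/"),  # accept the seed host
]

print(apply_filter(rules, "http://lochem.raadsinformatie.nl/sitemap/meetings/2013/"))
# → http://lochem.raadsinformatie.nl/sitemap/meetings/2013/
```

Since the seed URLs pass this stage (the `+` line in the checker output), the filters are not the problem; the robots.txt rules shown at the top of this message are the more likely culprit.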
