Can you curl it from the same machine you run Nutch on? This is not a Nutch
error; the error message is embedded in the page title by your webserver.
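
If it helps, here is a minimal sketch of that check in Python, to be run on the
Nutch machine. It fetches the page and prints the status, the Content-Type
header and the title, so you can compare them with the parsechecker output.
The names page_title and check, the USER_AGENT value and the proxy address
in the usage note are all hypothetical. That a proxy sits between Nutch and
the site is only an assumption on my part, something worth ruling out.

```python
import re
import urllib.error
import urllib.request
from typing import Optional

# Hypothetical agent string; substitute the agent name your crawler sends.
USER_AGENT = "my-nutch-crawler"


def page_title(body: bytes, encoding: str = "utf-8") -> str:
    """Return the text of the first <title> element, or "" if there is none."""
    text = body.decode(encoding, errors="replace")
    match = re.search(r"<title[^>]*>(.*?)</title>", text,
                      re.IGNORECASE | re.DOTALL)
    # Collapse runs of whitespace so multi-line titles print on one line.
    return " ".join(match.group(1).split()) if match else ""


def check(url: str, proxy: Optional[str] = None) -> None:
    """Fetch url, directly or through an HTTP proxy, and print what came back."""
    # An empty mapping disables proxy auto-detection from the environment,
    # so proxy=None really means a direct connection.
    proxies = {"http": proxy} if proxy else {}
    opener = urllib.request.build_opener(urllib.request.ProxyHandler(proxies))
    request = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    try:
        with opener.open(request, timeout=30) as response:
            status = response.status
            headers = response.headers
            body = response.read()
    except urllib.error.HTTPError as err:
        # Error responses still carry headers and a body worth inspecting.
        status, headers, body = err.code, err.headers, err.read()
    print("status:      ", status)
    print("content-type:", headers.get("Content-Type"))
    print("title:       ", page_title(body))
```

Call check("http://sharepointurl/Home.aspx") once as is, and once more with
something like proxy="http://proxyhost:3128" (a made-up address) if a proxy
is configured for the crawler. If only one of the two runs prints the
"ERROR: The requested URL could not be retrieved" title, that tells you which
path produces the error page.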

On Thursday 15 December 2011 16:07:11 Christopher Gross wrote:
> Any idea as to why?  I took the URL for the page directly from a
> working browser.  I can curl the url and that works. Could part of the
> problem stem from it thinking the encoding is windows-1252, when it is
> actually UTF-8?
> 
> -- Chris
> 
> 
> 
> On Thu, Dec 15, 2011 at 9:59 AM, Markus Jelsma
> 
> <[email protected]> wrote:
> > The page was successfully fetched and parsed but the title just contains:
> > "ERROR: The requested URL could not be retrieved" as it seems.
> > 
> > On Thursday 15 December 2011 15:36:40 Christopher Gross wrote:
> >> I'm getting a success status AND an error message when trying to do a
> >> parse check.  It is a SharePoint site, but this part allows for
> >> anonymous access -- I can curl the page just fine without having to do
> >> anything funky.  I have a robots.txt in place that allows everyone
> >> through (it is an internal test site, url has been redacted).  Here's
> >> what I run:
> >> 
> >> [user@eval bin]$ ./nutch parsechecker "http://sharepointurl/Home.aspx"
> >> fetching: http://sharepointurl/Home.aspx
> >> parsing: http://sharepointurl/Home.aspx
> >> contentType: text/html
> >> ---------
> >> Url
> >> ---------------
> >> http://http://sharepointurl/Home.aspx---------
> >> ParseData
> >> ---------
> >> Version: 5
> >> Status: success(1,0)
> >> Title: ERROR: The requested URL could not be retrieved
> >> Outlinks: 0
> >> Content Metadata: Connection=close Content-Type=text/html
> >> Parse Metadata: CharEncodingForConversion=windows-1252
> >> OriginalCharEncoding=windows-1252
> >> 
> >> Google searches have been fruitless.  Can anyone help me make sense of
> >> what is going on here?  I can provide some snippets of config files if
> >> need be.
> >> 
> >> Nutch 1.4, SharePoint 2010, Java 1.6.0_06-b02.
> >> 
> >> Thanks!
> >> 
> >> -- Chris
> > 
> > --
> > Markus Jelsma - CTO - Openindex

-- 
Markus Jelsma - CTO - Openindex
