Can you curl it from the same machine you run Nutch on? This is not a Nutch error; the error text is in the page title returned by your webserver.
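One way to check this from the Nutch host is to fetch the page yourself and print the title the server actually sends back. The following is only a minimal sketch using the Python standard library. The names `extract_title` and `fetch_title` are hypothetical helpers, not part of Nutch, and the user agent string is a placeholder you would replace with whatever your Nutch configuration sends.

```python
from html.parser import HTMLParser
from urllib.request import Request, urlopen


class TitleParser(HTMLParser):
    """Collects the text inside the first <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self._done = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True

    def handle_data(self, data):
        if self._in_title:
            self.parts.append(data)


def extract_title(html):
    """Return the page title with whitespace collapsed, or '' if none."""
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.parts).split())


def fetch_title(url, user_agent, timeout=10):
    """Fetch url with the given User-Agent (an assumed placeholder value)
    and return (status, content_type, title).

    urlopen raises urllib.error.HTTPError for 4xx/5xx responses, so an
    error page served with status 200 is the case this helps to spot.
    """
    req = Request(url, headers={"User-Agent": user_agent})
    with urlopen(req, timeout=timeout) as resp:
        status = resp.status
        content_type = resp.headers.get("Content-Type")
        # Fall back to UTF-8 when the server declares no charset.
        charset = resp.headers.get_content_charset() or "utf-8"
        body = resp.read().decode(charset, errors="replace")
    return status, content_type, extract_title(body)
```

Calling `fetch_title("http://sharepointurl/Home.aspx", "my-nutch-agent")` on the machine that runs Nutch and comparing the returned title with what the browser shows should indicate whether the error page comes from the server side rather than from the parser.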
On Thursday 15 December 2011 16:07:11 Christopher Gross wrote:
> Any idea as to why? I took the URL for the page directly from a
> working browser. I can curl the url and that works. Could part of the
> problem stem from it thinking the encoding is windows-1252, when it is
> actually UTF-8?
>
> -- Chris
>
> On Thu, Dec 15, 2011 at 9:59 AM, Markus Jelsma
> <[email protected]> wrote:
> > The page was successfully fetched and parsed but the title just contains:
> > "ERROR: The requested URL could not be retrieved" as it seems.
> >
> > On Thursday 15 December 2011 15:36:40 Christopher Gross wrote:
> >> I'm getting a success status AND an error message when trying to do a
> >> parse check. It is a SharePoint site, but this part allows for
> >> anonymous access -- I can curl the page just fine without having to do
> >> anything funky. I have a robots.txt in place that allows everyone
> >> through (it is an internal test site, url has been redacted). Here's
> >> what I run:
> >>
> >> [user@eval bin]$ ./nutch parsechecker "http://sharepointurl/Home.aspx"
> >> fetching: http://sharepointurl/Home.aspx
> >> parsing: http://sharepointurl/Home.aspx
> >> contentType: text/html
> >> ---------
> >> Url
> >> ---------------
> >> http://http://sharepointurl/Home.aspx---------
> >> ParseData
> >> ---------
> >> Version: 5
> >> Status: success(1,0)
> >> Title: ERROR: The requested URL could not be retrieved
> >> Outlinks: 0
> >> Content Metadata: Connection=close Content-Type=text/html
> >> Parse Metadata: CharEncodingForConversion=windows-1252
> >> OriginalCharEncoding=windows-1252
> >>
> >> Google searches have been fruitless. Can anyone help me make sense of
> >> what is going on here? I can provide some snippets of config files if
> >> need be.
> >>
> >> Nutch 1.4, SharePoint 2010, Java 1.6.0_06-b02.
> >>
> >> Thanks!
> >>
> >> -- Chris
> >
> > --
> > Markus Jelsma - CTO - Openindex

--
Markus Jelsma - CTO - Openindex

