Any idea why? I took the URL for the page directly from a working browser. I can curl the URL and that works. Could part of the problem stem from Nutch thinking the encoding is windows-1252 when it is actually UTF-8?
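
To compare what curl receives with what the crawler receives, here is a minimal sketch in Python (standard library only). Given a response body and its Content-Type header value, it reports the charset the header declares, whether the bytes are valid UTF-8, and the page title. The function name `summarize_response` and the "title starts with ERROR:" check are assumptions made for illustration; neither is part of Nutch.

```python
import re
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text found inside the <title> element."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.parts.append(data)


def summarize_response(body: bytes, content_type: str) -> dict:
    """Hypothetical helper: summarize the charset and title of one response."""
    # Charset as declared in the header, if any (e.g. "text/html; charset=utf-8").
    match = re.search(r"charset=\s*[\"']?([\w.:-]+)", content_type, re.I)
    header_charset = match.group(1).lower() if match else None

    # Check whether the raw bytes decode cleanly as UTF-8.
    try:
        body.decode("utf-8")
        valid_utf8 = True
    except UnicodeDecodeError:
        valid_utf8 = False

    # Decode leniently just to read the title; bad bytes become U+FFFD.
    parser = TitleParser()
    parser.feed(body.decode("utf-8", errors="replace"))
    parser.close()
    title = "".join(parser.parts).strip()

    return {
        "header_charset": header_charset,
        "valid_utf8": valid_utf8,
        "title": title,
        # Assumption: an error page announces itself in the title.
        "looks_like_error_page": title.upper().startswith("ERROR:"),
    }
```

For example, save the page with `curl -o page.html` followed by the URL, read the file in binary mode, and pass the bytes together with the Content-Type value that curl reports. If the title does not begin with "ERROR:", the two clients are probably not receiving the same response, and the encoding would be a separate question.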
-- Chris

On Thu, Dec 15, 2011 at 9:59 AM, Markus Jelsma <[email protected]> wrote:
> The page was successfully fetched and parsed but the title just contains:
> "ERROR: The requested URL could not be retrieved" as it seems.
>
> On Thursday 15 December 2011 15:36:40 Christopher Gross wrote:
>> I'm getting a success status AND an error message when trying to do a
>> parse check. It is a SharePoint site, but this part allows for
>> anonymous access -- I can curl the page just fine without having to do
>> anything funky. I have a robots.txt in place that allows everyone
>> through (it is an internal test site, url has been redacted). Here's
>> what I run:
>>
>> [user@eval bin]$ ./nutch parsechecker "http://sharepointurl/Home.aspx"
>> fetching: http://sharepointurl/Home.aspx
>> parsing: http://sharepointurl/Home.aspx
>> contentType: text/html
>> ---------
>> Url
>> ---------------
>> http://http://sharepointurl/Home.aspx---------
>> ParseData
>> ---------
>> Version: 5
>> Status: success(1,0)
>> Title: ERROR: The requested URL could not be retrieved
>> Outlinks: 0
>> Content Metadata: Connection=close Content-Type=text/html
>> Parse Metadata: CharEncodingForConversion=windows-1252
>> OriginalCharEncoding=windows-1252
>>
>> Google searches have been fruitless. Can anyone help me make sense of
>> what is going on here? I can provide some snippets of config files if
>> need be.
>>
>> Nutch 1.4, SharePoint 2010, Java 1.6.0_06-b02.
>>
>> Thanks!
>>
>> -- Chris
>
> --
> Markus Jelsma - CTO - Openindex

