Any idea why?  I took the URL for the page directly from a
working browser, and I can curl the URL successfully.  Could part of the
problem stem from Nutch thinking the encoding is windows-1252 when it is
actually UTF-8?
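
One way to test the encoding hypothesis offline, as a minimal sketch: decode the same bytes both ways and compare the extracted title. The `extract_title` helper and the `sample` bytes are hypothetical stand-ins (the real page body is not shown in this thread). Since the error title is plain ASCII, it should decode identically under windows-1252 and UTF-8, which would suggest the encoding guess alone does not produce that title.

```python
import re


def extract_title(raw: bytes, encoding: str) -> str:
    """Decode raw HTML bytes with the given encoding and return the <title> text."""
    text = raw.decode(encoding, errors="replace")
    match = re.search(r"<title[^>]*>(.*?)</title>", text, re.IGNORECASE | re.DOTALL)
    return match.group(1).strip() if match else ""


# Hypothetical stand-in for the fetched body; only the title text comes from the thread.
sample = (
    b"<html><head><title>ERROR: The requested URL could not be retrieved"
    b"</title></head><body></body></html>"
)

title_utf8 = extract_title(sample, "utf-8")
title_cp1252 = extract_title(sample, "windows-1252")

# ASCII-only text is byte-for-byte identical in both encodings.
same_title = title_utf8 == title_cp1252

# A non-ASCII title, by contrast, comes out garbled when UTF-8 bytes are read as windows-1252.
garbled = extract_title("<title>Café</title>".encode("utf-8"), "windows-1252")

print(same_title, repr(garbled))
```

If the titles match here, a wrong charset guess would show up as garbled non-ASCII characters rather than as a different page altogether.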

-- Chris



On Thu, Dec 15, 2011 at 9:59 AM, Markus Jelsma
<[email protected]> wrote:
> The page was successfully fetched and parsed, but it seems the title just
> contains: "ERROR: The requested URL could not be retrieved".
>
> On Thursday 15 December 2011 15:36:40 Christopher Gross wrote:
>> I'm getting a success status AND an error message when trying to do a
>> parse check.  It is a SharePoint site, but this part allows for
>> anonymous access -- I can curl the page just fine without having to do
>> anything funky.  I have a robots.txt in place that allows everyone
>> through (it is an internal test site, url has been redacted).  Here's
>> what I run:
>>
>> [user@eval bin]$ ./nutch parsechecker "http://sharepointurl/Home.aspx"
>> fetching: http://sharepointurl/Home.aspx
>> parsing: http://sharepointurl/Home.aspx
>> contentType: text/html
>> ---------
>> Url
>> ---------------
>> http://http://sharepointurl/Home.aspx---------
>> ParseData
>> ---------
>> Version: 5
>> Status: success(1,0)
>> Title: ERROR: The requested URL could not be retrieved
>> Outlinks: 0
>> Content Metadata: Connection=close Content-Type=text/html
>> Parse Metadata: CharEncodingForConversion=windows-1252
>> OriginalCharEncoding=windows-1252
>>
>> Google searches have been fruitless.  Can anyone help me make sense of
>> what is going on here?  I can provide some snippets of config files if
>> need be.
>>
>> Nutch 1.4, SharePoint 2010, Java 1.6.0_06-b02.
>>
>> Thanks!
>>
>> -- Chris
>
> --
> Markus Jelsma - CTO - Openindex
