Re: Issue Crawling Alternate URLs

Sebastian Nagel Thu, 06 Oct 2016 08:26:01 -0700

Hi,

> http://rssfeeds.azcentral.com/phoenix/asu


That's already an RSS feed, which unfortunately fails to parse:
(using plugin "feed")
 Status: failed(2,200): com.sun.syndication.io.ParsingFeedException:
 Invalid XML: Error on line 183:
 XML document structures must start and end within the same entity.
(using "parse-tika")
 Caused by: com.rometools.rome.io.ParsingFeedException:
 Invalid XML: Error on line 188:
 XML document structures must start and end within the same entity.
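
To confirm independently of Nutch that a fetched document is cut off, a well-formedness check along the following lines may help. This is only a minimal sketch using Python's standard library; the helper name check_well_formed and the two sample strings are made up for illustration and are not the actual feed content.

```python
import xml.etree.ElementTree as ET


def check_well_formed(xml_text):
    """Return (True, None) if xml_text parses, else (False, description)."""
    try:
        ET.fromstring(xml_text)
        return True, None
    except ET.ParseError as err:
        # err.position is the (line, column) reported by the underlying parser
        line, column = err.position
        return False, f"line {line}, column {column}: {err}"


# Hypothetical samples: a complete document and one that ends mid-stream,
# which is the kind of input that produces the error quoted above.
complete = "<rss version='2.0'><channel><title>t</title></channel></rss>"
truncated = "<rss version='2.0'><channel><title>t</title>"

print(check_well_formed(complete))
print(check_well_formed(truncated))
```

If the second kind of result shows up for the real feed, the problem lies in the document the server delivered rather than in the parser plugin.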


When the URL is opened in a browser (Firefox), the server sends an HTML page;
at least, that's what I got when trying it. Fetched with wget, it returns the
feed XML:

% wget -q -O - http://rssfeeds.azcentral.com/phoenix/asu | head
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl"
href="http://rssfeeds.azcentral.com/feedblitz_rss.xslt"?><rss
xmlns:content="http://purl.org/rss/1.0/modules/content/" version="2.0"
xmlns:feedburner="http://rssnamespace.org/feedburner/ext/1.0">
  <channel>
    <title>Phoenix - ASU</title>
    <link>http://api-internal.usatoday.com.akadns.net</link>
    <description>Phoenix - ASU</description>
    <copyright>Copyright 2016, GANNETT</copyright>
    <language>en-us</language>
<item>
<feedburner:origLink>http://www.azcentral.com/story/sports/ncaaf/asu/2016/10/05/arizona-state-football-needs-reignite-run-game-against-ucla/91631636/</feedburner:origLink>


Best,
Sebastian

On 10/05/2016 02:08 PM, Adler, Matthew (US) wrote:
> Hello Nutch Users:
> 
> I’m currently having an issue with Nutch 1.4, similar to the one logged here:
> 
> https://issues.apache.org/jira/browse/NUTCH-2319
> 
> Using the example in that JIRA issue, if I am on the following URL:
> http://rssfeeds.azcentral.com/phoenix/asu
> 
> I expect that Nutch will be able to find the alternate linked URL, specified 
> in the following link tag:
> 
> <link rel="alternate" type="application/atom+xml" 
> href="http://rssfeeds.azcentral.com/phoenix/asu&amp;x=1" title="Phoenix - 
> ASU">
> 
> It does not, however, even though I’ve tried changing the regular 
> expressions in suffix-urlfilter.txt, regex-normalize.xml, 
> regex-urlfilter.txt, and prefix-urlfilter.txt, without any success.
> 
> Any feedback would be appreciated.
> 
> Please let me know,
> 
> MA
