Hi,

> http://rssfeeds.azcentral.com/phoenix/asu

That's already an RSS feed, which unfortunately fails to parse:

(using plugin "feed")
  Status: failed(2,200): com.sun.syndication.io.ParsingFeedException:
  Invalid XML: Error on line 183: XML document structures must start and
  end within the same entity.

(using "parse-tika")
  Caused by: com.rometools.rome.io.ParsingFeedException: Invalid XML:
  Error on line 188: XML document structures must start and end within
  the same entity.

When opening the URL in a browser (Firefox) the server sends an HTML page.
At least, that's what I got when trying it:

  % wget -q -O - http://rssfeeds.azcentral.com/phoenix/asu | head
  <?xml version="1.0"?>
  <?xml-stylesheet type="text/xsl" href="http://rssfeeds.azcentral.com/feedblitz_rss.xslt"?><rss xmlns:content="http://purl.org/rss/1.0/modules/content/" version="2.0" xmlns:feedburner="http://rssnamespace.org/feedburner/ext/1.0">
  <channel>
  <title>Phoenix - ASU</title>
  <link>http://api-internal.usatoday.com.akadns.net</link>
  <description>Phoenix - ASU</description>
  <copyright>Copyright 2016, GANNETT</copyright>
  <language>en-us</language>
  <item>
  <feedburner:origLink>http://www.azcentral.com/story/sports/ncaaf/asu/2016/10/05/arizona-state-football-needs-reignite-run-game-against-ucla/91631636/</feedburner:origLink>

Best,
Sebastian

On 10/05/2016 02:08 PM, Adler, Matthew (US) wrote:
> Hello Nutch Users:
>
> I’m currently having an issue with Nutch 1.4, similar to the one logged here:
>
> https://issues.apache.org/jira/browse/NUTCH-2319
>
> Using the example in that JIRA issue, if I am on the following URL:
> http://rssfeeds.azcentral.com/phoenix/asu
>
> I expect that Nutch will be able to find the alternate linked URL, specified
> in the following link tag:
>
> <link rel="alternate" type="application/atom+xml"
> href="http://rssfeeds.azcentral.com/phoenix/asu&x=1" title="Phoenix -
> ASU">
>
> It does not, however, even though I’ve tried to make a few changes to the
> regex in suffix-urlfilter.txt, regex-normalize.xml, regex-urlfilter.txt,
> and prefix-urlfilter.txt, but have not had any success.
>
> Any feedback would be appreciated.
>
> Please let me know,
>
> MA
> This message contains information which may be confidential and privileged.
> Unless you are the intended addressee (or authorized to receive for the
> intended addressee), you may not use, copy or disclose to anyone the message
> or any information contained in the message. If you have received the message
> in error, please advise the sender by reply and delete the message.
>
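
P.S. The parse errors quoted in the reply ("XML document structures must start and end within the same entity") are what a parser typically reports when a document ends before all of its elements are closed. A minimal sketch for checking this independently of Nutch, using only the Python standard library, might look like the following. The function name check_well_formed and the sample strings are made up for illustration; in practice the text would be the saved output of the wget command above.

```python
import xml.etree.ElementTree as ET


def check_well_formed(xml_text):
    """Return (True, None) if xml_text parses as XML, else (False, message).

    Hypothetical helper for illustration; it is not part of Nutch, ROME or Tika.
    """
    try:
        ET.fromstring(xml_text)
        return True, None
    except ET.ParseError as err:
        # The message includes the line and column where parsing stopped.
        return False, str(err)


# Illustrative sample data, not the real feed content.
complete = "<rss version='2.0'><channel><title>Phoenix - ASU</title></channel></rss>"
# Simulate a feed that is cut off before its closing tags.
truncated = complete[: -len("</channel></rss>")]

print(check_well_formed(complete))
print(check_well_formed(truncated))
```

If the saved feed fails this check as well, the problem is in the document the server delivers rather than in the URL filter or normalizer configuration.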

