Hi Matthew,

afaics, the content delivered to Nutch under the URL
  http://rssfeeds.azcentral.com/phoenix/asu
does not contain the link
  http://rssfeeds.azcentral.com/phoenix/asu&x=1

That's the simple answer. What you see in a browser is often not
what the server delivers to a spider.

I've tested both Nutch and wget, see below.

Best,
Sebastian


% bin/nutch plugin protocol-http org.apache.nutch.protocol.http.Http \
    -verbose http://rssfeeds.azcentral.com/phoenix/asu
Status: success(1), lastModified=0
Content Type: application/rss+xml
Content Length: null
Content:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="http://rssfeeds.azcentral.com/feedblitz_rss.xslt"?><rss xmlns:content="http://purl.org/rss/1.0/modules/content/" version="2.0" xmlns:feedburner="http://rssnamespace.org/feedburner/ext/1.0">
<channel>
<title>Phoenix - ASU</title>
<link>http://api-internal.usatoday.com.akadns.net</link>
...

% wget -O azcentral.asu.wget.xml http://rssfeeds.azcentral.com/phoenix/asu
--2016-10-07 09:32:21--  http://rssfeeds.azcentral.com/phoenix/asu
Resolving rssfeeds.azcentral.com (rssfeeds.azcentral.com)... 198.251.67.124, 198.251.67.127, 198.71.59.197, ...
Connecting to rssfeeds.azcentral.com (rssfeeds.azcentral.com)|198.251.67.124|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/xml]
Saving to: ‘azcentral.asu.wget.xml’

azcentral.asu.wget.xml    [ <=> ] 136.25K  --.-KB/s    in 0.01s

2016-10-07 09:32:23 (11.6 MB/s) - ‘azcentral.asu.wget.xml’ saved [139517]

% grep -F 'http://rssfeeds.azcentral.com/phoenix/asu&x=1' azcentral.asu.wget.xml
(nothing found)


On 10/06/2016 05:37 PM, Adler, Matthew (US) wrote:
> Hi Sebastian:
>
> You are correct in terms of the first URL, which isn't my issue. The issue
> is that if I am attempting to crawl that initial page,
> http://rssfeeds.azcentral.com/phoenix/asu, I want nutch to find the RSS page
> linked from it, which is this one:
>
> http://rssfeeds.azcentral.com/phoenix/asu&x=1
>
> The issue though, is nutch can't seem to find that link.
> From what I can tell the reason is due to the structure of the link tag,
> which is:
>
> <link rel="alternate" type="application/atom+xml"
> href="http://rssfeeds.azcentral.com/phoenix/asu&x=1" title="Phoenix - ASU">
>
> Please let me know if this clarifies the issue.
>
> Cheers,
> MA
>
> -----Original Message-----
> From: Sebastian Nagel [mailto:wastl.na...@googlemail.com]
> Sent: Thursday, October 06, 2016 8:26 AM
> To: user@nutch.apache.org
> Subject: Re: Issue Crawling Alternate URLs
>
> Hi,
>
>> http://rssfeeds.azcentral.com/phoenix/asu
>
> That's already an RSS feed which unluckily fails to parse:
> (using plugin "feed")
> Status: failed(2,200): com.sun.syndication.io.ParsingFeedException: Invalid
> XML: Error on line 183:
> XML document structures must start and end within the same entity.
> (using "parse-tika")
> Caused by: com.rometools.rome.io.ParsingFeedException: Invalid XML: Error on
> line 188: XML document structures must start and end within the same entity.
>
>
> When opening the URL in a browser (Firefox) the server sends an HTML page.
> At least, that's what I got when trying it:
>
> % wget -q -O - http://rssfeeds.azcentral.com/phoenix/asu | head
> <?xml version="1.0"?>
> <?xml-stylesheet type="text/xsl" href="http://rssfeeds.azcentral.com/feedblitz_rss.xslt"?><rss xmlns:content="http://purl.org/rss/1.0/modules/content/" version="2.0" xmlns:feedburner="http://rssnamespace.org/feedburner/ext/1.0">
> <channel>
> <title>Phoenix - ASU</title>
> <link>http://api-internal.usatoday.com.akadns.net</link>
> <description>Phoenix - ASU</description>
> <copyright>Copyright 2016, GANNETT</copyright>
> <language>en-us</language>
> <item>
> <feedburner:origLink>http://www.azcentral.com/story/sports/ncaaf/asu/2016/10/05/arizona-state-football-needs-reignite-run-game-against-ucla/91631636/</feedburner:origLink>
>
>
> Best,
> Sebastian
>
> On 10/05/2016 02:08 PM, Adler, Matthew (US) wrote:
>> Hello Nutch Users:
>>
>> I’m currently having an issue with Nutch 1.4, similar to the one logged here:
>>
>> https://issues.apache.org/jira/browse/NUTCH-2319
>>
>> Using the example in that JIRA issue, if I am on the following URL:
>> http://rssfeeds.azcentral.com/phoenix/asu
>>
>> I expect that nutch will be able to find the alternate linked URL, specified
>> in the following link tag:
>>
>> <link rel="alternate" type="application/atom+xml"
>> href="http://rssfeeds.azcentral.com/phoenix/asu&x=1"
>> title="Phoenix - ASU">
>>
>> It does not, however, even though I’ve tried making a few changes to the
>> regex in suffix-urlfilter.txt, regex-normalize.xml, regex-urlfilter.txt,
>> and prefix-urlfilter.txt, without any success.
>>
>> Any feedback would be appreciated.
>>
>> Please let me know,
>>
>> MA
>> This message contains information which may be confidential and privileged.
>> Unless you are the intended addressee (or authorized to receive for the
>> intended addressee), you may not use, copy or disclose to anyone the message
>> or any information contained in the message.
>> If you have received the message in error, please advise the sender by reply
>> and delete the message.