RE: Issue Crawling Alternate URLs

Adler, Matthew (US) Thu, 06 Oct 2016 08:37:54 -0700

Hi Sebastian:

You are correct about the first URL, but that isn't my issue.  The issue is 
that when I crawl that initial page, 
http://rssfeeds.azcentral.com/phoenix/asu, I want Nutch to find the RSS page 
linked from it, which is this one:

http://rssfeeds.azcentral.com/phoenix/asu&x=1

The issue, though, is that Nutch can't seem to find that link.  From what I can 
tell, the reason is the structure of the link tag, which is:

<link rel="alternate" type="application/atom+xml" 
href="http://rssfeeds.azcentral.com/phoenix/asu&x=1" title="Phoenix - ASU">
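
As a quick sanity check outside Nutch, the minimal Python sketch below shows 
what a generic HTML parser extracts from a tag like this one. It uses only the 
standard library; the names AlternateLinkParser and find_alternate_links, and 
the surrounding sample markup, are made up for illustration. This is not how 
Nutch discovers outlinks.

```python
from html.parser import HTMLParser


class AlternateLinkParser(HTMLParser):
    """Collect href values of <link rel="alternate"> tags."""

    def __init__(self):
        super().__init__()
        self.alternates = []

    def handle_starttag(self, tag, attrs):
        # Only <link> elements are of interest here.
        if tag != "link":
            return
        attr = dict(attrs)
        # rel may hold several space-separated tokens, in any case.
        rel = (attr.get("rel") or "").lower().split()
        if "alternate" in rel and attr.get("href"):
            self.alternates.append(attr["href"])


def find_alternate_links(html_text):
    """Return the hrefs of all rel="alternate" link tags in html_text."""
    parser = AlternateLinkParser()
    parser.feed(html_text)
    parser.close()
    return parser.alternates


# Hypothetical page wrapping the link tag quoted above.
sample = (
    '<html><head><link rel="alternate" type="application/atom+xml" '
    'href="http://rssfeeds.azcentral.com/phoenix/asu&x=1" '
    'title="Phoenix - ASU"></head><body></body></html>'
)
print(find_alternate_links(sample))
```

If a check like this finds the href, the tag itself is readable by a lenient 
parser, which would point at link extraction or URL filtering rather than at 
the markup.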

Please let me know if this clarifies the issue.

Cheers,
MA

-----Original Message-----
From: Sebastian Nagel [mailto:[email protected]]
Sent: Thursday, October 06, 2016 8:26 AM
To: [email protected]
Subject: Re: Issue Crawling Alternate URLs

Hi,

> http://rssfeeds.azcentral.com/phoenix/asu

That's already an RSS feed, which unfortunately fails to parse:
(using plugin "feed")
 Status: failed(2,200): com.sun.syndication.io.ParsingFeedException: Invalid 
XML: Error on line 183:
XML document structures must start and end within the same entity.
(using "parse-tika")
 Caused by: com.rometools.rome.io.ParsingFeedException: Invalid XML: Error on 
line 188: XML document structures must start and end within the same entity.

When opening the URL in a browser (Firefox), the server sends an HTML page.
At least, that's what I got when trying it:

% wget -q -O - http://rssfeeds.azcentral.com/phoenix/asu | head
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="http://rssfeeds.azcentral.com/feedblitz_rss.xslt"?><rss
xmlns:content="http://purl.org/rss/1.0/modules/content/"  version="2.0"
xmlns:feedburner="http://rssnamespace.org/feedburner/ext/1.0">
  <channel>
    <title>Phoenix - ASU</title>
    <link>http://api-internal.usatoday.com.akadns.net</link>
    <description>Phoenix - ASU</description>
    <copyright>Copyright 2016, GANNETT</copyright>
    <language>en-us</language>
<item>
<feedburner:origLink>http://www.azcentral.com/story/sports/ncaaf/asu/2016/10/05/arizona-state-football-needs-reignite-run-game-against-ucla/91631636/</feedburner:origLink>
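
The "must start and end within the same entity" error usually means the parser 
reached the end of the input while elements were still open, for example 
because the content was cut off. A minimal Python sketch of that check follows, 
using the standard library's xml.etree.ElementTree. The helper name 
check_well_formed and both sample strings are made up for illustration and are 
not taken from the actual feed.

```python
import xml.etree.ElementTree as ET


def check_well_formed(xml_text):
    """Return (True, None) if xml_text parses, else (False, error message)."""
    try:
        ET.fromstring(xml_text)
        return True, None
    except ET.ParseError as err:
        return False, str(err)


# Hypothetical samples: a complete feed and one cut off before its end tags.
complete = "<rss version='2.0'><channel><title>Phoenix - ASU</title></channel></rss>"
truncated = "<rss version='2.0'><channel><title>Phoenix - ASU</title>"

print(check_well_formed(complete))
print(check_well_formed(truncated))
```

Running the same check on the bytes the crawler actually fetched would show 
whether the stored content is complete.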

Best,
Sebastian

On 10/05/2016 02:08 PM, Adler, Matthew (US) wrote:
> Hello Nutch Users:
>
> I’m currently having an issue with Nutch 1.4, similar to the one logged here:
>
> https://issues.apache.org/jira/browse/NUTCH-2319
>
> Using the example in that JIRA issue, if I am on the following URL:
> http://rssfeeds.azcentral.com/phoenix/asu
>
> I expect that nutch will be able to find the alternate linked URL, specified 
> in the following link tag:
>
> <link rel="alternate" type="application/atom+xml"
> href="http://rssfeeds.azcentral.com/phoenix/asu&amp;x=1"
> title="Phoenix - ASU">
>
> It does not, however. I’ve tried changing the regular expressions in 
> suffix-urlfilter.txt, regex-normalize.xml, regex-urlfilter.txt, and 
> prefix-urlfilter.txt, but have not had any success.
>
> Any feedback would be appreciated.
>
> Please let me know,
>
> MA
> This message contains information which may be confidential and privileged. 
> Unless you are the intended addressee (or authorized to receive for the 
> intended addressee), you may not use, copy or disclose to anyone the message 
> or any information contained in the message. If you have received the message 
> in error, please advise the sender by reply and delete the message.
>
