Re: Issue Crawling Alternate URLs

Sebastian Nagel Fri, 07 Oct 2016 00:40:28 -0700

Hi Matthew,

afaics, the content delivered to Nutch under the URL


  http://rssfeeds.azcentral.com/phoenix/asu

does not contain the link

  http://rssfeeds.azcentral.com/phoenix/asu&x=1

That's the simple answer. What you see in a browser is often not what the
server delivers to a spider. I've tested with both Nutch and wget, see below.

Best,
Sebastian


% bin/nutch plugin protocol-http org.apache.nutch.protocol.http.Http \
     -verbose http://rssfeeds.azcentral.com/phoenix/asu
Status: success(1), lastModified=0
Content Type: application/rss+xml
Content Length: null
Content:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="http://rssfeeds.azcentral.com/feedblitz_rss.xslt"?><rss
xmlns:content="http://purl.org/rss/1.0/modules/content/"  version="2.0"
xmlns:feedburner="http://rssnamespace.org/feedburner/ext/1.0">
  <channel>
    <title>Phoenix - ASU</title>
    <link>http://api-internal.usatoday.com.akadns.net</link>
...

% wget -O azcentral.asu.wget.xml http://rssfeeds.azcentral.com/phoenix/asu
--2016-10-07 09:32:21--  http://rssfeeds.azcentral.com/phoenix/asu
Resolving rssfeeds.azcentral.com (rssfeeds.azcentral.com)... 198.251.67.124, 198.251.67.127, 198.71.59.197, ...
Connecting to rssfeeds.azcentral.com (rssfeeds.azcentral.com)|198.251.67.124|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/xml]
Saving to: ‘azcentral.asu.wget.xml’

azcentral.asu.wget.xml                      [ <=>                       ] 136.25K  --.-KB/s    in 0.01s

2016-10-07 09:32:23 (11.6 MB/s) - ‘azcentral.asu.wget.xml’ saved [139517]

% grep -F 'http://rssfeeds.azcentral.com/phoenix/asu&x=1' azcentral.asu.wget.xml

(nothing found)
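
To repeat this check on any saved response, the grep step above can be wrapped in a small shell function. This is only a sketch: the name list_alternate_links is made up for illustration, it assumes each <link> tag sits on a single line with double-quoted attributes, and it is no substitute for a real HTML parser.

```shell
# Hypothetical helper (not part of Nutch or wget): print the href of every
# <link rel="alternate" ...> tag found in a saved response.
# Assumes one tag per line and double-quoted attribute values.
list_alternate_links() {
  grep -o '<link[^>]*rel="alternate"[^>]*>' "$1" \
    | sed -n 's/.*href="\([^"]*\)".*/\1/p'
}

# Example usage (commented out because it needs network access):
#   wget -q -O as-spider.xml http://rssfeeds.azcentral.com/phoenix/asu
#   list_alternate_links as-spider.xml
# To compare with what a browser might receive, one could also try
# wget's --user-agent option with a browser-like string; whether this
# server varies its response by User-Agent is an assumption, not tested here.
```

If the function prints nothing for the file fetched as a spider, the link is not in the delivered content, and no URL filter or normalizer setting will make Nutch find it.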


On 10/06/2016 05:37 PM, Adler, Matthew (US) wrote:
> Hi Sebastian:
> 
> You are correct about the first URL, but that isn't my issue.  The issue 
> is that when I crawl that initial page, 
> http://rssfeeds.azcentral.com/phoenix/asu, I want Nutch to find the RSS page 
> linked from it, which is this one:
> 
> http://rssfeeds.azcentral.com/phoenix/asu&x=1
> 
> The issue, though, is that Nutch can't seem to find that link.  From what I 
> can tell, the reason is the structure of the link tag, which is:
> 
> <link rel="alternate" type="application/atom+xml" 
> href="http://rssfeeds.azcentral.com/phoenix/asu&x=1" title="Phoenix - ASU">
> 
> Please let me know if this clarifies the issue.
> 
> Cheers,
> MA
> 
> -----Original Message-----
> From: Sebastian Nagel [mailto:[email protected]]
> Sent: Thursday, October 06, 2016 8:26 AM
> To: [email protected]
> Subject: Re: Issue Crawling Alternate URLs
> 
> Hi,
> 
>> http://rssfeeds.azcentral.com/phoenix/asu
> 
> That's already an RSS feed which unluckily fails to parse:
> (using plugin "feed")
>  Status: failed(2,200): com.sun.syndication.io.ParsingFeedException: Invalid 
> XML: Error on line 183:
> XML document structures must start and end within the same entity.
> (using "parse-tika")
>  Caused by: com.rometools.rome.io.ParsingFeedException: Invalid XML: Error on 
> line 188: XML document structures must start and end within the same entity.
> 
> 
> When opening the URL in a browser (Firefox), the server sends an HTML page.
> At least, that's what I got when trying it:
> 
> % wget -q -O - http://rssfeeds.azcentral.com/phoenix/asu | head
> <?xml version="1.0"?>
> <?xml-stylesheet type="text/xsl" href="http://rssfeeds.azcentral.com/feedblitz_rss.xslt"?><rss
> xmlns:content="http://purl.org/rss/1.0/modules/content/"  version="2.0"
> xmlns:feedburner="http://rssnamespace.org/feedburner/ext/1.0">
>   <channel>
>     <title>Phoenix - ASU</title>
>     <link>http://api-internal.usatoday.com.akadns.net</link>
>     <description>Phoenix - ASU</description>
>     <copyright>Copyright 2016, GANNETT</copyright>
>     <language>en-us</language>
> <item>
> <feedburner:origLink>http://www.azcentral.com/story/sports/ncaaf/asu/2016/10/05/arizona-state-football-needs-reignite-run-game-against-ucla/91631636/</feedburner:origLink>
> 
> 
> Best,
> Sebastian
> 
> On 10/05/2016 02:08 PM, Adler, Matthew (US) wrote:
>> Hello Nutch Users:
>>
>> I’m currently having an issue with Nutch 1.4, similar to the one logged here:
>>
>> https://issues.apache.org/jira/browse/NUTCH-2319
>>
>> Using the example in that JIRA issue, if I am on the following URL:
>> http://rssfeeds.azcentral.com/phoenix/asu
>>
>> I expect that nutch will be able to find the alternate linked URL, specified 
>> in the following link tag:
>>
>> <link rel="alternate" type="application/atom+xml"
>> href="http://rssfeeds.azcentral.com/phoenix/asu&amp;x=1"
>> title="Phoenix - ASU">
>>
>> It does not, however, even though I’ve tried changing the rules in 
>> suffix-urlfilter.txt, regex-normalize.xml, regex-urlfilter.txt, and 
>> prefix-urlfilter.txt, without any success.
>>
>> Any feedback would be appreciated.
>>
>> Please let me know,
>>
>> MA
>> This message contains information which may be confidential and privileged. 
>> Unless you are the intended addressee (or authorized to receive for the 
>> intended addressee), you may not use, copy or disclose to anyone the message 
>> or any information contained in the message. If you have received the 
>> message in error, please advise the sender by reply and delete the message.
>>
> 
