RE: [EXTERNAL] - Re: Outlinks field is not populated when page from seed URL when fetched page contains "refresh" meta tag

2017-06-22 Thread Vyacheslav Pascarel
Hi Lewis,

It seems that URLs get mangled when a message is posted to the email list. The 
seed URL that I used was for MSNBC dot COM: 

http---www-msnbc-com  (replace dashes with ":", "/", and ".")

Regards,

Vyacheslav Pascarel


-----Original Message-----
From: lewis john mcgibbney [mailto:lewi...@apache.org] 
Sent: Thursday, June 22, 2017 2:11 PM
To: user@nutch.apache.org
Subject: [EXTERNAL] - Re: Outlinks field is not populated when page from seed 
URL when fetched page contains "refresh" meta tag

Hi Vyacheslav,
Can you provide me an example page with the HTTP refresh tag included? I'll try 
comparing behaviour between 2.X and master.
Thank you
Lewis

On Sat, Jun 17, 2017 at 9:25 AM, <user-digest-h...@nutch.apache.org> wrote:

> From: Vyacheslav Pascarel <vpasc...@opentext.com>
> To: "user@nutch.apache.org" <user@nutch.apache.org>
> Cc:
> Bcc:
> Date: Fri, 16 Jun 2017 13:18:16 +0000
> Subject: RE: [EXTERNAL] - Re: Outlinks field is not populated when 
> page from seed URL when fetched page contains "refresh" meta tag
> It is 2.3.1.
>
>


Re: Outlinks field is not populated when page from seed URL when fetched page contains "refresh" meta tag

2017-06-22 Thread lewis john mcgibbney
Hi Vyacheslav,
Can you provide me an example page with the HTTP refresh tag included? I'll
try comparing behaviour between 2.X and master.
Thank you
Lewis

On Sat, Jun 17, 2017 at 9:25 AM, <user-digest-h...@nutch.apache.org> wrote:

> From: Vyacheslav Pascarel <vpasc...@opentext.com>
> To: "user@nutch.apache.org" <user@nutch.apache.org>
> Cc:
> Bcc:
> Date: Fri, 16 Jun 2017 13:18:16 +0000
> Subject: RE: [EXTERNAL] - Re: Outlinks field is not populated when page
> from seed URL when fetched page contains "refresh" meta tag
> It is 2.3.1.
>
>


RE: [EXTERNAL] - Re: Outlinks field is not populated when page from seed URL when fetched page contains "refresh" meta tag

2017-06-16 Thread Vyacheslav Pascarel
It is 2.3.1. 

Vyacheslav Pascarel


-----Original Message-----
From: lewis john mcgibbney [mailto:lewi...@apache.org] 
Sent: Thursday, June 15, 2017 11:23 PM
To: user@nutch.apache.org
Subject: [EXTERNAL] - Re: Outlinks field is not populated when page from seed 
URL when fetched page contains "refresh" meta tag

Hi Vyacheslav,

On Thu, Jun 15, 2017 at 1:41 AM, <user-digest-h...@nutch.apache.org> wrote:

>
> From: Vyacheslav Pascarel <vpasc...@opentext.com>
> To: "user@nutch.apache.org" <user@nutch.apache.org>
> Cc:
> Bcc:
> Date: Wed, 14 Jun 2017 22:15:49 +
> Subject: Outlinks field is not populated when page from seed URL when 
> fetched page contains "refresh" meta tag
> Hello,
>
> I am trying to crawl http://www.msnbc.com/ but am having a problem getting 
> anything besides the original seed URL. The INJECT/GENERATE/FETCH steps 
> complete without problems, but after executing PARSE I see only one outlink, 
> pointing to the original seed URL:
>
> ...
Which version of Nutch are you using?
Lewis


Re: Outlinks field is not populated when page from seed URL when fetched page contains "refresh" meta tag

2017-06-15 Thread lewis john mcgibbney
Hi Vyacheslav,

On Thu, Jun 15, 2017 at 1:41 AM, <user-digest-h...@nutch.apache.org> wrote:

>
> From: Vyacheslav Pascarel <vpasc...@opentext.com>
> To: "user@nutch.apache.org" <user@nutch.apache.org>
> Cc:
> Bcc:
> Date: Wed, 14 Jun 2017 22:15:49 +
> Subject: Outlinks field is not populated when page from seed URL when
> fetched page contains "refresh" meta tag
> Hello,
>
> I am trying to crawl http://www.msnbc.com/ but am having a problem getting
> anything besides the original seed URL. The INJECT/GENERATE/FETCH steps
> complete without problems, but after executing PARSE I see only one outlink,
> pointing to the original seed URL:
>
> ...
Which version of Nutch are you using?
Lewis
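
For readers following the thread, the "refresh" meta tag under discussion can
be spotted in fetched HTML with a few lines of code. The sketch below is in
Python rather than Nutch's Java, uses only the standard library, and is not
how Nutch itself handles the tag. The names MetaRefreshFinder and
find_refresh_targets, and the sample page, are hypothetical and are not taken
from the MSNBC markup.

```python
import re
from html.parser import HTMLParser


class MetaRefreshFinder(HTMLParser):
    """Collects the target URL of every <meta http-equiv="refresh"> tag."""

    def __init__(self):
        super().__init__()
        self.targets = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # Attribute values may be None for bare attributes, so default to "".
        attr = {name.lower(): (value or "") for name, value in attrs}
        if attr.get("http-equiv", "").lower() != "refresh":
            return
        # The content attribute typically looks like "0; url=http://host/path".
        match = re.search(
            r"url\s*=\s*['\"]?([^'\";\s]+)", attr.get("content", ""), re.I
        )
        if match:
            self.targets.append(match.group(1))


def find_refresh_targets(html):
    """Return the redirect targets named by refresh meta tags in html."""
    finder = MetaRefreshFinder()
    finder.feed(html)
    return finder.targets


# Hypothetical page used only for illustration.
SAMPLE = (
    '<html><head>'
    '<meta http-equiv="refresh" content="0; url=http://www.example.com/">'
    '</head><body><a href="/news">News</a></body></html>'
)

if __name__ == "__main__":
    print(find_refresh_targets(SAMPLE))
```

Running the same check against the raw content stored for the seed URL would
show whether the page carries such a tag and where it points.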