RE: [EXTERNAL] - Re: Outlinks field is not populated when page from seed URL when fetched page contains "refresh" meta tag
Hi Lewis,

It seems that URLs get mangled when a message is posted to the mailing list. The seed URL that I used was for MSNBC dot COM: http---www-msnbc-com (replace the dashes with ":", "/", and ".").

Regards,
Vyacheslav Pascarel

-----Original Message-----
From: lewis john mcgibbney [mailto:lewi...@apache.org]
Sent: Thursday, June 22, 2017 2:11 PM
To: user@nutch.apache.org
Subject: [EXTERNAL] - Re: Outlinks field is not populated when page from seed URL when fetched page contains "refresh" meta tag

Hi Vyacheslav,

Can you provide me an example page with the HTTP refresh tag included? I'll try comparing behaviour between 2.X and master.

Thank you
Lewis

On Sat, Jun 17, 2017 at 9:25 AM, <user-digest-h...@nutch.apache.org> wrote:
> From: Vyacheslav Pascarel <vpasc...@opentext.com>
> To: "user@nutch.apache.org" <user@nutch.apache.org>
> Cc:
> Bcc:
> Date: Fri, 16 Jun 2017 13:18:16 +0000
> Subject: RE: [EXTERNAL] - Re: Outlinks field is not populated when page from seed URL when fetched page contains "refresh" meta tag
>
> It is 2.3.1.
Re: Outlinks field is not populated when page from seed URL when fetched page contains "refresh" meta tag
Hi Vyacheslav,

Can you provide me an example page with the HTTP refresh tag included? I'll try comparing behaviour between 2.X and master.

Thank you
Lewis

On Sat, Jun 17, 2017 at 9:25 AM, <user-digest-h...@nutch.apache.org> wrote:
> From: Vyacheslav Pascarel <vpasc...@opentext.com>
> To: "user@nutch.apache.org" <user@nutch.apache.org>
> Cc:
> Bcc:
> Date: Fri, 16 Jun 2017 13:18:16 +0000
> Subject: RE: [EXTERNAL] - Re: Outlinks field is not populated when page from seed URL when fetched page contains "refresh" meta tag
>
> It is 2.3.1.
RE: [EXTERNAL] - Re: Outlinks field is not populated when page from seed URL when fetched page contains "refresh" meta tag
It is 2.3.1.

Vyacheslav Pascarel

-----Original Message-----
From: lewis john mcgibbney [mailto:lewi...@apache.org]
Sent: Thursday, June 15, 2017 11:23 PM
To: user@nutch.apache.org
Subject: [EXTERNAL] - Re: Outlinks field is not populated when page from seed URL when fetched page contains "refresh" meta tag

Hi Vyacheslav,

On Thu, Jun 15, 2017 at 1:41 AM, <user-digest-h...@nutch.apache.org> wrote:
>
> From: Vyacheslav Pascarel <vpasc...@opentext.com>
> To: "user@nutch.apache.org" <user@nutch.apache.org>
> Cc:
> Bcc:
> Date: Wed, 14 Jun 2017 22:15:49 +0000
> Subject: Outlinks field is not populated when page from seed URL when fetched page contains "refresh" meta tag
>
> Hello,
>
> I am trying to crawl
> https://urldefense.proofpoint.com/v2/url?u=http-3A__www.msnbc.com_=DwIBaQ=ZgVRmm3mf2P1-XDAyDsu4A=XeO6ShRDVKU6HktuQu5d6DHtkdlyuxMSWDVUj-ZGQKE=y4ak_4BuvKZMwom9X3QBIzAMVLnasMYLebPs0Evj-vk=HjlDXsBCmcJ9B2SZh5E05oDyZfRHKu3rrUcCL1hd0JA=
> but having problem to get anything else beside the original seed URL. The
> INJECT/GENERATE/FETCH steps complete without problems but after executing
> PARSE I see only one outlink pointing to the original seed URL:
>
> ...

Which version of Nutch are you using?

Lewis
Re: Outlinks field is not populated when page from seed URL when fetched page contains "refresh" meta tag
Hi Vyacheslav,

On Thu, Jun 15, 2017 at 1:41 AM, <user-digest-h...@nutch.apache.org> wrote:
>
> From: Vyacheslav Pascarel <vpasc...@opentext.com>
> To: "user@nutch.apache.org" <user@nutch.apache.org>
> Cc:
> Bcc:
> Date: Wed, 14 Jun 2017 22:15:49 +0000
> Subject: Outlinks field is not populated when page from seed URL when fetched page contains "refresh" meta tag
>
> Hello,
>
> I am trying to crawl http://www.msnbc.com/ but having problem to get
> anything else beside the original seed URL. The INJECT/GENERATE/FETCH steps
> complete without problems but after executing PARSE I see only one outlink
> pointing to the original seed URL:
>
> ...

Which version of Nutch are you using?

Lewis
Outlinks field is not populated when page from seed URL when fetched page contains "refresh" meta tag
Hello,

I am trying to crawl http://www.msnbc.com/ but cannot get anything besides the original seed URL. The INJECT/GENERATE/FETCH steps complete without problems, but after executing PARSE I see only one outlink, pointing to the original seed URL:

    "outlinks" : { "http://www*msnbc*com/" : "" }

Executing "bin/nutch parsechecker http://www.msnbc.com" found ~130 outlinks. I removed the crawl results and repeated the steps, but this time ran PARSE in a debugger. Here are my observations:

1. The seed URL page contains a "refresh" meta tag whose target is the seed URL itself: <meta ... http://www.msnbc.com/" />
2. During HtmlParser.getParse(), the meta tag attributes are extracted and an instance of the HTMLMetaTags object is created.
3. HtmlParser.getParse() sets the major code of ParseStatus to ParseStatusCodes.SUCCESS.
4. HtmlParser.getParse() sets the minor code of ParseStatus to ParseStatusCodes.SUCCESS_REDIRECT based on the presence of "refresh" in the HTMLMetaTags object.
5. Upon successful parsing, ParseUtil.process() generates one new "http://www*msnbc*com/" outlink and ignores the ~130 discovered outlinks because pstatus.getMinorCode() == ParseStatusCodes.SUCCESS_REDIRECT.

Based on the above, it seems that a loop is created, since every re-fetch will return the "refresh" tag again and again.

Here are the snippets of Nutch code mentioned:

HtmlParser.getParse()
...
    ParseStatus status = ParseStatus.newBuilder().build();
    status.setMajorCode((int) ParseStatusCodes.SUCCESS);
    if (metaTags.getRefresh()) {
      status.setMinorCode((int) ParseStatusCodes.SUCCESS_REDIRECT);
      status.getArgs().add(new Utf8(metaTags.getRefreshHref().toString()));
      status.getArgs().add(
          new Utf8(Integer.toString(metaTags.getRefreshTime())));
    }
...

ParseUtil.process()
...
    if (ParseStatusUtils.isSuccess(pstatus)) {
      if (pstatus.getMinorCode() == ParseStatusCodes.SUCCESS_REDIRECT) {
        String newUrl = ParseStatusUtils.getMessage(pstatus);
        int refreshTime = Integer.parseInt(ParseStatusUtils.getArg(pstatus, 1));
        try {
          newUrl = normalizers.normalize(newUrl, URLNormalizers.SCOPE_FETCHER);
          if (newUrl == null) {
            LOG.warn("redirect normalized to null " + url);
            return;
          }
          try {
            newUrl = filters.filter(newUrl);
          } catch (URLFilterException e) {
            return;
          }
          if (newUrl == null) {
            LOG.warn("redirect filtered to null " + url);
            return;
          }
        } catch (MalformedURLException e) {
          LOG.warn("malformed url exception parsing redirect " + url);
          return;
        }
        page.getOutlinks().put(new Utf8(newUrl), new Utf8());
        page.getMetadata().put(FetcherJob.REDIRECT_DISCOVERED, TableUtil.YES_VAL);
        if (newUrl == null || newUrl.equals(url)) {
          String reprUrl = URLUtil.chooseRepr(url, newUrl,
              refreshTime < FetcherJob.PERM_REFRESH_TIME);
          if (reprUrl == null) {
            LOG.warn("reprUrl==null for " + url);
            return;
          } else {
            page.setReprUrl(new Utf8(reprUrl));
          }
        }
      } else {
        page.setText(new Utf8(parse.getText()));
        page.setTitle(new Utf8(parse.getTitle()));
        ByteBuffer prevSig = page.getSignature();
        if (prevSig != null) {
          page.setPrevSignature(prevSig);
        }
        final byte[] signature = sig.calculate(page);
        page.setSignature(ByteBuffer.wrap(signature));
        if (page.getOutlinks() != null) {
          page.getOutlinks().clear();
        }
        final Outlink[] outlinks = parse.getOutlinks();
        int outlinksToStore = Math.min(maxOutlinks, outlinks.length);
        String fromHost;
        if (ignoreExternalLinks) {
          try {
            fromHost = new URL(url).getHost().toLowerCase();
          } catch (final MalformedURLException e) {
            fromHost = null;
          }
        } else {
          fromHost = null;
        }
        int validCount = 0;
        for (int i = 0; validCount < outlinksToStore && i < outlinks.length; i++) {
          String toUrl = outlinks[i].getToUrl();
          try {
            toUrl = normalizers.normalize(toUrl, URLNormalizers.SCOPE_OUTLINK);
            toUrl = filters.filter(toUrl);
          } catch (MalformedURLException e2) {
            continue;
          } catch (URLFilterException e) {
            continue;
          }
          if (toUrl == null) {
            continue;
          }
          Utf8 utf8ToUrl = new Utf8(toUrl);
          if (page.getOutlinks().get(utf8ToUrl) != null) {
            // skip duplicate outlinks
            continue;
          }
          String toHost;
          if (ignoreExternalLinks) {
            try {
              toHost = new URL(toUrl).getHost().toLowerCase();
            } catch (final MalformedURLException e) {
              toHost = null;
            }
            if (toHost == null ||