Thanks. I am using a custom HTTP plugin, so I will debug with 1.16 to see what's causing it. Thanks for your help.
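
To narrow it down I plan to rerun the same single fetch round twice and only
swap plugin.includes between my plugin and a bundled one. A rough sketch of
what I have in mind ("protocol-custom" is just a placeholder for my plugin's
real id, the paths are illustrative, and seeds/ holds my seed list):

    # one crawl dir per protocol plugin so the runs don't interfere
    # ("protocol-custom" is a placeholder for the real plugin id)
    for p in protocol-okhttp protocol-custom; do
        dir=crawl-$p
        bin/nutch inject $dir/crawldb seeds
        bin/nutch generate $dir/crawldb $dir/segments
        # only one segment exists per dir, so the glob picks it up
        bin/nutch fetch -Dplugin.includes="$p|parse-html" \
            -Dhttp.redirect.max=5 $dir/segments/*
    done

If the redirect chain shows up only in the protocol-okhttp run, the custom
plugin is where I'll look first.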
Regards
Prateek

On Thu, May 6, 2021 at 11:26 AM Sebastian Nagel <wastl.na...@googlemail.com> wrote:
> Hi Prateek,
>
> (sorry, I pressed the wrong reply button, so redirecting the discussion
> back to user@nutch)
>
> > I am not sure what I am missing.
>
> Well, URL filters? Robots.txt? Don't know...
>
> > I am currently using Nutch 1.16
>
> Just to make sure this isn't the cause: there was a bug (NUTCH-2550 [1])
> which caused Fetcher not to follow redirects. But it was fixed already in
> Nutch 1.15.
>
> I've retried using Nutch 1.16:
> - using -Dplugin.includes='protocol-okhttp|parse-html'
>
>     FetcherThread 43 fetching http://wikipedia.com/ (queue crawl delay=3000ms)
>     FetcherThread 43 fetching https://wikipedia.com/ (queue crawl delay=3000ms)
>     FetcherThread 43 fetching https://www.wikipedia.org/ (queue crawl delay=3000ms)
>
> Note: there might be an issue using protocol-http
> (-Dplugin.includes='protocol-http|parse-html') together with Nutch 1.16:
>
>     FetcherThread 43 fetching https://wikipedia.com/ (queue crawl delay=3000ms)
>     FetcherThread 43 fetching https://wikipedia.com/ (queue crawl delay=3000ms)
>     Couldn't get robots.txt for https://wikipedia.com/: java.net.SocketException: Socket is closed
>     FetcherThread 43 fetching https://www.wikipedia.org/ (queue crawl delay=3000ms)
>     FetcherThread 43 fetching https://www.wikipedia.org/ (queue crawl delay=3000ms)
>     Couldn't get robots.txt for https://www.wikipedia.org/: java.net.SocketException: Socket is closed
>     Failed to get protocol output java.net.SocketException: Socket is closed
>         at sun.security.ssl.SSLSocketImpl.getOutputStream(SSLSocketImpl.java:1109)
>         at org.apache.nutch.protocol.http.HttpResponse.<init>(HttpResponse.java:162)
>         at org.apache.nutch.protocol.http.Http.getResponse(Http.java:63)
>         at org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:375)
>         at org.apache.nutch.fetcher.FetcherThread.run(FetcherThread.java:343)
>     FetcherThread 43 fetch of https://www.wikipedia.org/ failed with: java.net.SocketException: Socket is closed
>
> But it's not reproducible using Nutch master / 1.18 - as it relates to
> HTTPS/SSL it's likely fixed by NUTCH-2794 [2].
>
> In any case, could you try to reproduce the problem using Nutch 1.18?
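>
> A quick way to try it (an untested sketch against a checkout of Nutch
> master; the seed and crawl paths are illustrative):
>
>     # build Nutch from source and work from the local runtime
>     git clone https://github.com/apache/nutch.git && cd nutch
>     ant runtime && cd runtime/local
>
>     # single inject/generate/fetch round over your two seeds
>     mkdir seeds
>     printf 'http://wikipedia.com/\nhttps://zyfro.com/\n' > seeds/seed.txt
>     bin/nutch inject crawl/crawldb seeds
>     bin/nutch generate crawl/crawldb crawl/segments
>     bin/nutch fetch -Dhttp.redirect.max=5 crawl/segments/*
>
> If the redirects are followed you should see all three wikipedia URLs in
> the fetcher log, as above.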
>
> Best,
> Sebastian
>
> [1] https://issues.apache.org/jira/browse/NUTCH-2550
> [2] https://issues.apache.org/jira/browse/NUTCH-2794
>
> On 5/6/21 11:54 AM, prateek wrote:
> > Thanks for your reply Sebastian.
> >
> > I am using http.redirect.max=5 for my setup.
> > As seed URLs, I am only passing http://wikipedia.com/ and
> > https://zyfro.com/. The CrawlDatum and ParseData shared in my earlier
> > email are from the http://wikipedia.com/ URL.
> > I don't see the other redirected URLs in the logs or segments. Here is
> > my log -
> >
> >     2021-05-05 17:35:23,854 INFO [main] org.apache.nutch.fetcher.FetcherThread: FetcherThread 1 Using queue mode : byHost
> >     2021-05-05 17:35:23,854 INFO [main] org.apache.nutch.fetcher.Fetcher: Fetcher: throughput threshold: -1
> >     2021-05-05 17:35:23,854 INFO [main] org.apache.nutch.fetcher.Fetcher: Fetcher: throughput threshold retries: 5
> >     2021-05-05 17:35:23,855 INFO [main] org.apache.nutch.fetcher.FetcherThread: FetcherThread 50 fetching http://wikipedia.com/ (queue crawl delay=1000ms)
> >     2021-05-05 17:35:29,095 INFO [main] org.apache.nutch.fetcher.FetcherThread: FetcherThread 50 fetching https://zyfro.com/ (queue crawl delay=1000ms)
> >     2021-05-05 17:35:29,095 INFO [main] com.**.nutchplugin.http.Http: fetching https://zyfro.com/robots.txt
> >     2021-05-05 17:35:29,862 INFO [main] org.apache.nutch.fetcher.Fetcher: -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0, fetchQueues.getQueueCount=1
> >     2021-05-05 17:35:30,189 INFO [main] com.**.nutchplugin.http.Http: fetching https://zyfro.com/
> >     2021-05-05 17:35:30,786 INFO [main] org.apache.nutch.fetcher.FetcherThread: FetcherThread 50 has no more work available
> >
> > I am not sure what I am missing.
> >
> > Regards
> > Prateek
> >
> > On Thu, May 6, 2021 at 10:21 AM Sebastian Nagel <wastl.na...@googlemail.com> wrote:
> > > Hi Prateek,
> > >
> > > could you share information about all pages/URLs in the redirect chain?
> > >
> > >     http://wikipedia.com/
> > >     https://wikipedia.com/
> > >     https://www.wikipedia.org/
> > >
> > > If I'm not wrong, the shown CrawlDatum and ParseData stem from
> > > https://www.wikipedia.org/ and carry _http_status_code_=200.
> > > So, it looks like the redirects have been followed.
> > >
> > > Note: all 3 URLs should have records in the segment and the CrawlDb.
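> > >
> > > You can verify that with the command-line readers, e.g. (a sketch -
> > > the crawl and output paths are illustrative, adjust to your layout):
> > >
> > >     # per-URL record in the CrawlDb
> > >     bin/nutch readdb crawl/crawldb -url http://wikipedia.com/
> > >     bin/nutch readdb crawl/crawldb -url https://www.wikipedia.org/
> > >
> > >     # dump the segment (minus raw content) and list the record URLs
> > >     bin/nutch readseg -dump crawl/segments/20210505173059 segdump -nocontent
> > >     grep '^URL::' segdump/dump
> > >
> > > If a redirect was followed, both the source and the target URLs
> > > should show up.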
> > >
> > > I've also verified that the above redirect chain is followed by Fetcher
> > > with the following settings (passed on the command-line via -D) using
> > > Nutch master (1.18):
> > >
> > >     -Dhttp.redirect.max=3
> > >     -Ddb.ignore.external.links=true
> > >     -Ddb.ignore.external.links.mode=byDomain
> > >     -Ddb.ignore.also.redirects=false
> > >
> > > Fetcher log snippets:
> > >
> > >     FetcherThread 51 fetching http://wikipedia.com/ (queue crawl delay=3000ms)
> > >     FetcherThread 51 fetching https://wikipedia.com/ (queue crawl delay=3000ms)
> > >     FetcherThread 51 fetching https://www.wikipedia.org/ (queue crawl delay=3000ms)
> > >
> > > Just in case: what's the value of the property http.redirect.max?
> > >
> > > Best,
> > > Sebastian
> > >
> > > On 5/5/21 8:09 PM, prateek wrote:
> > > > Hi,
> > > >
> > > > I am currently using Nutch 1.16 with the properties below -
> > > >
> > > >     db.ignore.external.links=true
> > > >     db.ignore.external.links.mode=byDomain
> > > >     db.ignore.also.redirects=false
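> > > >
> > > > For reference, passing the same settings explicitly on the command
> > > > line when fetching would look roughly like this (a sketch - the
> > > > crawl and segment paths are only illustrative here):
> > > >
> > > >     bin/nutch fetch -Ddb.ignore.external.links=true \
> > > >         -Ddb.ignore.external.links.mode=byDomain \
> > > >         -Ddb.ignore.also.redirects=false \
> > > >         crawl/segments/20210505173059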
> > > >
> > > > When I am crawling websites that are redirecting (301 HTTP code) using
> > > > Nutch (for example, https://zyfro.com/ and http://wikipedia.com/), I see
> > > > that the new redirected URL is not captured by Nutch. Even the outlinks
> > > > point to the original URL provided, and the status returned is 200.
> > > > So my questions are:
> > > > 1. How do I capture the new URL?
> > > > 2. Is there a way to allow Nutch to capture the 301 status and the new
> > > >    URL, and then crawl the related content?
> > > >
> > > > Here is the CrawlDatum and ParseData structure for http://wikipedia.com/,
> > > > which gets redirected to wikipedia.org:
> > > >
> > > > CrawlDatum :
> > > >     Version: 7
> > > >     Status: 33 (fetch_success)
> > > >     Fetch time: Wed May 05 17:35:29 UTC 2021
> > > >     Modified time: Thu Jan 01 00:00:00 UTC 1970
> > > >     Retries since fetch: 0
> > > >     Retry interval: 31536000 seconds (365 days)
> > > >     Score: 2.0
> > > >     Signature: null
> > > >     Metadata:
> > > >         _ngt_=1620235730883
> > > >         _depth_=1
> > > >         _http_status_code_=200
> > > >         _pst_=success(1), lastModified=1620038693000
> > > >         _rs_=410
> > > >         Content-Type=text/html
> > > >         _maxdepth_=1000
> > > >         nutch.protocol.code=200
> > > >
> > > > ParseData :
> > > >     Version: 5
> > > >     Status: success(1,0)
> > > >     Title: Wikipedia
> > > >     Outlinks: 1
> > > >         outlink: toUrl: http://wikipedia.com/portal/wikipedia.org/assets/img/Wikipedia-logo-v2.png anchor: Wikipedia
> > > >     Content Metadata:
> > > >         _depth_=1
> > > >         Server=ATS/8.0.8
> > > >         nutch.content.digest=bc4a6cee4d559c44fbc839c9f2b4a449
> > > >         Server-Timing=cache;desc="hit-front", host;desc="cp1081"
> > > >         Permissions-Policy=interest-cohort=()
> > > >         Last-Modified=Mon, 03 May 2021 10:44:53 GMT
> > > >         Strict-Transport-Security=max-age=106384710; includeSubDomains; preload
> > > >         X-Cache-Status=hit-front
> > > >         Report-To={ "group": "wm_nel", "max_age": 86400, "endpoints": [{ "url": "https://intake-logging.wikimedia.org/v1/events?stream=w3c.reportingapi.network_error&schema_uri=/w3c/reportingapi/network_error/1.0.0" }] }
> > > >         Age=27826
> > > >         Content-Type=text/html
> > > >         X-Cache=cp1079 hit, cp1081 hit/578233
> > > >         Connection=keep-alive
> > > >         _maxdepth_=1000
> > > >         X-Client-IP=108.174.5.114
> > > >         Date=Wed, 05 May 2021 09:51:42 GMT
> > > >         nutch.crawl.score=2.0
> > > >         Accept-Ranges=bytes
> > > >         nutch.segment.name=20210505173059
> > > >         Cache-Control=s-maxage=86400, must-revalidate, max-age=3600
> > > >         NEL={ "report_to": "wm_nel", "max_age": 86400, "failure_fraction": 0.05, "success_fraction": 0.0}
> > > >         ETag=W/"11e90-5c16aa6d9b068"
> > > >         Vary=Accept-Encoding
> > > >         X-LI-Tracking-Id=treeId:AAAAAAAAAAAAAAAAAAAAAA==|ts:1620236128580|cc:14fb413b|sc:490b9459|req:JTwZc4y|src:/10.148.138.11:17671|dst:www.wikipedia.org|principal:hadoop-test
> > > >         _fst_=33
> > > >     Parse Metadata:
> > > >         CharEncodingForConversion=utf-8
> > > >         OriginalCharEncoding=utf-8
> > > >         _depth_=1
> > > >         viewport=initial-scale=1,user-scalable=yes
> > > >         metatag.description=Wikipedia is a free online encyclopedia, created and edited by volunteers around the world and hosted by the Wikimedia Foundation.
> > > >         description=Wikipedia is a free online encyclopedia, created and edited by volunteers around the world and hosted by the Wikimedia Foundation.
> > > >         _maxdepth_=1000
> > > >
> > > > Thanks
> > > > Prateek