Thanks. I am using a custom HTTP plugin, so I will debug with 1.16 to see what's causing it. Thanks for your help.
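
To narrow it down I plan to rerun the same single fetch round twice and only
swap plugin.includes between my plugin and a bundled one. A rough sketch of
what I have in mind ("protocol-custom" is just a placeholder for my plugin's
real id, the paths are illustrative, and seeds/ holds my seed list):

    # one crawl dir per protocol plugin so the runs don't interfere
    # ("protocol-custom" is a placeholder for the real plugin id)
    for p in protocol-okhttp protocol-custom; do
        dir=crawl-$p
        bin/nutch inject $dir/crawldb seeds
        bin/nutch generate $dir/crawldb $dir/segments
        # only one segment exists per dir, so the glob picks it up
        bin/nutch fetch -Dplugin.includes="$p|parse-html" \
            -Dhttp.redirect.max=5 $dir/segments/*
    done

If the redirect chain shows up only in the protocol-okhttp run, the custom
plugin is where I'll look first.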
Regards
Prateek

On Thu, May 6, 2021 at 11:26 AM Sebastian Nagel <wastl.na...@googlemail.com> wrote:
> Hi Prateek,
>
> (sorry, I pressed the wrong reply button, so redirecting the discussion
> back to user@nutch)
>
> > I am not sure what I am missing.
>
> Well, URL filters? Robots.txt? Don't know...
>
> > I am currently using Nutch 1.16
>
> Just to make sure this isn't the cause: there was a bug (NUTCH-2550 [1])
> which caused Fetcher not to follow redirects. But it was fixed already in
> Nutch 1.15.
>
> I've retried using Nutch 1.16:
> - using -Dplugin.includes='protocol-okhttp|parse-html'
>
>     FetcherThread 43 fetching http://wikipedia.com/ (queue crawl delay=3000ms)
>     FetcherThread 43 fetching https://wikipedia.com/ (queue crawl delay=3000ms)
>     FetcherThread 43 fetching https://www.wikipedia.org/ (queue crawl delay=3000ms)
>
> Note: there might be an issue using protocol-http
> (-Dplugin.includes='protocol-http|parse-html') together with Nutch 1.16:
>
>     FetcherThread 43 fetching https://wikipedia.com/ (queue crawl delay=3000ms)
>     FetcherThread 43 fetching https://wikipedia.com/ (queue crawl delay=3000ms)
>     Couldn't get robots.txt for https://wikipedia.com/: java.net.SocketException: Socket is closed
>     FetcherThread 43 fetching https://www.wikipedia.org/ (queue crawl delay=3000ms)
>     FetcherThread 43 fetching https://www.wikipedia.org/ (queue crawl delay=3000ms)
>     Couldn't get robots.txt for https://www.wikipedia.org/: java.net.SocketException: Socket is closed
>     Failed to get protocol output java.net.SocketException: Socket is closed
>         at sun.security.ssl.SSLSocketImpl.getOutputStream(SSLSocketImpl.java:1109)
>         at org.apache.nutch.protocol.http.HttpResponse.<init>(HttpResponse.java:162)
>         at org.apache.nutch.protocol.http.Http.getResponse(Http.java:63)
>         at org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:375)
>         at org.apache.nutch.fetcher.FetcherThread.run(FetcherThread.java:343)
>     FetcherThread 43 fetch of https://www.wikipedia.org/ failed with: java.net.SocketException: Socket is closed
>
> But it's not reproducible using Nutch master / 1.18 - as it relates to
> HTTPS/SSL it's likely fixed by NUTCH-2794 [2].
>
> In any case, could you try to reproduce the problem using Nutch 1.18?
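>
> A quick way to try it (an untested sketch against a checkout of Nutch
> master; the seed and crawl paths are illustrative):
>
>     # build Nutch from source and work from the local runtime
>     git clone https://github.com/apache/nutch.git && cd nutch
>     ant runtime && cd runtime/local
>
>     # single inject/generate/fetch round over your two seeds
>     mkdir seeds
>     printf 'http://wikipedia.com/\nhttps://zyfro.com/\n' > seeds/seed.txt
>     bin/nutch inject crawl/crawldb seeds
>     bin/nutch generate crawl/crawldb crawl/segments
>     bin/nutch fetch -Dhttp.redirect.max=5 crawl/segments/*
>
> If the redirects are followed you should see all three wikipedia URLs in
> the fetcher log, as above.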
>
> Best,
> Sebastian
>
> [1] https://issues.apache.org/jira/browse/NUTCH-2550
> [2] https://issues.apache.org/jira/browse/NUTCH-2794
>
> On 5/6/21 11:54 AM, prateek wrote:
> > Thanks for your reply Sebastian.
> >
> > I am using http.redirect.max=5 for my setup.
> > As seed URLs, I am only passing http://wikipedia.com/ and
> > https://zyfro.com/. The CrawlDatum and ParseData shared in my earlier
> > email are from the http://wikipedia.com/ URL.
> > I don't see the other redirected URLs in the logs or segments. Here is
> > my log -
> >
> >     2021-05-05 17:35:23,854 INFO [main] org.apache.nutch.fetcher.FetcherThread: FetcherThread 1 Using queue mode : byHost
> >     2021-05-05 17:35:23,854 INFO [main] org.apache.nutch.fetcher.Fetcher: Fetcher: throughput threshold: -1
> >     2021-05-05 17:35:23,854 INFO [main] org.apache.nutch.fetcher.Fetcher: Fetcher: throughput threshold retries: 5
> >     2021-05-05 17:35:23,855 INFO [main] org.apache.nutch.fetcher.FetcherThread: FetcherThread 50 fetching http://wikipedia.com/ (queue crawl delay=1000ms)
> >     2021-05-05 17:35:29,095 INFO [main] org.apache.nutch.fetcher.FetcherThread: FetcherThread 50 fetching https://zyfro.com/ (queue crawl delay=1000ms)
> >     2021-05-05 17:35:29,095 INFO [main] com.**.nutchplugin.http.Http: fetching https://zyfro.com/robots.txt
> >     2021-05-05 17:35:29,862 INFO [main] org.apache.nutch.fetcher.Fetcher: -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0, fetchQueues.getQueueCount=1
> >     2021-05-05 17:35:30,189 INFO [main] com.**.nutchplugin.http.Http: fetching https://zyfro.com/
> >     2021-05-05 17:35:30,786 INFO [main] org.apache.nutch.fetcher.FetcherThread: FetcherThread 50 has no more work available
> >
> > I am not sure what I am missing.
> >
> > Regards
> > Prateek
> >
> > On Thu, May 6, 2021 at 10:21 AM Sebastian Nagel <wastl.na...@googlemail.com> wrote:
> > > Hi Prateek,
> > >
> > > could you share information about all pages/URLs in the redirect chain?
> > >
> > >     http://wikipedia.com/
> > >     https://wikipedia.com/
> > >     https://www.wikipedia.org/
> > >
> > > If I'm not wrong, the shown CrawlDatum and ParseData stem from
> > > https://www.wikipedia.org/ and carry _http_status_code_=200.
> > > So, it looks like the redirects have been followed.
> > >
> > > Note: all 3 URLs should have records in the segment and the CrawlDb.
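> > >
> > > You can verify that with the command-line readers, e.g. (a sketch -
> > > the crawl and output paths are illustrative, adjust to your layout):
> > >
> > >     # per-URL record in the CrawlDb
> > >     bin/nutch readdb crawl/crawldb -url http://wikipedia.com/
> > >     bin/nutch readdb crawl/crawldb -url https://www.wikipedia.org/
> > >
> > >     # dump the segment (minus raw content) and list the record URLs
> > >     bin/nutch readseg -dump crawl/segments/20210505173059 segdump -nocontent
> > >     grep '^URL::' segdump/dump
> > >
> > > If a redirect was followed, both the source and the target URLs
> > > should show up.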
> > >
> > > I've also verified that the above redirect chain is followed by Fetcher
> > > with the following settings (passed on the command-line via -D) using
> > > Nutch master (1.18):
> > >
> > >     -Dhttp.redirect.max=3
> > >     -Ddb.ignore.external.links=true
> > >     -Ddb.ignore.external.links.mode=byDomain
> > >     -Ddb.ignore.also.redirects=false
> > >
> > > Fetcher log snippets:
> > >
> > >     FetcherThread 51 fetching http://wikipedia.com/ (queue crawl delay=3000ms)
> > >     FetcherThread 51 fetching https://wikipedia.com/ (queue crawl delay=3000ms)
> > >     FetcherThread 51 fetching https://www.wikipedia.org/ (queue crawl delay=3000ms)
> > >
> > > Just in case: what's the value of the property http.redirect.max?
> > >
> > > Best,
> > > Sebastian
> > >
> > > On 5/5/21 8:09 PM, prateek wrote:
> > > > Hi,
> > > >
> > > > I am currently using Nutch 1.16 with the properties below -
> > > >
> > > >     db.ignore.external.links=true
> > > >     db.ignore.external.links.mode=byDomain
> > > >     db.ignore.also.redirects=false
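> > > >
> > > > For reference, passing the same settings explicitly on the command
> > > > line when fetching would look roughly like this (a sketch - the
> > > > crawl and segment paths are only illustrative here):
> > > >
> > > >     bin/nutch fetch -Ddb.ignore.external.links=true \
> > > >         -Ddb.ignore.external.links.mode=byDomain \
> > > >         -Ddb.ignore.also.redirects=false \
> > > >         crawl/segments/20210505173059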
> > > >
> > > > When I am crawling websites that are redirecting (301 HTTP code) using
> > > > Nutch (for example, https://zyfro.com/ and http://wikipedia.com/), I see
> > > > that the new redirected URL is not captured by Nutch. Even the outlinks
> > > > point to the original URL provided, and the status returned is 200.
> > > > So my questions are:
> > > > 1. How do I capture the new URL?
> > > > 2. Is there a way to allow Nutch to capture the 301 status and the new
> > > >    URL, and then crawl the related content?
> > > >
> > > > Here is the CrawlDatum and ParseData structure for http://wikipedia.com/,
> > > > which gets redirected to wikipedia.org:
> > > >
> > > > CrawlDatum :
> > > >     Version: 7
> > > >     Status: 33 (fetch_success)
> > > >     Fetch time: Wed May 05 17:35:29 UTC 2021
> > > >     Modified time: Thu Jan 01 00:00:00 UTC 1970
> > > >     Retries since fetch: 0
> > > >     Retry interval: 31536000 seconds (365 days)
> > > >     Score: 2.0
> > > >     Signature: null
> > > >     Metadata:
> > > >         _ngt_=1620235730883
> > > >         _depth_=1
> > > >         _http_status_code_=200
> > > >         _pst_=success(1), lastModified=1620038693000
> > > >         _rs_=410
> > > >         Content-Type=text/html
> > > >         _maxdepth_=1000
> > > >         nutch.protocol.code=200
> > > >
> > > > ParseData :
> > > >     Version: 5
> > > >     Status: success(1,0)
> > > >     Title: Wikipedia
> > > >     Outlinks: 1
> > > >         outlink: toUrl: http://wikipedia.com/portal/wikipedia.org/assets/img/Wikipedia-logo-v2.png anchor: Wikipedia
> > > >     Content Metadata:
> > > >         _depth_=1
> > > >         Server=ATS/8.0.8
> > > >         nutch.content.digest=bc4a6cee4d559c44fbc839c9f2b4a449
> > > >         Server-Timing=cache;desc="hit-front", host;desc="cp1081"
> > > >         Permissions-Policy=interest-cohort=()
> > > >         Last-Modified=Mon, 03 May 2021 10:44:53 GMT
> > > >         Strict-Transport-Security=max-age=106384710; includeSubDomains; preload
> > > >         X-Cache-Status=hit-front
> > > >         Report-To={ "group": "wm_nel", "max_age": 86400, "endpoints": [{ "url": "https://intake-logging.wikimedia.org/v1/events?stream=w3c.reportingapi.network_error&schema_uri=/w3c/reportingapi/network_error/1.0.0" }] }
> > > >         Age=27826
> > > >         Content-Type=text/html
> > > >         X-Cache=cp1079 hit, cp1081 hit/578233
> > > >         Connection=keep-alive
> > > >         _maxdepth_=1000
> > > >         X-Client-IP=108.174.5.114
> > > >         Date=Wed, 05 May 2021 09:51:42 GMT
> > > >         nutch.crawl.score=2.0
> > > >         Accept-Ranges=bytes
> > > >         nutch.segment.name=20210505173059
> > > >         Cache-Control=s-maxage=86400, must-revalidate, max-age=3600
> > > >         NEL={ "report_to": "wm_nel", "max_age": 86400, "failure_fraction": 0.05, "success_fraction": 0.0}
> > > >         ETag=W/"11e90-5c16aa6d9b068"
> > > >         Vary=Accept-Encoding
> > > >         X-LI-Tracking-Id=treeId:AAAAAAAAAAAAAAAAAAAAAA==|ts:1620236128580|cc:14fb413b|sc:490b9459|req:JTwZc4y|src:/10.148.138.11:17671|dst:www.wikipedia.org|principal:hadoop-test
> > > >         _fst_=33
> > > >     Parse Metadata:
> > > >         CharEncodingForConversion=utf-8
> > > >         OriginalCharEncoding=utf-8
> > > >         _depth_=1
> > > >         viewport=initial-scale=1,user-scalable=yes
> > > >         metatag.description=Wikipedia is a free online encyclopedia, created and edited by volunteers around the world and hosted by the Wikimedia Foundation.
> > > >         description=Wikipedia is a free online encyclopedia, created and edited by volunteers around the world and hosted by the Wikimedia Foundation.
> > > >         _maxdepth_=1000
> > > >
> > > > Thanks
> > > > Prateek