Well, the Accept-Language header does not prevent sites from returning anything they like, regardless of your header.
If you want only English, you need a language detector and the HostDB. Using readhostdb and a Jexl-expression you can get a list of hosts that are non-English. Get that list, and use domainblacklist-urlfilter.txt.

Markus

-----Original message-----
> From:Yongyao Jiang <[email protected]>
> Sent: Thursday 27th July 2017 18:23
> To: [email protected]
> Subject: Re: Accept language and url filter not working
>
> Thanks, Markus. Actually, I don't need wikipedia at all even if they are in
> English, so I think this urlfilter-domain won't work.
>
> Yongyao
>
> On Thu, Jul 27, 2017 at 12:18 PM, Markus Jelsma <[email protected]>
> wrote:
>
> > Hello,
> >
> > If en.wikipedia.org is all you are after, enable urlfilter-domain, add
> > the hostname to the domain-urlfilter.txt file and all non-english
> > hyperlinks are discarded.
> >
> > Regards,
> > Markus
> >
> > -----Original message-----
> > > From:Yongyao Jiang <[email protected]>
> > > Sent: Thursday 27th July 2017 18:09
> > > To: [email protected]
> > > Cc: Mcgibbney, Lewis J (398M) <[email protected]>
> > > Subject: Accept language and url filter not working
> > >
> > > Hi all,
> > >
> > > I am having some issues with the "http.accept.language" and
> > > "urlfilter-regex" functions. My goal is to collect only english webpages,
> > > and disregard all "wikipedia" pages.
> > >
> > > 1. I have added the following content in the nutch-site.xml, but the result
> > > still contains lots of "zh, ca, fr, etc." In addition, I also changed this
> > > in nutch-default.xml to be safe. Wonder if I need to add a plugin to the
> > > nutch-site.xml to do this.
> > >
> > > <property>
> > > <name>http.accept.language</name>
> > > <value>en-us,en-gb,en</value>
> > > <description>Value of the "Accept-Language" request header field.
> > > This allows selecting non-English language as default one to retrieve.
> > > It is a useful setting for search engines build for certain national
> > > group.
> > > </description>
> > > </property>
> > >
> > > 2. With respect to the "urlfilter-regex", I have added the following
> > > configurations in nutch-site.xml and regex-urlfilter.txt.
> > >
> > > <property>
> > > <name>plugin.includes</name>
> > > <value>protocol-http|*urlfilter-regex*
> > > |parse-(tika)|index-(anchor|basic|more|static|replace|links)|indexer-elastic|urlnormalizer-basic|scoring-(opic|similarity)|language-identifier|protocol-httpclient</value>
> > > </property>
> > >
> > > *-^.*wikipedia.*$*
> > >
> > > Thanks,
> > > Yongyao
> > >
> > >
> > > --
> > > Yongyao Jiang
> > > https://www.linkedin.com/in/yongyao-jiang-42516164
> > > Ph.D. Student in Earth Systems and GeoInformation Sciences
> > > NSF Spatiotemporal Innovation Center
> > > George Mason University
> > >
> > >
> >
> --
> Yongyao Jiang
> https://www.linkedin.com/in/yongyao-jiang-42516164
> Ph.D. Student in Earth Systems and GeoInformation Sciences
> NSF Spatiotemporal Innovation Center
> George Mason University
>
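
P.S. The last step suggested at the top of this reply (turn a list of non-English hosts into domainblacklist-urlfilter.txt) could be sketched roughly as follows. This is only an illustration under assumptions: the host-to-language mapping is hypothetical input that you would have to build yourself from the readhostdb output and a language detector, the function names are made up for this example, and the one-host-per-line layout should be checked against the sample file shipped with your Nutch version.

```python
from typing import Dict, List


def non_english_hosts(host_languages: Dict[str, str]) -> List[str]:
    """Return the hosts whose detected language is not English, sorted.

    host_languages is a hypothetical mapping of hostname -> language tag
    (e.g. "en", "en-US", "fr"); it is not something Nutch emits in this form.
    """
    hosts = []
    for host, lang in host_languages.items():
        # Compare only the primary subtag, so "en-US" and "en_GB" count as English.
        primary = lang.lower().replace("_", "-").split("-")[0]
        if primary != "en":
            hosts.append(host)
    return sorted(hosts)


def render_blacklist(hosts: List[str]) -> str:
    """Render hosts one per line (assumed layout of the blacklist file)."""
    return "".join(host + "\n" for host in hosts)


if __name__ == "__main__":
    # Hypothetical sample data for illustration only.
    detected = {
        "example.fr": "fr",
        "example.com": "en-US",
        "zh.example.org": "zh",
    }
    print(render_blacklist(non_english_hosts(detected)), end="")
```

The printed lines can be redirected into domainblacklist-urlfilter.txt in your conf directory. Note that this only blocks hosts already known to be non-English; it does nothing about the wikipedia rule in regex-urlfilter.txt.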

