Hello,

If en.wikipedia.org is all you are after, enable urlfilter-domain and add the hostname to the domain-urlfilter.txt file; all non-English hyperlinks are then discarded.
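As a rough sketch of that setup, the configuration might look something like the fragment below. It assumes the plugin list from the original message, with urlfilter-domain added next to urlfilter-regex, and the default conf/domain-urlfilter.txt file location; adjust both to your own install.

```xml
<!-- nutch-site.xml: sketch only. The plugin list is copied from the
     original message; adding urlfilter-domain is the assumed change. -->
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-(regex|domain)|parse-(tika)|index-(anchor|basic|more|static|replace|links)|indexer-elastic|urlnormalizer-basic|scoring-(opic|similarity)|language-identifier|protocol-httpclient</value>
</property>

<!-- conf/domain-urlfilter.txt (assumed default location):
     one domain or hostname per line, for example
     en.wikipedia.org
-->
```

With only en.wikipedia.org listed, URLs on other hosts such as fr.wikipedia.org should be rejected by the domain filter.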
Regards,
Markus

-----Original message-----
> From:Yongyao Jiang <j.yongya...@gmail.com>
> Sent: Thursday 27th July 2017 18:09
> To: user@nutch.apache.org
> Cc: Mcgibbney, Lewis J (398M) <lewis.j.mcgibb...@jpl.nasa.gov>
> Subject: Accept language and url filter not working
>
> Hi all,
>
> I am having some issues with the "http.accept.language" and
> "urlfilter-regex" functions. My goal is to collect only english webpages,
> and disregard all "wikipedia" pages.
>
> 1. I have added the following content in the nutch-site.xml, but the result
> still contains lots of "zh, ca, fr, etc." In addition, I also changed this
> in nutch-default.xml to be safe. Wonder if I need to add a plugin to the
> nutch-site.xml to do this.
>
> <property>
> <name>http.accept.language</name>
> <value>en-us,en-gb,en</value>
> <description>Value of the "Accept-Language" request header field.
> This allows selecting non-English language as default one to retrieve.
> It is a useful setting for search engines build for certain national
> group.
> </description>
> </property>
>
> 2. With respect to the "urlfilter-regex", I have added the following
> configurations in nutch-site.xml and regex-urlfilter.txt.
>
> <property>
> <name>plugin.includes</name>
> <value>protocol-http|*urlfilter-regex*
> |parse-(tika)|index-(anchor|basic|more|static|replace|links)|indexer-elastic|urlnormalizer-basic|scoring-(opic|similarity)|language-identifier|protocol-httpclient</value>
> </property>
>
> *-^.*wikipedia.*$*
>
> Thanks,
> Yongyao
>
>
> --
> Yongyao Jiang
> https://www.linkedin.com/in/yongyao-jiang-42516164
> Ph.D. Student in Earth Systems and GeoInformation Sciences
> NSF Spatiotemporal Innovation Center
> George Mason University
>