Thanks, Markus. Actually, I don't want any Wikipedia pages at all, even the English ones, so I don't think urlfilter-domain will work for this.
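
A minimal Python sketch of what the reject rule quoted below is expected to do. It assumes the regex URL filter reads rules top to bottom, lets the first matching rule decide, and drops URLs that match no rule. The helper names (`parse_rules`, `url_passes`), the trailing `+.` catch-all rule, and the sample URLs are hypothetical and not taken from the actual setup; this is not Nutch code, only a way to check the pattern and the rule order in isolation.

```python
import re

# Hypothetical stand-in for regex-urlfilter.txt: the reject rule from this
# thread, followed by a catch-all accept rule (assumed, not from the thread).
RULES = [
    "-^.*wikipedia.*$",
    "+.",
]


def parse_rules(lines):
    """Turn '+regex' / '-regex' lines into (accept, compiled_pattern) pairs."""
    parsed = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        sign, pattern = line[0], line[1:]
        if sign not in ("+", "-"):
            raise ValueError("rule must start with '+' or '-': " + line)
        parsed.append((sign == "+", re.compile(pattern)))
    return parsed


def url_passes(url, rules):
    """Assumed semantics: the first rule whose pattern is found in the URL
    decides; a URL that matches no rule is rejected."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False


if __name__ == "__main__":
    rules = parse_rules(RULES)
    for url in (
        "https://en.wikipedia.org/wiki/Apache_Nutch",
        "https://zh.wikipedia.org/",
        "https://nutch.apache.org/",
    ):
        print(url, "->", "keep" if url_passes(url, rules) else "drop")
```

Under these assumptions the order of the rules matters: if a catch-all accept rule such as `+.` came before the wikipedia line, it would match first and the reject rule would never be reached.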
Yongyao

On Thu, Jul 27, 2017 at 12:18 PM, Markus Jelsma <[email protected]> wrote:

> Hello,
>
> If en.wikipedia.org is all you are after, enabled urlfilter-domain, add
> the hostname to the domain-urlfilter.txt file and all non-english
> hyperlinks are discarded.
>
> Regards,
> Markus
>
> -----Original message-----
> > From:Yongyao Jiang <[email protected]>
> > Sent: Thursday 27th July 2017 18:09
> > To: [email protected]
> > Cc: Mcgibbney, Lewis J (398M) <[email protected]>
> > Subject: Accept language and url filter not working
> >
> > Hi all,
> >
> > I am having some issues with the "http.accept.language" and
> > "urlfilter-regex" functions. My goal is to collect only english webpages,
> > and disregard all "wikipedia" pages.
> >
> > 1. I have added the following content in the nutch-site.xml, but the result
> > still contains lots of "zh, ca, fr, etc." In addition, I also changed this
> > in nutch-default.xml to be safe. Wonder if I need to add a plugin to the
> > nutch-site.xml to do this.
> >
> > <property>
> > <name>http.accept.language</name>
> > <value>en-us,en-gb,en</value>
> > <description>Value of the "Accept-Language" request header field.
> > This allows selecting non-English language as default one to retrieve.
> > It is a useful setting for search engines build for certain national
> > group.
> > </description>
> > </property>
> >
> > 2. With respect to the "urlfilter-regex", I have added the following
> > configurations in nutch-site.xml and regex-urlfilter.txt.
> >
> > <property>
> > <name>plugin.includes</name>
> > <value>protocol-http|*urlfilter-regex*|parse-(tika)|index-(anchor|basic|more|static|replace|links)|indexer-elastic|urlnormalizer-basic|scoring-(opic|similarity)|language-identifier|protocol-httpclient</value>
> > </property>
> >
> > *-^.*wikipedia.*$*
> >
> > Thanks,
> > Yongyao
> >
> > --
> > Yongyao Jiang
> > https://www.linkedin.com/in/yongyao-jiang-42516164
> > Ph.D. Student in Earth Systems and GeoInformation Sciences
> > NSF Spatiotemporal Innovation Center
> > George Mason University

--
Yongyao Jiang
https://www.linkedin.com/in/yongyao-jiang-42516164
Ph.D. Student in Earth Systems and GeoInformation Sciences
NSF Spatiotemporal Innovation Center
George Mason University

