Hello,

If en.wikipedia.org is all you are after, enable urlfilter-domain and add the 
hostname to the domain-urlfilter.txt file; all non-English hyperlinks are then 
discarded.
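
As a rough sketch, assuming the rest of your plugin.includes stays as you 
posted it, the only change in nutch-site.xml would be to list urlfilter-domain 
next to urlfilter-regex:

```xml
<!-- Sketch only: the value is copied from the original message, with
     urlfilter-regex widened to urlfilter-(regex|domain). Adjust it to
     whatever your own nutch-site.xml actually contains. -->
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-(regex|domain)|parse-(tika)|index-(anchor|basic|more|static|replace|links)|indexer-elastic|urlnormalizer-basic|scoring-(opic|similarity)|language-identifier|protocol-httpclient</value>
</property>
```

conf/domain-urlfilter.txt would then hold a single line reading 
en.wikipedia.org. The plugin acts as a whitelist, so URLs on any other host or 
domain are dropped.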

Regards,
Markus

-----Original message-----
> From:Yongyao Jiang <j.yongya...@gmail.com>
> Sent: Thursday 27th July 2017 18:09
> To: user@nutch.apache.org
> Cc: Mcgibbney, Lewis J (398M) <lewis.j.mcgibb...@jpl.nasa.gov>
> Subject: Accept language and url filter not working
> 
> Hi all,
> 
> I am having some issues with the "http.accept.language" and
> "urlfilter-regex" functions. My goal is to collect only english webpages,
> and disregard all "wikipedia" pages.
> 
> 1. I have added the following content in the nutch-site.xml, but the result
> still contains lots of "zh, ca, fr, etc." In addition, I also changed this
> in nutch-default.xml to be safe. Wonder if I need to add a plugin to the
> nutch-site.xml to do this.
> 
> <property>
>   <name>http.accept.language</name>
>   <value>en-us,en-gb,en</value>
>   <description>Value of the "Accept-Language" request header field.
>   This allows selecting non-English language as default one to retrieve.
>   It is a useful setting for search engines build for certain national
> group.
>   </description>
> </property>
> 
> 2. With respect to the "urlfilter-regex", I have added the following
> configurations in nutch-site.xml and regex-urlfilter.txt.
> 
> <property>
>   <name>plugin.includes</name>
>   <value>protocol-http|*urlfilter-regex*
> |parse-(tika)|index-(anchor|basic|more|static|replace|links)|indexer-elastic|urlnormalizer-basic|scoring-(opic|similarity)|language-identifier|protocol-httpclient</value>
> </property>
> 
> *-^.*wikipedia.*$*
> 
> Thanks,
> Yongyao
> 
> 
> -- 
> Yongyao Jiang
> https://www.linkedin.com/in/yongyao-jiang-42516164
> Ph.D. Student in Earth Systems and GeoInformation Sciences
> NSF Spatiotemporal Innovation Center
> George Mason University
> 
