RE: Accept language and url filter not working

Markus Jelsma Thu, 27 Jul 2017 09:31:52 -0700

Well, the Accept-Language header does not prevent sites from returning anything 
they like, regardless of your header.


If you want only English, you need a language detector and the HostDB. Using 
readhostdb and a Jexl-expression you can get a list of hosts that are 
non-English. Get that list, and use domainblacklist-urlfilter.txt.
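
As a rough illustration of that last step only (a sketch, not anything that 
ships with Nutch): suppose you have already pulled a host-to-language mapping 
out of the readhostdb dump. The dict, the function names and the output path 
below are all hypothetical, and I am assuming the blacklist file takes one 
host per line.

```python
from typing import Dict, Iterable, List


def non_english_hosts(host_languages: Dict[str, str]) -> List[str]:
    """Return, sorted, the hosts whose detected language is not English.

    host_languages is a hypothetical mapping of hostname -> language code
    (e.g. "en", "en-US", "zh"); how you obtain it from the HostDB dump is
    up to you and is not shown here.
    """
    result = []
    for host, lang in host_languages.items():
        # Compare only the primary subtag, so "en-US" and "en_GB" count as English.
        primary = lang.strip().lower().replace("_", "-").split("-")[0]
        if primary != "en":
            result.append(host)
    return sorted(result)


def write_blacklist(hosts: Iterable[str], path: str) -> None:
    """Write one host per line (assumed format of domainblacklist-urlfilter.txt)."""
    with open(path, "w", encoding="utf-8") as fh:
        for host in hosts:
            fh.write(host + "\n")
```

You would then call something like 
write_blacklist(non_english_hosts(mapping), "conf/domainblacklist-urlfilter.txt"), 
with the path adjusted to your own installation.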

Markus

 
 
-----Original message-----
> From:Yongyao Jiang <[email protected]>
> Sent: Thursday 27th July 2017 18:23
> To: [email protected]
> Subject: Re: Accept language and url filter not working
> 
> Thanks, Markus. Actually, I don't need wikipedia at all even if they are in
> English, so I think this urlfilter-domain won't work.
> 
> Yongyao
> 
> On Thu, Jul 27, 2017 at 12:18 PM, Markus Jelsma <[email protected]>
> wrote:
> 
> > Hello,
> >
> > If en.wikipedia.org is all you are after, enable urlfilter-domain, add
> > the hostname to the domain-urlfilter.txt file and all non-English
> > hyperlinks are discarded.
> >
> > Regards,
> > Markus
> >
> > -----Original message-----
> > > From:Yongyao Jiang <[email protected]>
> > > Sent: Thursday 27th July 2017 18:09
> > > To: [email protected]
> > > Cc: Mcgibbney, Lewis J (398M) <[email protected]>
> > > Subject: Accept language and url filter not working
> > >
> > > Hi all,
> > >
> > > I am having some issues with the "http.accept.language" and
> > > "urlfilter-regex" functions. My goal is to collect only English webpages,
> > > and disregard all "wikipedia" pages.
> > >
> > > 1. I have added the following content in the nutch-site.xml, but the
> > > result still contains lots of "zh, ca, fr, etc." In addition, I also
> > > changed this in nutch-default.xml to be safe. Wonder if I need to add a
> > > plugin to the nutch-site.xml to do this.
> > >
> > > <property>
> > >   <name>http.accept.language</name>
> > >   <value>en-us,en-gb,en</value>
> > >   <description>Value of the "Accept-Language" request header field.
> > >   This allows selecting non-English language as default one to retrieve.
> > >   It is a useful setting for search engines build for certain national group.
> > >   </description>
> > > </property>
> > >
> > > 2. With respect to the "urlfilter-regex", I have added the following
> > > configurations in nutch-site.xml and regex-urlfilter.txt.
> > >
> > > <property>
> > >   <name>plugin.includes</name>
> > >   <value>protocol-http|*urlfilter-regex*|parse-(tika)|index-(anchor|basic|more|static|replace|links)|indexer-elastic|urlnormalizer-basic|scoring-(opic|similarity)|language-identifier|protocol-httpclient</value>
> > > </property>
> > >
> > > *-^.*wikipedia.*$*
> > >
> > > Thanks,
> > > Yongyao
> > >
> > >
> > > --
> > > Yongyao Jiang
> > > https://www.linkedin.com/in/yongyao-jiang-42516164
> > > Ph.D. Student in Earth Systems and GeoInformation Sciences
> > > NSF Spatiotemporal Innovation Center
> > > George Mason University
> > >
> >
> 
> 
> 
> -- 
> Yongyao Jiang
> https://www.linkedin.com/in/yongyao-jiang-42516164
> Ph.D. Student in Earth Systems and GeoInformation Sciences
> NSF Spatiotemporal Innovation Center
> George Mason University
>
