If you rely on a filter based on TLDs (.in, .com, etc.) you won't get a good result, since the TLD is no guarantee of language. A .com site may host content not only in English but in any other language, and a site under .fr (France) may host pages in Greek, for example.
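
To make the point concrete, here is a minimal Java sketch of what a TLD-based check boils down to. This is not Nutch code; the class name TldCheck and the example.* URLs are made up for illustration. It only ever looks at the host name, so it has no way of knowing what language the page itself is written in.

```java
import java.net.URI;
import java.util.Locale;

final class TldCheck {
    // Returns true when the URL's host ends with the given TLD (e.g. "in").
    // Only the host name is inspected; the page content is never looked at.
    static boolean hasTld(String url, String tld) {
        String host = URI.create(url).getHost();
        if (host == null) {
            return false;
        }
        String suffix = "." + tld.toLowerCase(Locale.ROOT);
        return host.toLowerCase(Locale.ROOT).endsWith(suffix);
    }

    public static void main(String[] args) {
        // Hypothetical URLs, for illustration only.
        System.out.println(hasTld("http://example.in/page", "in"));
        System.out.println(hasTld("http://example.com/hi/", "in"));
    }
}
```

A Hindi page on a .com host would be rejected by this check, and an English page on a .in host would pass, which is exactly the problem.
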
In conf/nutch-site.xml:

<property>
  <name>http.accept.language</name>
  <value>ja-jp, en-us,en-gb,en;q=0.7,*;q=0.3</value>
  <description>Value of the "Accept-Language" request header field.
  This allows selecting non-English language as default one to retrieve.
  It is a useful setting for search engines build for certain national group.
  </description>
</property>

I believe (I'm not sure) this only sets the Accept-Language header that Nutch sends with each request. Whether it has any effect depends on the web server, and on the site author offering and correctly labelling the page in that language, so it's not 100% either.

I start with a seed file of URLs that I know are in the language I want, but as the crawls grow I start to see docs in other languages (maybe I have not configured this correctly).

Personally I would like to reject any document that is not in the intended language, but I haven't gotten to that point. My next step will be to look into the Tika parser supplied with Nutch.

My 2 cents, hope it helps!

-----Original Message-----
From: Talat UYARER [mailto:[email protected]]
Sent: Tuesday, October 15, 2013 5:15 PM
To: [email protected]
Subject: Re: How to Crawl Specific sites

Hi,

In addition to Markus's answer: if you don't want to fetch non-Indian websites again, you can do it by writing some custom code. We actually wrote such code because we had the same need. Normally, if your websites are mixed, like .com or .in, you can't tell a website's language from the URL. We solved this by writing a custom FetchSchedule. We check the website's language in its shouldFetch method; if the language is not allowed, we don't generate the URL again. If you can wait, I will share our code.

Talat

On 15-10-2013 13:36, Markus Jelsma wrote:
> Hi - either by using a language detector that only allows some or all common languages spoken in India or by using a domain URL filter to restrict to the .in domain.
>
> -----Original message-----
>> From: Jayadeep Reddy <[email protected]>
>> Sent: Tuesday 15th October 2013 12:10
>> To: [email protected]
>> Subject: How to Crawl Specific sites
>>
>> How can I index data of only Indian websites
>>
>> --
>> Jayadeep Reddy.S,
>> M.D & C.E.O
>> e Health Access Pvt.Ltd
>> www.ehealthaccess.com
>> Hyderabad-Chennai-Banglore
>> http://www.youtube.com/watch?v=0k5LX8mw6Sk
>>
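
A footnote on the http.accept.language value in the nutch-site.xml snippet: in an Accept-Language header a q-value applies only to the range it is attached to, and a range without one defaults to 1.0. So in that value ja-jp, en-us and en-gb all carry the same top weight, en carries 0.7 and the wildcard 0.3. The following is a small sketch that parses the value with the JDK's Locale.LanguageRange to show the weights. It is not Nutch code, and the class and method names are made up for illustration.

```java
import java.util.List;
import java.util.Locale;

final class AcceptLanguageWeights {
    // Returns the weight (q-value) of one language range in an
    // Accept-Language value, or -1.0 if the range is not listed.
    static double weightOf(String headerValue, String range) {
        List<Locale.LanguageRange> ranges = Locale.LanguageRange.parse(headerValue);
        String wanted = range.toLowerCase(Locale.ROOT);
        for (Locale.LanguageRange r : ranges) {
            if (r.getRange().equals(wanted)) {
                return r.getWeight();
            }
        }
        return -1.0;
    }

    public static void main(String[] args) {
        // The value from the nutch-site.xml snippet.
        String value = "ja-jp, en-us,en-gb,en;q=0.7,*;q=0.3";
        // Print every parsed range with its weight, highest weight first.
        for (Locale.LanguageRange r : Locale.LanguageRange.parse(value)) {
            System.out.println(r.getRange() + " -> " + r.getWeight());
        }
    }
}
```

If the goal is to prefer one language over the others, it probably makes sense to give the other ranges explicit lower q-values rather than leaving them at the default. Even then the header is only a request, and the server is free to ignore it.
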

