Thanks for replying. My requirement is this: I have some pages where no language or locale info is provided in the html tag or its attributes.
From your earlier reply, I understand that the locale can be calculated from the IP location. But what if that is not the case? I mean, what if the servers are not divided geographically and are managed centrally?

Thanks
Manish Verma
AML Search
+1 669 224 9924

> On Dec 14, 2015, at 10:22 AM, Lewis John Mcgibbney <[email protected]> wrote:
>
> Hi Manish,
>
> On Sat, Dec 12, 2015 at 6:22 AM, <[email protected]> wrote:
>
>> I am using Nutch 1.10. I need to index page locale. I could see there is a
>> plugin available for identifying page language, but I need to index locale.
>
> Well I have a few answers.
> 1) Take a look at the index-geoip [0] plugin and associated properties
> within nutch-default [1]. This will provide you with a rich metadata model
> for indexing all sorts of Geographical information. The downside here
> though is that this is based off of the IP of the Webserver which we obtain
> a WebSocket connection. This is not therefore necessarily the Webpage
> locale. Which is not ideal and which provides no guarantee of satisfying
> your requirements.
> 2) Have a look at Apache Any23 [2]. Any23 has extraction capabilities which
> will pick up Geo coordinate data if it is structured as Markup. You can try
> out the web service at [3]
> 3) You can check out the Tika GeoTopicParser [4]. This is a bit bulky right
> now but may also provide you with interesting results.
>
> hth
> Lewis
>
> [0] https://github.com/apache/nutch/tree/trunk/src/plugin/index-geoip
> [1] https://github.com/apache/nutch/blob/trunk/conf/nutch-default.xml#L1482-L1510
> [2] http://any23.apache.org
> [3] http://any23.org
> [4] https://wiki.apache.org/tika/GeoTopicParser
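To make the requirement at the top of this message concrete, here is a minimal sketch of how one might separate pages that declare a locale in their markup from pages that do not. It is not Nutch code and not part of any plugin mentioned in this thread. The names `LocaleHintParser` and `declared_locale` are hypothetical, and the sketch assumes that the `lang` / `xml:lang` attribute on the html tag and a `Content-Language` meta tag are the only hints worth checking.

```python
from html.parser import HTMLParser
from typing import Optional


class LocaleHintParser(HTMLParser):
    """Hypothetical helper: records locale hints declared in the markup."""

    def __init__(self) -> None:
        super().__init__()
        self.html_lang: Optional[str] = None
        self.meta_lang: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names; values may be None.
        attr = {name: (value or "").strip() for name, value in attrs}
        if tag == "html" and self.html_lang is None:
            # Assumption: lang / xml:lang on the html tag carries the locale.
            self.html_lang = attr.get("lang") or attr.get("xml:lang") or None
        elif tag == "meta" and self.meta_lang is None:
            # Assumption: <meta http-equiv="Content-Language"> is the fallback.
            if attr.get("http-equiv", "").lower() == "content-language":
                self.meta_lang = attr.get("content") or None


def declared_locale(html: str) -> Optional[str]:
    """Return the locale declared in the page, or None if nothing is declared."""
    parser = LocaleHintParser()
    parser.feed(html)
    return parser.html_lang or parser.meta_lang
```

Pages for which `declared_locale` returns `None` are the ones in question here: they would need some other signal, such as one of the options suggested in the quoted reply.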

