Hi Manish,

On Sat, Dec 12, 2015 at 6:22 AM, <[email protected]> wrote:
>
> I am using Nutch 1.10 and I need to index the page locale. I can see there is
> a plugin available for identifying the page language, but I need to index the locale.
>

Well, I have a few answers.

1) Take a look at the index-geoip [0] plugin and its associated properties in nutch-default [1]. This provides a rich metadata model for indexing all sorts of geographical information. The downside is that it is based on the IP address of the web server we connect to, which is not necessarily the web page's locale. That is not ideal, and it gives no guarantee of satisfying your requirements.

2) Have a look at Apache Any23 [2]. Any23 has extraction capabilities which will pick up geo coordinate data if it is structured as markup. You can try out the web service at [3].

3) You can check out the Tika GeoTopicParser [4]. It is a bit bulky right now but may also provide you with interesting results.

hth
Lewis

[0] https://github.com/apache/nutch/tree/trunk/src/plugin/index-geoip
[1] https://github.com/apache/nutch/blob/trunk/conf/nutch-default.xml#L1482-L1510
[2] http://any23.apache.org
[3] http://any23.org
[4] https://wiki.apache.org/tika/GeoTopicParser

