RE: rewriting urls that are index

Markus Jelsma Mon, 22 Apr 2013 08:57:27 -0700

Hi,

The 1.x indexer takes a -normalize parameter, and there you can rewrite your 
URLs. Judging from your patterns, the RegexURLNormalizer should be sufficient. 
Make sure you use the config file containing that pattern only when indexing, 
otherwise the rewritten URLs will end up in the CrawlDB and segments. Use 
urlnormalizer.regex.file to specify the file, or pass patterns directly using 
urlnormalizer.regex.rules.
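
To try the pattern out before putting it in a rules file, here is a rough 
sketch in plain Java. It does not use Nutch itself; it only applies a 
pattern/substitution pair with java.util.regex. The class name, the method 
name, and the pattern are my own guesses based on the example URLs in your 
mail, so adjust them to the real URLs.

```java
import java.util.regex.Pattern;

// Hypothetical stand-alone sketch, not part of Nutch. It only shows what a
// single pattern/substitution pair would do to the example URLs.
//
// The same pair would go into the regex normalizer rules file roughly as:
//   <regex>
//     <pattern>^(https?://www\.example\.com)/(article=.*)$</pattern>
//     <substitution>$1/kb#$2</substitution>
//   </regex>
class KbUrlRewriteSketch {

    // Assumed pattern: scheme and host in group 1, the article part in group 2.
    private static final Pattern ARTICLE =
            Pattern.compile("^(https?://www\\.example\\.com)/(article=.*)$");

    // Assumed substitution: insert "kb#" between the host and the article part.
    private static final String SUBSTITUTION = "$1/kb#$2";

    static String rewrite(String url) {
        // URLs that do not match the pattern are returned unchanged.
        return ARTICLE.matcher(url).replaceAll(SUBSTITUTION);
    }

    public static void main(String[] args) {
        System.out.println(rewrite("http://www.example.com/article=xxx"));
        System.out.println(rewrite("http://www.example.com/other"));
    }
}
```

URLs that are already in the /kb#article= form do not match the pattern, so 
running the rule over them again leaves them as they are.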


Cheers,
Markus
 
 
-----Original message-----
> From:Niels Boldt <[email protected]>
> Sent: Mon 22-Apr-2013 15:56
> To: [email protected]
> Subject: rewriting urls that are index
> 
> Hi,
> 
> We are crawling a site using nutch 1.6 and indexing into solr.
> 
> However, we need to rewrite the urls that are indexed in the following way
> 
> For instance, nutch crawls a page http://www.example.com/article=xxx but
> when moving data to the index we would like to use the url
> 
> http://www.example.com/kb#article=xxx
> 
> Instead. So when we get data from solr it will show links to
> http://www.example.com/kb#article=xxx instead of
> http://www.example.com/article=xxx
> 
> Is that possible to do by creating a plugin that extends the UrlNormalizer,
> eg
> 
> http://nutch.apache.org/apidocs-1.4/org/apache/nutch/net/URLNormalizer.html
> 
> Or is it better to add a new indexed property that we use.
> 
> Best Regards
> Niels
>
