RE: Nutch Redirect Skip Indexing Orignal Url

Markus Jelsma Tue, 05 Jul 2016 15:03:19 -0700

Hello Manish!

1. Not really, except with custom intervention. The protocol-htmlunit plugin 
won't help either, because it follows redirects at all levels, so it would index 
the same content twice. Unfortunately, it has to follow redirects to work, 
because assets such as JS and CSS may be served through redirects as well. A 
simple window.location=bla can easily be detected in a custom parse filter, 
which can then set a flag to skip indexing. The htmlunit case is not fixed yet, 
but should be doable given the extensibility of htmlunit. It does have a major 
performance drawback, though, and requires a forward HTTP proxy to cache assets, 
so that infrequently changing assets are not continuously redownloaded.
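
To illustrate the window.location detection mentioned above, here is a minimal 
standalone sketch of the matching logic only. It is not actual Nutch code: the 
class name, method name and regular expression are my own assumptions, and it 
only catches plain string assignments, not redirects built up dynamically. In a 
real setup something like this would probably sit inside a custom 
HtmlParseFilter that sets a (hypothetical) flag in the parse metadata, which an 
indexing filter then checks to skip the document.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch: detects simple JS redirects in raw HTML.
public class JsRedirectDetector {

    // Matches assignments such as window.location='...', location.href = "...".
    // Comparisons (==) do not match, because a quote must follow the single '='.
    private static final Pattern JS_REDIRECT = Pattern.compile(
        "\\b(?:window\\.|document\\.)?location(?:\\.href)?\\s*=\\s*[\"']([^\"']+)[\"']");

    // Returns the redirect target of the first match, or empty if none is found.
    public static Optional<String> findRedirectTarget(String html) {
        if (html == null) {
            return Optional.empty();
        }
        Matcher m = JS_REDIRECT.matcher(html);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        String page = "<html><script>window.location='http://example.org/new';</script></html>";
        // A parse filter would set its "skip indexing" flag when a target is present.
        System.out.println(findRedirectTarget(page).orElse("no redirect"));
    }
}
```

A regular expression keeps the sketch short, but it will miss redirects that are 
assembled at runtime, which is exactly where htmlunit would be needed.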


2. Not by default. It may be possible with Boilerpipe enabled, but I am not 
sure. Otherwise it is straightforward to patch parse-tika so that emitting the 
title tag is configurable. We rely on custom extractors that make this 
configurable, but on the web anything is possible and will screw things up.

Good night!
Markus 
 
-----Original message-----
> From:Manish Verma <[email protected]>
> Sent: Tuesday 5th July 2016 21:52
> To: [email protected]
> Subject: Nutch Redirect Skip Indexing Orignal Url
> 
> Hi,
> 
> Nutch 1.12 URL redirect scenario:
> #1 Is there any way to keep the original URL from getting indexed? I see that 
> when a page has JS redirects, Nutch creates docs for both the original and the 
> redirected page.
> #2 I see the page title is becoming part of the page content. Is it 
> configurable to exclude the title from the content?
> 
> Regards,
> Manish Verma
> AML Search
> 
>
