Hello Manish!

1. Not really, except with custom intervention. The protocol-htmlunit plugin won't work either because it follows redirects at all levels, so it would index the same content twice. Unfortunately, it has to follow redirects, because assets such as JS and CSS must be able to follow redirects as well. A simple window.location=bla can easily be detected in a custom parse filter, which can then set a flag to skip indexing. The htmlunit case is not fixed yet, but it should be doable considering the extensibility of htmlunit. It does have a major performance drawback, though, and requires a forward HTTP proxy to cache assets, so that infrequently changing assets are not continuously redownloaded.
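
To illustrate the detection step for the simple window.location case, here is a minimal sketch in plain Java. It is not an actual Nutch plugin: the class name JsRedirectDetector and both method names are made up for this example, and the regular expressions are an assumption about what a "simple" JS redirect looks like. They will miss redirects built from variables and will also flag redirects that only fire on user action, such as an onclick handler.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch: finds simple JavaScript redirects in raw HTML.
final class JsRedirectDetector {

    // Matches assignments such as window.location='/new' or location.href = "/new".
    // The lookbehind rejects identifiers that merely end in "location" (e.g. mylocation).
    private static final Pattern ASSIGNMENT = Pattern.compile(
        "(?<![\\w$.])(?:(?:window|document|self|top)\\.)?location(?:\\.href)?"
        + "\\s*=\\s*[\"']([^\"']+)[\"']");

    // Matches calls such as location.replace("/new") or window.location.assign('/new').
    private static final Pattern CALL = Pattern.compile(
        "(?<![\\w$.])(?:(?:window|document|self|top)\\.)?location\\.(?:replace|assign)"
        + "\\(\\s*[\"']([^\"']+)[\"']\\s*\\)");

    private JsRedirectDetector() {}

    // Returns the first literal redirect target found in the markup, if any.
    static Optional<String> findRedirectTarget(String html) {
        if (html == null) {
            return Optional.empty();
        }
        for (Pattern pattern : new Pattern[] {ASSIGNMENT, CALL}) {
            Matcher matcher = pattern.matcher(html);
            if (matcher.find()) {
                return Optional.of(matcher.group(1));
            }
        }
        return Optional.empty();
    }

    // The "flag" from the text: true means the page looks like a JS redirect stub.
    static boolean shouldSkipIndexing(String html) {
        return findRedirectTarget(html).isPresent();
    }
}
```

In a real setup, a custom parse filter could call something like shouldSkipIndexing on the fetched page content and record the result in the parse metadata, and an indexing filter could then drop the documents that carry the flag. That wiring is left out here because it depends on your plugin code.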
2. Not by default, but maybe with Boilerpipe enabled, although I am not sure. Otherwise it is straightforward to patch parse-tika so that emitting the title tag is configurable. We rely on custom extractors that make this configurable, but on the web anything is possible and will screw things up.

Good night!
Markus

-----Original message-----
> From: Manish Verma <[email protected]>
> Sent: Tuesday 5th July 2016 21:52
> To: [email protected]
> Subject: Nutch Redirect Skip Indexing Orignal Url
>
> Hi,
>
> Nutch 1.12 Url redirect scenario -
> #1 Is there any way to skip original url from getting indexed ? I see when
> page has JS redirects nutch create docs for both original and redirected page.
> #2 I see page title is becoming part of page content, is it configurable to
> exclude title from content ?
>
> Regards,
> Manish Verma
> AML Search

